About Dan Luu

A blog about programming and the programming industry. Vancouver, BC

The RSS feed URL is: https://danluu.com/atom.xml

How web bloat impacts users with slow devices

2024-03-16 08:00:00

In 2017, we looked at how web bloat affects users with slow connections. Even in the U.S., many users didn't have broadband speeds, making much of the web difficult to use. It's still the case that many users don't have broadband speeds, both inside and outside of the U.S. and that much of the modern web isn't usable for people with slow internet, but the exponential increase in bandwidth (Nielsen suggests this is 50% per year for high-end connections) has outpaced web bloat for typical sites, making this less of a problem than it was in 2017, although it's still a serious problem for people with poor connections.

CPU performance for web apps hasn't scaled nearly as quickly as bandwidth so, while more of the web is becoming accessible to people with low-end connections, more of the web is becoming inaccessible to people with low-end devices even if they have high-end connections. For example, if I try browsing a "modern" Discourse-powered forum on a Tecno Spark 8C, it sometimes crashes the browser. Between crashes, measured responsiveness is significantly worse than browsing a BBS with an 8 MHz 286 and a 1200 baud modem. On my 1Gbps home internet connection, the 2.6 MB compressed payload size "necessary" to load message titles is relatively light. The over-the-wire payload size has "only" increased by 1000x, which is dwarfed by the increase in internet speeds. But the opposite is true when it comes to CPU speeds — for web browsing and forum loading performance, the 8-core (2× 1.6 GHz Cortex-A75 + 6× 1.6 GHz Cortex-A55) CPU can't handle Discourse. The CPU is something like 100000x faster than our 286. Perhaps a 1000000x faster device would be sufficient.

For anyone not familiar with the Tecno Spark 8C: a quick search indicates that, today, a new one can be had for USD 50-60 in Nigeria and perhaps USD 100-110 in India. As a fraction of median household income, that's substantially more than a current generation iPhone in the U.S. today.

By worldwide standards, the Tecno Spark 8C isn't even close to being a low-end device, so we'll also look at performance on an Itel P32, which is a lower-end device (though still far from the lowest-end device people are using today). Additionally, we'll look at performance with an M3 Max MacBook (14-core), an M1 Pro MacBook (8-core), and the M3 Max set to 10x throttling in Chrome dev tools. In order to give these devices every advantage, we'll be on fairly high-speed internet (1Gbps, with a WiFi router that's benchmarked as having lower latency under load than most of its peers). We'll look at some blogging and micro-blogging platforms (this blog, WordPress, Substack, Medium, Ghost, Hugo, Tumblr, Mastodon, Twitter, Threads, Bluesky, Patreon), forum platforms (Discourse, Reddit, Quora, vBulletin, XenForo, phpBB, and myBB), and platforms commonly used by small businesses (Wix, Squarespace, Shopify, and WordPress again).

In the table below, every row represents a website and every non-label column is a metric. After the website name column, we have the compressed size transferred over the wire (wire) and the raw, uncompressed, size (raw). Then we have, for each device, Largest Contentful Paint* (LCP*) and CPU usage on the main thread (CPU). Google's docs explain LCP as

Largest Contentful Paint (LCP) measures when a user perceives that the largest content of a page is visible. The metric value for LCP represents the time duration between the user initiating the page load and the page rendering its primary content

LCP is a common optimization target because it's presented as one of the primary metrics in Google PageSpeed Insights, a "Core Web Vital" metric. There's an asterisk next to LCP as used in this document because LCP as measured by Chrome is about painting a large fraction of the screen, as opposed to the definition above, which is about content. As sites have optimized for LCP, it's not uncommon to have a large paint (update) that's completely useless to the user, with the actual content of the page appearing well after the LCP. In cases where that happens, I've used the timestamp when useful content appears, not the LCP as defined by when a large but useless update occurs. The full details of the tests and why these metrics were chosen are discussed in an appendix.
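
As a reference point for how LCP is reported, here's a minimal sketch (not part of this post's methodology) using the standard PerformanceObserver API; the element and size fields on each entry are what make it possible to tell whether the "largest paint" was real content or a large but useless update:

```typescript
// Minimal shape of an LCP entry (per the Largest Contentful Paint API).
interface LcpEntry extends PerformanceEntry {
  size: number;       // painted area of the candidate element
  element?: Element;  // the DOM node that produced this candidate
}

// The browser may emit several candidates; the last one reported before the
// first user input is the page's LCP.
const lcpObserver = new PerformanceObserver((entryList) => {
  for (const entry of entryList.getEntries() as LcpEntry[]) {
    console.log('LCP candidate', {
      timeMs: Math.round(entry.startTime),
      size: entry.size,
      element: entry.element?.tagName,
    });
  }
});
lcpObserver.observe({ type: 'largest-contentful-paint', buffered: true });
```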

Although CPU time isn't a "Core Web Vital", it's presented here because it's a simple metric that's highly correlated with my and other users' perception of usability on slow devices. See appendix for more detailed discussion on this. One reason CPU time works as a metric is that, if a page has great numbers for all other metrics but uses a ton of CPU time, the page is not going to be usable on a slow device. If it takes 100% CPU for 30 seconds, the page will be completely unusable for 30 seconds, and if it takes 50% CPU for 60 seconds, the page will be barely usable for 60 seconds, etc. Another reason it works is that, relative to commonly used metrics, it's hard to cheat on CPU time and make optimizations that significantly move the number without impacting user experience.
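
There's no in-page API that reports total main-thread CPU time directly (the CPU numbers in this post come from profiling), but as a rough approximation, the Long Tasks API can be used to sum up tasks that block the main thread for more than 50ms; a sketch:

```typescript
// Rough proxy for main-thread busyness: total time spent in "long tasks"
// (tasks over 50ms). This undercounts true CPU time since shorter tasks
// aren't reported, but a page that racks up tens of seconds here will be
// unusable on a slow device.
let longTaskMs = 0;

const longTaskObserver = new PerformanceObserver((entryList) => {
  for (const entry of entryList.getEntries()) {
    longTaskMs += entry.duration;
  }
});
longTaskObserver.observe({ type: 'longtask', buffered: true });

// Log the running total a few seconds after load (e.g. before sending a beacon).
window.addEventListener('load', () => {
  setTimeout(() => {
    console.log('long-task time so far (ms):', Math.round(longTaskMs));
  }, 5000);
});
```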

The color scheme in the table below is that, for sizes, more green = smaller / faster and more red = larger / slower. Extreme values are in black.

| Site | wire | raw | M3 Max LCP* | M3 Max CPU | M1 Pro LCP* | M1 Pro CPU | M3/10 LCP* | M3/10 CPU | Tecno S8C LCP* | Tecno S8C CPU | Itel P32 LCP* | Itel P32 CPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| danluu.com | 6kB | 18kB | 50ms | 20ms | 50ms | 30ms | 0.2s | 0.3s | 0.4s | 0.3s | 0.5s | 0.5s |
| HN | 11kB | 50kB | 0.1s | 30ms | 0.1s | 30ms | 0.3s | 0.3s | 0.5s | 0.5s | 0.7s | 0.6s |
| MyBB | 0.1MB | 0.3MB | 0.3s | 0.1s | 0.3s | 0.1s | 0.6s | 0.6s | 0.8s | 0.8s | 2.1s | 1.9s |
| phpBB | 0.4MB | 0.9MB | 0.3s | 0.1s | 0.4s | 0.1s | 0.7s | 1.1s | 1.7s | 1.5s | 4.1s | 3.9s |
| WordPress | 1.4MB | 1.7MB | 0.2s | 60ms | 0.2s | 80ms | 0.7s | 0.7s | 1s | 1.5s | 1.2s | 2.5s |
| WordPress (old) | 0.3MB | 1.0MB | 80ms | 70ms | 90ms | 90ms | 0.4s | 0.9s | 0.7s | 1.7s | 1.1s | 1.9s |
| XenForo | 0.3MB | 1.0MB | 0.4s | 0.1s | 0.6s | 0.2s | 1.4s | 1.5s | 1.5s | 1.8s | FAIL | FAIL |
| Ghost | 0.7MB | 2.4MB | 0.1s | 0.2s | 0.2s | 0.2s | 1.1s | 2.2s | 1s | 2.4s | 1.1s | 3.5s |
| vBulletin | 1.2MB | 3.4MB | 0.5s | 0.2s | 0.6s | 0.3s | 1.1s | 2.9s | 4.4s | 4.8s | 13s | 16s |
| Squarespace | 1.9MB | 7.1MB | 0.1s | 0.4s | 0.2s | 0.4s | 0.7s | 3.6s | 14s | 5.1s | 16s | 19s |
| Mastodon | 3.8MB | 5.3MB | 0.2s | 0.3s | 0.2s | 0.4s | 1.8s | 4.7s | 2.0s | 7.6s | FAIL | FAIL |
| Tumblr | 3.5MB | 7.1MB | 0.7s | 0.6s | 1.1s | 0.7s | 1.0s | 7.0s | 14s | 7.9s | 8.7s | 8.7s |
| Quora | 0.6MB | 4.9MB | 0.7s | 1.2s | 0.8s | 1.3s | 2.6s | 8.7s | FAIL | FAIL | 19s | 29s |
| Bluesky | 4.8MB | 10MB | 1.0s | 0.4s | 1.0s | 0.5s | 5.1s | 6.0s | 8.1s | 8.3s | FAIL | FAIL |
| Wix | 7.0MB | 21MB | 2.4s | 1.1s | 2.5s | 1.2s | 18s | 11s | 5.6s | 10s | FAIL | FAIL |
| Substack | 1.3MB | 4.3MB | 0.4s | 0.5s | 0.4s | 0.5s | 1.5s | 4.9s | 14s | 14s | FAIL | FAIL |
| Threads | 9.3MB | 13MB | 1.5s | 0.5s | 1.6s | 0.7s | 5.1s | 6.1s | 6.4s | 16s | 28s | 66s |
| Twitter | 4.7MB | 11MB | 2.6s | 0.9s | 2.7s | 1.1s | 5.6s | 6.6s | 12s | 19s | 24s | 43s |
| Shopify | 3.0MB | 5.5MB | 0.4s | 0.2s | 0.4s | 0.3s | 0.7s | 2.3s | 10s | 26s | FAIL | FAIL |
| Discourse | 2.6MB | 10MB | 1.1s | 0.5s | 1.5s | 0.6s | 6.5s | 5.9s | 15s | 26s | FAIL | FAIL |
| Patreon | 4.0MB | 13MB | 0.6s | 1.0s | 1.2s | 1.2s | 1.2s | 14s | 1.7s | 31s | 9.1s | 45s |
| Medium | 1.2MB | 3.3MB | 1.4s | 0.7s | 1.4s | 1s | 2s | 11s | 2.8s | 33s | 3.2s | 63s |
| Reddit | 1.7MB | 5.4MB | 0.9s | 0.7s | 0.9s | 0.9s | 6.2s | 12s | 1.2s | ∞ | FAIL | FAIL |

At a first glance, the table seems about right, in that the sites that feel slow unless you have a super fast device show up as slow in the table (as in, max(LCP*, CPU) is high on lower-end devices). When I polled folks (on Mastodon, Twitter, and Threads) about which platforms they thought would be fastest and slowest on our slow devices, they generally correctly predicted that Wordpress and Ghost would be faster than Substack and Medium, and that Discourse would be much slower than old PHP forums like phpBB, XenForo, and vBulletin. I also pulled Google PageSpeed Insights (PSI) scores for pages (not shown) and the correlation isn't as strong with those numbers because a handful of sites have managed to optimize their PSI scores without actually speeding up their pages for users.

If you've never used a low-end device like this, the general experience is that many sites are unusable on the device and loading anything resource intensive (an app or a huge website) can cause crashes. Doing something too intense in a resource intensive app can also cause crashes. While reviews note that you can run PUBG and other 3D games with decent performance on a Tecno Spark 8C, this doesn't mean that the device is fast enough to read posts on modern text-centric social media platforms or modern text-centric web forums. While 40fps is achievable in PUBG, we can easily see less than 0.4fps when scrolling on these sites.

We can see from the table how many of the sites are unusable if you have a slow device. All of the pages with 10s+ CPU are a fairly bad experience even after the page loads. Scrolling is very jerky, frequently dropping to a few frames per second and sometimes well below. When we tap on any link, the delay is so long that we can't be sure if our tap actually worked. If we tap again, we can get the dreaded situation where the first tap registers, which then causes the second tap to do the wrong thing, but if we wait, we often end up waiting too long because the original tap didn't actually register (or it registered, but not where we thought it did). Although MyBB doesn't serve up a mobile site and is penalized by Google for not having a mobile friendly page, it's actually much more usable on these slow mobiles than all but the fastest sites because scrolling and tapping actually work.

Another thing we can see is how much variance there is in the relative performance on different devices. For example, comparing an M3/10 and a Tecno Spark 8C, for danluu.com and Ghost, an M3/10 gives a halfway decent approximation of the Tecno Spark 8C (although danluu.com loads much too quickly), but the Tecno Spark 8C is about three times slower (CPU) for Medium, Substack, and Twitter, roughly four times slower for Reddit and Discourse, and over an order of magnitude slower for Shopify. For Wix, the CPU approximation is about accurate, but on LCP* the Tecno Spark 8C is more than 3 times faster than the M3/10. It's great that Chrome lets you simulate a slower device from the convenience of your computer, but just enabling Chrome's CPU throttling (or using any combination of out-of-the-box options that are available) gives fairly different results than we get on many real devices. The full reasons for this are beyond the scope of the post; for the purposes of this post, it's sufficient to note that slow pages are often super-linearly slow as devices get slower and that slowness on one page doesn't strongly predict slowness on another page.
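
For anyone who wants to reproduce something like the M3/10 runs, the sketch below shows one way to apply the same 10x CPU throttling Chrome DevTools uses, via Puppeteer and the DevTools protocol (the URL and the metric dumped are just placeholders, and, as noted above, a throttled fast laptop is at best a loose approximation of a real low-end phone):

```typescript
import puppeteer from 'puppeteer';

// Load a page with Chrome's 10x CPU throttling (the same knob as DevTools'
// "CPU: 10x slowdown") and dump a rough main-thread number.
async function loadThrottled(url: string): Promise<void> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const cdp = await page.target().createCDPSession();
  await cdp.send('Emulation.setCPUThrottlingRate', { rate: 10 });

  await page.goto(url, { waitUntil: 'networkidle2' });

  // TaskDuration is the combined duration of main-thread tasks, in seconds.
  const metrics = await page.metrics();
  console.log(url, 'main-thread task time (s):', metrics.TaskDuration?.toFixed(1));

  await browser.close();
}

loadThrottled('https://danluu.com/').catch(console.error);
```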

If we take a site-centric view instead of a device-centric view, another way to look at it is that sites like Discourse, Medium, and Reddit don't use all that much CPU on our fast M3 and M1 computers, but they're among the slowest on our Tecno Spark 8C (Reddit's CPU is shown as ∞ because, no matter how long we wait with no interaction, Reddit uses ~90% CPU). Discourse also sometimes crashed the browser after interacting a bit or just waiting a while. For example, one time, the browser crashed after loading Discourse, scrolling twice, and then leaving the device still for a minute or two. For consistency's sake, this wasn't marked as FAIL in the table since the page did load but, realistically, having a page so resource intensive that the browser crashes is a significantly worse user experience than any of the FAIL cases in the table. When we looked at how web bloat impacts users with slow connections, we found that much of the web was unusable for people with slow connections; the situation for people with slow devices is no different.

Another pattern we can see is how the older sites are, in general, faster than the newer ones, with sites that (visually) look like they haven't been updated in a decade or two tending to be among the fastest. For example, MyBB, the least modernized and oldest-looking forum, is 3.6x / 5x faster (LCP* / CPU) than Discourse on the M3, but on the Tecno Spark 8C, the difference is 19x / 33x and, given the overall scaling, it seems safe to guess that the difference would be even larger on the Itel P32 if Discourse worked on such a cheap device.

Another example is Wordpress (old) vs. newer, trendier blogging platforms like Medium and Substack. Wordpress (old) is 17.5x / 10x faster (LCP* / CPU) than Medium and 5x / 7x faster than Substack on our M3 Max, and 4x / 19x and 20x / 8x faster, respectively, on our Tecno Spark 8C. Ghost is a notable exception to this, being a modern platform (launched a year after Medium) that's competitive with older platforms (modern Wordpress is also arguably an exception, but many folks would probably still consider that to be an old platform). Among forums, NodeBB also seems to be a bit of an exception (see appendix for details).

Sites that use modern techniques like partially loading the page and then dynamically loading the rest of it, such as Discourse, Reddit, and Substack, tend to be less usable than the scores in the table indicate. In principle, you could build such a site in a simple way that works well on cheap devices but, in practice, sites that use dynamic loading tend to be complex enough that they're extremely janky on low-end devices. It's generally difficult or impossible to scroll a predictable distance, which means that users will sometimes accidentally trigger more loading by scrolling too far, causing the page to lock up. Many pages actually remove the parts of the page you scrolled past as you scroll; all such pages are essentially unusable. Other basic web features, like page search, also generally stop working. Pages with this kind of dynamic loading can't rely on the simple and fast ctrl/command+F search and have to build their own search. How well this works varies (this used to work quite well in Google docs, but for the past few months or maybe a year, it takes so long to load that I have to deliberately wait after opening a doc to avoid triggering the browser's useless built-in search; Discourse search has never really worked on slow devices, or even on devices that aren't very fast but aren't particularly slow either).
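
To make the failure mode concrete, here's a bare-bones sketch of the dynamic-loading pattern these sites use (the endpoint and element names are hypothetical); because content only enters the DOM as it scrolls into view, and aggressive implementations drop it again as it scrolls out, the browser's built-in find-in-page can only ever see whatever slice happens to be loaded:

```typescript
// Minimal infinite-scroll pattern: fetch and insert content only when a
// sentinel element nears the viewport. "#feed", "#load-more-sentinel", and
// "/api/posts" are hypothetical names used for illustration.
const feed = document.querySelector('#feed')!;
const sentinel = document.querySelector('#load-more-sentinel')!;
let nextPage = 1;

async function loadNextPage(): Promise<void> {
  const res = await fetch(`/api/posts?page=${nextPage++}`);
  const posts: { id: string; html: string }[] = await res.json();
  for (const post of posts) {
    const el = document.createElement('article');
    el.id = post.id;
    el.innerHTML = post.html;
    feed.appendChild(el);
  }
  // Aggressive implementations also remove articles far above the viewport to
  // cap DOM size; at that point ctrl/cmd+F, scroll position, and "select all"
  // stop behaving the way users expect.
}

new IntersectionObserver(
  (entries) => {
    if (entries.some((e) => e.isIntersecting)) void loadNextPage();
  },
  { rootMargin: '1000px' },
).observe(sentinel);
```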

In principle, these modern pages that burn a ton of CPU when loading could be doing pre-work that means that later interactions on the page are faster and cheaper than on the pages that do less up-front work (this is a common argument in favor of these kinds of pages), but that's not the case for the pages tested, which are slower to load initially, slower on subsequent loads, and slower after they've loaded.

To understand why, in practice, doing all this work up front doesn't generally result in a faster experience later, this exchange between a distinguished engineer at Google and one of the founders of Discourse (and CEO at the time) is illustrative. In the discussion, the founder of Discourse says that you should test mobile sites on laptops with throttled bandwidth but not throttled CPU.

When someone asks the founder of Discourse, "just wondering why you hate them", he responds with a link that cites the Kraken and Octane benchmarks from this Anandtech review, which have the Qualcomm chip at 74% and 85% of the performance of the then-current Apple chip, respectively.

The founder and then-CEO of Discourse considers Qualcomm's mobile performance embarrassing and finds this so offensive that he thinks Qualcomm engineers should all lose their jobs for delivering 74% to 85% of the performance of Apple. Apple has what I consider to be an all-time great performance team. Reasonable people could disagree on that, but one has to at least think of them as a world-class team. So, producing a product with 74% to 85% of the performance of an all-time-great team is considered an embarrassment worthy of losing your job.

There are two attitudes on display here which I see in a lot of software folks. First, that CPU speed is infinite and one shouldn't worry about CPU optimization. And second, that gigantic speedups from hardware should be expected and the only reason hardware engineers wouldn't achieve them is due to spectacular incompetence, so the slow software should be blamed on hardware engineers, not software engineers. Donald Knuth expressed a similar sentiment:

I might as well flame a bit about my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks! I won’t be surprised at all if the whole multithreading idea turns out to be a flop, worse than the "Itanium" approach that was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write. Let me put it this way: During the past 50 years, I’ve written well over a thousand programs, many of which have substantial size. I can’t think of even five of those programs that would have been enhanced noticeably by parallelism or multithreading. Surely, for example, multiple processors are no help to TeX ... I know that important applications for parallelism exist—rendering graphics, breaking codes, scanning images, simulating physical and biological processes, etc. But all these applications require dedicated code and special-purpose techniques, which will need to be changed substantially every few years. Even if I knew enough about such methods to write about them in TAOCP, my time would be largely wasted, because soon there would be little reason for anybody to read those parts ... The machine I use today has dual processors. I get to use them both only when I’m running two independent jobs at the same time; that’s nice, but it happens only a few minutes every week.

In the case of Discourse, a hardware engineer is an embarrassment not deserving of a job if they can't hit 90% of the performance of an all-time-great performance team but, as a software engineer, delivering 3% of the performance of a non-highly-optimized application like MyBB is no problem. In Knuth's case, hardware engineers gave programmers a 100x performance increase every decade for decades with little to no work on the part of programmers. The moment this slowed down and programmers had to adapt to take advantage of new hardware, hardware engineers were "all out of ideas", but learning a few "new" (1970s and 1980s era) ideas to take advantage of current hardware would be a waste of time. And we've previously discussed Alan Kay's claim that hardware engineers are "unsophisticated" and "uneducated" and aren't doing "real engineering" and how we'd get a 1000x speedup if we listened to Alan Kay's "sophisticated" ideas.

It's fairly common for programmers to expect that hardware will solve all their problems, and then, when that doesn't happen, pass the issue onto the user, explaining why the programmer needn't do anything to help the user. A question one might ask is how much performance improvement programmers have given us. There are cases of algorithmic improvements that result in massive speedups but, as we noted above, Discourse, the fastest growing forum software today, seems to have given us an approximately 1000000x slowdown in performance.

Another common attitude on display above is the idea that users who aren't wealthy don't matter. When asked if 100% of users are on iOS, the founder of Discourse says "The influential users who spend money tend to be, I’ll tell you that". We see the same attitude all over comments on Tonsky's JavaScript Bloat post, with people expressing cocktail-party sentiments like "Phone apps are hundreds of megs, why are we obsessing over web apps that are a few megs? Starving children in Africa can download Android apps but not web apps? Come on" and "surely no user of gitlab would be poor enough to have a slow device, let's be serious" (paraphrased for length).

But when we look at the size of apps that are downloaded in Africa, we see that people who aren't on high-end devices use apps like Facebook Lite (a couple megs) and commonly use apps that are a single digit to low double digit number of megabytes. There are multiple reasons app makers care about their app size. One is just the total storage available on the phone; if you watch real users install apps, they often have to delete and uninstall things to put a new app on, so a smaller app is both easier to install and less likely to be uninstalled when the user is looking for more space. Another is that, if you look at data on app size and usage (I don't know of any public data on this; please pass it along if you have something public I can reference), when large apps increase their size and memory usage, they get more crashes, which drives down user retention, growth, and engagement and, conversely, when they optimize their size and memory usage, they get fewer crashes and better user retention, growth, and engagement.

Alex Russell points out that iOS has 7% market share in India (a 1.4B person market) and 6% market share in Latin America (a 600M person market). Although the founder of Discourse says that these aren't "influential users" who matter, these are still real human beings. Alex further points out that, according to Windows telemetry, which covers the vast majority of desktop users, most laptop/desktop users are on low-end machines which are likely slower than a modern iPhone.

On the bit about no programmers having slow devices, I know plenty of people who are using hand-me-down devices that are old and slow. Many of them aren't even really poor; they just don't see why (for example) their kid needs a super fast device, and they don't understand how much of the modern web works poorly on slow devices. After all, the "slow" device can play 3d games and (with the right OS) compile codebases like Linux or Chromium, so why shouldn't the device be able to interact with a site like gitlab?

Contrary to the claim from the founder of Discourse that, within a few years, every Android user would be on some kind of super fast Android device, it's been six years since his comment and it's going to be at least a decade before almost everyone in the world who's using a phone has a high-speed device, and this could easily take two decades or more. If you look up marketshare stats for Discourse, it's extremely successful; it appears to be the fastest growing forum software in the world by a large margin. The impact of having the fastest growing forum software in the world created by an organization whose then-leader was willing to state that he doesn't really care about users who aren't "influential users who spend money", who don't have access to "infinite CPU speed", is that a lot of forums are now inaccessible to people who don't have enough wealth to buy a device with effectively infinite CPU.

If the founder of Discourse were an anomaly, this wouldn't be too much of a problem, but he's just verbalizing the implicit assumptions a lot of programmers have, which is why we see that so many modern websites are unusable if you buy the income-adjusted equivalent of a new, current generation, iPhone in a low-income country.

Thanks to Yossi Kreinen, Fabian Giesen, John O'Nolan, Joseph Scott, Loren McIntyre, Daniel Filan, @acidshill, Alex Russell, Chris Adams, Tobias Marschner, Matt Stuchlik, @[email protected], Justin Blank, Andy Kelley, Julian Lam, Matthew Thomas, avarcat, @[email protected], William Ehlhardt, Philip R. Boulain, and David Turner for comments/corrections/discussion.

Appendix: gaming LCP

We noted above that we used LCP* and not LCP. This is because LCP basically measures when the largest change happens. Before this metric was deliberately gamed in ways that don't benefit the user, it was a great metric, but it has become less representative of the actual user experience as more people have gamed it. In the less blatant cases, people do small optimizations that improve LCP but barely improve or don't improve the actual user experience.

In the more blatant cases, developers will deliberately flash a very large change on the page as soon as possible, generally a loading screen that has no value to the user (actually negative value, because doing this increases the total amount of work done and the total time it takes to load the page), and then they carefully avoid making any later change large enough to be marked as the LCP.

For the same reason that VW didn't publicly discuss how it was gaming its emissions numbers, developers tend to shy away from discussing this kind of LCP optimization in public. An exception to this is Discourse, which publicly announced this kind of LCP optimization, with comments from their devs and the then-CTO (now CEO) noting that their new "Discourse Splash" feature hugely reduced LCP for sites after it was deployed. And when developers ask why their LCP is high, the standard advice from Discourse developers is to keep elements smaller than the "Discourse Splash", so that the LCP timestamp is computed from this useless element that's thrown up to optimize LCP, as opposed to having the timestamp be computed from any actual element that's relevant to the user. Here's a typical, official comment from Discourse:

If your banner is larger than the element we use for the "Introducing Discourse Splash - A visual preloader displayed while site assets load" you gonna have a bad time for LCP.

The official response from Discourse is that you should make sure that your content doesn't trigger the LCP measurement and that, instead, their loading animation's timestamp is what's used to compute LCP.
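
The general shape of the trick, as described above, looks something like the sketch below (an illustration of the technique, not Discourse's actual code; the asset path and function name are hypothetical): paint one large LCP-eligible element as early as possible, then make sure nothing that renders later paints a larger area, so the reported LCP timestamp belongs to the splash rather than to the content users are actually waiting for.

```typescript
// Illustrative only: how a "splash" can capture the LCP timestamp.

// 1. As early as possible, paint a large element that qualifies as an LCP
//    candidate (an image, an element with a CSS background-image, or a big
//    block of text). "/splash-logo.svg" is a hypothetical asset.
const splash = document.createElement('div');
splash.style.cssText =
  'position:fixed;inset:0;z-index:9999;background:#25699f url(/splash-logo.svg) center/50% no-repeat';
document.body.prepend(splash);

// 2. Do the actual, slow work of loading and rendering the app.
declare function loadAppBundlesAndRenderContent(): Promise<void>; // hypothetical

async function boot(): Promise<void> {
  await loadAppBundlesAndRenderContent();
  // 3. Remove the splash. As long as no later element paints a larger area
  //    than the splash did, the splash remains the reported LCP element, so
  //    the metric reflects step 1 rather than when useful content appeared.
  splash.remove();
}
void boot();
```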

The sites with the most extreme ratio of LCP of useful content vs. Chrome's measured LCP were:

Although we haven't discussed the gaming of other metrics, it appears that some websites also game other metrics and "optimize" them even when this has no benefit to users.

Appendix: the selfish argument for optimizing sites

This will depend on the scale of the site as well as its performance, but when I've looked at this data for large companies I've worked for, improving site and app performance is worth a mind-boggling amount of money. It's measurable in A/B tests and it's also among the interventions that have, in long-term holdbacks, a relatively large impact on growth and retention (many interventions test well but don't look as good long term, whereas performance improvements tend to look better long term).

Of course you can see this from the direct numbers, but you can also implicitly see this in a lot of ways when looking at the data. One angle is that (just for example), at Twitter, user-observed p99 latency was about 60s in India as well as a number of African countries (even excluding relatively wealthy ones like Egypt and South Africa), and also about 60s in the United States. Of course, across the entire population, people have faster devices and connections in the United States but, in every country, there are enough users that have slow devices or connections that the limiting factor is really user patience and not the underlying population-level distribution of devices and connections. Even if you don't care about users in Nigeria or India and only care about U.S. ad revenue, improving performance for low-end devices and connections has enough of an impact that we could easily see the impact in global as well as U.S. revenue in A/B tests, especially in long-term holdbacks. And you also see the impact among users who have fast devices since a change that improves the latency for a user with a "low-end" device from 60s to 50s might improve the latency for a user with a high-end device from 5s to 4.5s, which has an impact on revenue, growth, and retention numbers as well.

For a variety of reasons that are beyond the scope of this doc, this kind of boring, quantifiable, growth and revenue driving work has been difficult to get funded at most large companies I've worked for, relative to flashy product work that ends up showing little to no impact in long-term holdbacks.

Appendix: designing for low performance devices

When using slow devices or any device with low bandwidth and/or poor connectivity, the best experiences, by far, are generally the ones that load a lot of content at once into a static page. If the images have proper width and height attributes and alt text, that's very helpful. Progressive images (as in progressive JPEG) aren't particularly helpful.

On a slow device with high bandwidth, any lightweight, static page works well, and lightweight dynamic pages can work well if designed for performance. Heavy, dynamic pages are doomed unless the page weight comes from something that doesn't make the page complex.

With low bandwidth and/or poor connectivity, lightweight pages are fine. With heavy pages, the best experience I've had is when I trigger a page load, go do something else, and then come back when it's done (or at least the HTML and CSS are done). I can then open each link I might want to read in a new tab, and then do something else while I wait for those to load.

A lot of the optimizations that modern websites do, such as partial loading that triggers more loading when you scroll down the page, and the concomitant hijacking of search (because the browser's built-in search is useless if the page isn't fully loaded), break the interaction model that works and make pages very painful to interact with.

Just for example, a number of people have noted that Substack performs poorly for them because it does partial page loads. Here's a video by @acidshill of what it looks like to load a Substack article and then scroll on an iPhone 8, where the post has a fairly fast LCP, but if you want to scroll past the header, you have to wait 6s for the next page to load, and then on scrolling again, you have to wait maybe another 1s to 2s:

As an example of the opposite approach, I tried loading some fairly large plain HTML pages, such as https://danluu.com/diseconomies-scale/ (0.1 MB wire / 0.4 MB raw) and https://danluu.com/threads-faq/ (0.4 MB wire / 1.1 MB raw), and these were still quite usable for me even on slow devices. 1.1 MB seems to be larger than optimal and breaking that into a few different pages would be better on low-end devices, but a single page with 1.1 MB of text works much better than most modern sites on a slow device. While you can get into trouble with HTML pages that are so large that browsers can't really handle them, for pages with a normal amount of content, it generally isn't until you have complex CSS payloads or JS that the pages start causing problems for slow devices. Below, we test pages that are relatively simple, some of which have a fair amount of media (14 MB in one case), and find that these pages work ok, as long as they stay simple.

Chris Adams has also noted that blind users, using screen readers, often report that dynamic loading makes the experience much worse for them. Like dynamic loading to improve performance, while this can be done well, it's often either done badly or bundled with so much other complexity that the result is worse than a simple page.

@Qingcharles noted another accessibility issue — the (prison) parolees he works with are given "lifeline" phones, which are often very low end devices. From a quick search, in 2024, some people will get an iPhone 6 or an iPhone 8, but there are also plenty of devices that are lower end than an Itel P32, let alone a Tecno Spark 8C. They also get plans with highly limited data, and then when they run out, some people "can't fill out any forms for jobs, welfare, or navigate anywhere with Maps".

As for sites that do up-front work and actually give you a decent experience on low-end devices, Andy Kelley pointed out an example that seems to work ok on a slow device (although it would struggle on a very slow connection), the Zig standard library documentation:

I made the controversial decision to have it fetch all the source code up front and then do all the content rendering locally. In theory, this is CPU intensive but in practice... even those old phones have really fast CPUs!

On the Tecno Spark 8C, this uses 4.7s of CPU and, afterwards, is fairly responsive (relative to the device — of course an iPhone responds much more quickly). Taps cause links to load fairly quickly and scrolling also works fine (it's a little jerky, but almost nothing is really smooth on this device). This seems like the kind of thing people are referring to when they say that you can get better performance if you ship a heavy payload, but there aren't many examples of that which actually improve performance on low-end devices.

Appendix: articles on web performance issues

Appendix: empathy for non-rich users

Something I've observed over time, as programming has become more prestigious and more lucrative, is that people have tended to come from wealthier backgrounds and have less exposure to people with different income levels. An example we've discussed before: at a well-known, prestigious startup with a very left-leaning employee base where everyone got rich, in a Slack discussion about the covid stimulus checks, a well-meaning progressive employee said that the checks were pointless because people would just use them to buy stock. This person had, apparently, never talked to any middle-class (let alone poor) person about where their money goes or looked at the data on who owns equity. And that's just looking at American wealth. When we look at world-wide wealth, the general level of understanding is much lower. People seem to really underestimate the dynamic range in wealth and income across the world. From having talked to quite a few people about this, a lot of people seem to have mental buckets for "poor by American standards" (buys stock with stimulus checks) and "poor by worldwide standards" (maybe doesn't even buy stock), but the range of poverty in the world dwarfs the range of poverty in America to an extent that not many wealthy programmers seem to realize.

Just for example, in a discussion about how lucky I was (in terms of financial opportunities) that my parents made it to America, someone mentioned that it's not that big a deal because they had great financial opportunities in Poland. For one thing, with respect to the topic of the discussion, the probability that someone ends up with a high-paying programming job (senior staff eng at a high-paying tech company) or equivalent: I suspect that, when I was born, being born poor in the U.S. gave you better odds than being born fairly well off in Poland, but I could believe the other case as well if presented with data. But if we're comparing Poland vs. the U.S. to Vietnam vs. the U.S.: if I spend 15 seconds looking up rough wealth numbers for these countries in the year I was born, the GDP/capita ratio of U.S. : Poland was ~8:1, whereas it was ~50:1 for Poland : Vietnam. The difference in wealth between Poland and Vietnam was roughly the square of the difference between the U.S. and Poland, so Poland to Vietnam is roughly equivalent to Poland vs. some hypothetical country that's richer than the U.S. by the amount that the U.S. is richer than Poland. These aren't even remotely comparable, but a lot of people seem to have this mental model that there are "rich countries" and "not rich countries", and the "not rich countries" are all roughly in the same bucket. GDP/capita isn't ideal, but it's easier to find than percentile income statistics; the quick search I did also turned up that annual income in Vietnam then was something like $200-$300 a year. Vietnam was also going through the tail end of a famine whose impacts are a bit difficult to determine because statistics here seem to be gamed but, if you believe the mortality rate statistics, the famine caused the total overall mortality rate to jump to double the normal baseline[1].

Of course, at the time, the median person in a low-income country wouldn't have had a computer, let alone internet access. But, today it's fairly common for people in low-income countries to have devices. Many people either don't seem to realize this or don't understand what sorts of devices a lot of these folks use.

Appendix: comments from Fabian Giesen

On the Discourse founder's comments on iOS vs. Android marketshare, Fabian notes

In the US, according to the most recent data I could find (for 2023), iPhones have around 60% marketshare. In the EU, it's around 33%. This has knock-on effects. Not only do iOS users skew towards the wealthier end, they also skew towards the US.

There's some secondary effects from this too. For example, in the US, iMessage is very popular for group chats etc. and infamous for interoperating very poorly with Android devices in a way that makes the experience for Android users very annoying (almost certainly intentionally so).

In the EU, not least because Android is so much more prominent, iMessage is way less popular and anecdotally, even iPhone users among my acquaintances who would probably use iMessage in the US tend to use WhatsApp instead.

Point being, globally speaking, recent iOS + fast Internet is even more skewed towards a particular demographic than many app devs in the US seem to be aware.

And on the comment about mobile app vs. web app sizes, Fabian said:

One more note from experience: apps are downloaded when you install them, and you generally have some opportunity to hold off on updates while you're on a slow or metered connection (or just don't have data at all).

Back when I originally got my US phone, I had no US credit history and thus had to use prepaid plans. I still do because it's fine for what I actually use my phone for most of the time, but it does mean that when I travel to Germany once a year, I don't get data roaming at all. (Also, phone calls in Germany cost me $1.50 apiece, even though T-Mobile is the biggest mobile provider in Germany - though, of course, not T-Mobile US.)

Point being, I do get access to free and fast Wi-Fi at T-Mobile hotspots (e.g. major train stations, airports etc.) and on inter-city trains that have them, but I effectively don't have any data plan when in Germany at all.

This is completely fine with mobile phone apps that work offline and sync their data when they have a connection. But web apps are unusable while I'm not near a public Wi-Fi.

Likewise I'm fine sending an email over a slow metered connection via the Gmail app, but I for sure wouldn't use any web-mail client that needs to download a few MBs worth of zipped JS to do anything on a metered connection.

At least with native app downloads, I can prepare in advance and download them while I'm somewhere with good internet!

Another comment from Fabian (this time paraphrased since this was from a conversation) is that people will often justify something being quantitatively hugely slower because there's a qualitative reason it should be slow. One example he gave was that screens often take a long time to sync their connection, and this is justified because there are operations that have to be done that take time. For a long time, these operations would often take seconds. Recently, a lot of displays sync much more quickly because Nvidia specifies how long this can take for something to be "G-Sync" certified, so display makers actually do this in a reasonable amount of time now. While it's true that there are operations that have to be done that take time, there's no fundamental reason they should take as much time as they often used to. Another example he gave was someone justifying how long it took to read thousands of files because the operation required a lot of syscalls and "syscalls are slow", which is a qualitatively true statement, but if you look at the actual cost of a syscall, in the case under discussion, it was many orders of magnitude away from being costly enough to be a reasonable explanation for why it took so long to read thousands of files.
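
As a concrete version of the second example, a rough Node sketch like this (the directory name is a placeholder) is enough to see that per-file overhead on a warm cache is typically in the microseconds-to-tens-of-microseconds range, so "syscalls are slow" can't, by itself, explain an operation over a few thousand files taking many seconds:

```typescript
import { readdirSync, readFileSync } from 'node:fs';
import { join } from 'node:path';

// Rough benchmark: how long does it actually take to open and read a few
// thousand small files? "./some-dir" is a placeholder path.
const dir = './some-dir';
const names = readdirSync(dir);

const start = process.hrtime.bigint();
let bytes = 0;
for (const name of names) {
  bytes += readFileSync(join(dir, name)).length;
}
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;

console.log(
  `${names.length} files, ${bytes} bytes, ${elapsedMs.toFixed(1)}ms total, ` +
    `${((elapsedMs * 1000) / names.length).toFixed(1)}µs per file`,
);
```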

On this topic, when people point out that a modern website is slow, someone will generally respond with the qualitative defense that the modern website has these great features, which the older website is lacking. And while it's true that (for example) Discourse has features that MyBB doesn't, it's hard to argue that its feature set justifies being 33x slower.

Appendix: experimental details

With the exception of danluu.com and, arguably, HN, for each site, I tried to find the "most default" experience. For example, for WordPress, this meant a demo blog with the current default theme, twentytwentyfour. In some cases, this may not be the most likely thing someone uses today, e.g., for Shopify, I looked at the first theme they present when you browse their themes, but I didn't attempt to find theme data to see what the most commonly used theme is. For this post, I wanted to do all of the data collection and analysis as a short project, something that takes less than a day, so there were a number of shortcuts like this, which will be described below. I don't think it's wrong to use the first-presented Shopify theme since a decent fraction of users will probably use the first-presented theme, but that is, of course, less representative than grabbing whatever the most common theme is and then also testing many different sites that use that theme to see how real-world performance varies when people modify the theme for their own use. If I worked for Shopify or wanted to do competitive analysis on behalf of a competitor, I would do that, but for a one-day project on how large websites impact users on low-end devices, the performance of Shopify demonstrated here seems ok. I actually did the initial work for this around when I ran these polls, back in February; I just didn't have time to really write this stuff up for a month.

For the tests on laptops, I tried to have the laptop at ~60% battery, not plugged in, and the laptop was idle for enough time to return to thermal equilibrium in a room at 20°C, so pages shouldn't be impacted by prior page loads or other prior work that was happening on the machine.

For the mobile tests, the phones were at ~100% charge and plugged in, and also previously at 100% charge, so the phones didn't have any heating effect you can get from rapidly charging. As noted above, these tests were performed with 1Gbps WiFi. No other apps were running, the browser had no other tabs open, and no extra apps were installed on the device, so no additional background tasks should've been running other than whatever users are normally subject to by default. A real user with the same device is going to see worse performance than we measured here in almost every circumstance, unless running Chrome Dev Tools on a phone significantly degrades performance. I noticed that, on the Itel P32, scrolling was somewhat jerkier with Dev Tools running than when running normally but, since this was a one-day project, I didn't attempt to quantify this or determine whether it impacts some sites much more than others. In absolute terms, the overhead can't be all that large because the fastest sites are still fairly fast with Dev Tools running, but if there's some kind of overhead that's super-linear in the amount of work the site does (possibly indirectly, if it causes some kind of resource exhaustion), then that could be a problem in measurements of some sites.

Sizes were all measured on mobile, so in cases where different assets are loaded on mobile vs. desktop, we measured the mobile asset sizes. CPU was measured as CPU time on the main thread (I did also record time on other threads for sites that used them, but didn't use this number; if CPU were a metric people wanted to game, time on other threads would have to be accounted for to prevent sites from offloading as much work as possible to other threads, but this isn't currently an issue, and time on the main thread is more directly correlated with usability than the sum of time across all threads, so the more gaming-resistant metric is less legible with no upside for now).
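
For reference, here's roughly how wire vs. raw sizes can be collected with Puppeteer and the DevTools protocol; this is a sketch of the general approach (the URL is a placeholder), not a claim about the exact tooling used for the table:

```typescript
import puppeteer from 'puppeteer';

// Sum compressed ("wire") and uncompressed ("raw") bytes for a page load.
async function pageWeight(url: string): Promise<void> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Wire bytes: what actually crossed the network, per the DevTools protocol.
  const cdp = await page.target().createCDPSession();
  await cdp.send('Network.enable');
  let wireBytes = 0;
  cdp.on('Network.loadingFinished', (event) => {
    wireBytes += event.encodedDataLength;
  });

  // Raw bytes: decoded response bodies.
  let rawBytes = 0;
  page.on('response', async (response) => {
    try {
      rawBytes += (await response.buffer()).length;
    } catch {
      // Some responses (redirects, certain cached entries) have no body.
    }
  });

  await page.goto(url, { waitUntil: 'networkidle2' });
  console.log(url, `wire: ${wireBytes} bytes, raw: ${rawBytes} bytes`);
  await browser.close();
}

pageWeight('https://danluu.com/').catch(console.error);
```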

For WiFi speeds, speed tests had the following numbers:

One thing to note is that the Itel P32 doesn't really have the ability to use the bandwidth that it nominally has. Looking at the top Google reviews, none of them mention this. The first review reads

Performance-wise, the phone doesn’t lag. It is powered by the latest Android 8.1 (GO Edition) ... we have 8GB+1GB ROM and RAM, to run on a power horse of 1.3GHz quad-core processor for easy multi-tasking ... I’m impressed with the features on the P32, especially because of the price. I would recommend it for those who are always on the move. And for those who take battery life in smartphones has their number one priority, then P32 is your best bet.

The second review reads

Itel mobile is one of the leading Africa distributors ranking 3rd on a continental scale ... the light operating system acted up to our expectations with no sluggish performance on a 1GB RAM device ... fairly fast processing speeds ... the Itel P32 smartphone delivers the best performance beyond its capabilities ... at a whooping UGX 330,000 price tag, the Itel P32 is one of those amazing low-range like smartphones that deserve a mid-range flag for amazing features embedded in a single package.

The third review reads

"Much More Than Just a Budget Entry-Level Smartphone ... Our full review after 2 weeks of usage ... While switching between apps, and browsing through heavy web pages, the performance was optimal. There were few lags when multiple apps were running in the background, while playing games. However, the overall performance is average for maximum phone users, and is best for average users [screenshot of game] Even though the game was skipping some frames, and automatically dropped graphical details it was much faster if no other app was running on the phone.

Notes on sites:

Another kind of testing would be to try to configure pages to look as similar as possible. I'd be interested in seeing the results for that if anyone does it, but that test would be much more time consuming. For one thing, it requires customizing each site. And for another, it requires deciding what sites should look like. If you test something danluu.com-like, every platform that lets you serve up something light straight out of a CDN, like Wordpress and Ghost, should score similarly, with the score being dependent on the CDN and the CDN cache hit rate. Sites like Medium and Substack, which have relatively little customizability, would score pretty much as they do here. Realistically, from looking at what sites exist, most users will create sites that are slower than the "most default" themes for Wordpress and Ghost, although it's plausible that readers of this blog would, on average, do the opposite, so you'd probably want to test a variety of different site styles.

Appendix: this site vs. sites that don't work on slow devices or slow connections

Just as an aside, something I've found funny for a long time is that I get quite a bit of hate mail about the styling on this page (and a similar volume of appreciation mail). By hate mail, I don't mean polite suggestions to change things, I mean the equivalent of road rage, but for web browsing; web rage. I know people who run sites that are complex enough that they're unusable by a significant fraction of people in the world. How come people are so incensed about the styling of this site and, proportionally, basically don't care at all that the web is unusable for so many people?

Another funny thing here is that the people who appreciate the styling generally appreciate that the site doesn't override any kind of default styling, letting you make the width exactly what you want (by setting your window size how you want it), and it also doesn't override any kind of default styling you apply to sites. The people who are really insistent about this want everyone to have some width limit they prefer, some font they prefer, etc., but it's always framed as if it isn't about what they want; it's supposedly for the benefit of people at large, even though accommodating the preferences of the web ragers would directly oppose the preferences of people who prefer (just for example) to be able to adjust the text width by adjusting their window width.

These exchanges, which I've now had tens of times, would usually start with web ragers telling me that "studies show" that narrower text width is objectively better, but on reading every study that exists on the topic that I could find, I didn't find this to be the case. Moreover, on asking for citations, it's clear that people saying this generally hadn't read any studies on this at all and would sometimes hastily send me a study that they did not seem to have read. When I'd point this out, people would then change their argument to how studies can't really describe the issue (odd that they'd cite studies in the first place), although one person cited a book to me (which I read and they, apparently, had not, since it also didn't support their argument), and then move on to how this is what everyone wants, even though that's clearly not the case, both from the comments I've gotten as well as the data I have from when I made the change.

Web ragers who have this line of reasoning generally can't seem to absorb the information that their preferences are not universal and will insist on them regardless of what people say they like, which I find fairly interesting. On the data, when I switched from Octopress styling (at the time, the most popular styling for programming bloggers) to the current styling, I got what appeared to be a causal increase in traffic and engagement, so it appears that not only do people who write me appreciation mail about the styling like the styling, the overall feeling of people who don't write to me appears to be that the site is fine and apparently more appealing than standard programmer blog styling. When I've noted this, people tend to become further invested in the idea that their preferences are universal and that people who think they have other preferences are wrong, and reply with total nonsense.

For me, two questions I'm curious about are: why do people feel the need to fabricate evidence on this topic (referring to studies when they haven't read any, googling for studies and then linking to one that says the opposite of what they claim it says, presumably because they didn't really read it, etc.) in order to claim that there are "objective" reasons their preferences are universal or correct, and why are people so much more incensed by this than by the global accessibility problems caused by typical web design? On the latter, I suspect if you polled people with an abstract survey, they would rate global accessibility as the larger problem but, by revealed preference, both in terms of what people create as well as what irritates them enough to send hate mail, we can see that a site having fully-adjustable line width and not capping line width at their preferred length is a problem important enough to do something about, whereas global accessibility is not. As noted above, people who run sites that aren't accessible due to performance problems generally get little to no hate mail about this. And when I used a default Octopress install, I got zero hate mail about this. Fewer people read my site at the time, but my traffic volume hasn't increased by a huge amount since then and the amount of hate mail I get about my site design has gone from zero to a fair amount, an infinitely higher ratio than the increase in traffic.

To be clear, I certainly wouldn't claim that the design on this site is optimal. I just removed the CSS from the most popular blogging platform for programmers at the time because that CSS seemed objectively bad for people with low-end connections and, as a side effect, got more traffic and engagement overall, not just from locations where people tend to have lower end connections and devices. No doubt a designer who cares about users on low-end connections and devices could do better, but there's something quite odd about both the untruthfulness and the vitriol of comments on this.


  1. This estimate puts backwards-looking life expectancy in the low 60s; that paper also discusses other estimates in the mid 60s and discusses biases in the estimates. [return]

Retrospective Thoughts on BitC

2024-02-24 08:00:00

This is an archive of the BitC retrospective by Jonathan Shapiro, which seems to have disappeared from the internet.

Jonathan S. Shapiro shap at eros-os.org
Fri Mar 23 15:06:41 PDT 2012

By now it will be obvious to everyone that I have stopped work on BitC. An explanation of why seems long overdue.

One answer is that work on Coyotos stopped when I joined Microsoft, and the work that I am focused on now doesn't really require (or seem to benefit from) BitC. As we all know, there is only so much time to go around in our lives. But that alone wouldn't have stopped me entirely.

A second answer is that BitC isn't going to work in its current form. I had hit a short list of issues that required a complete re-design of the language and type system followed by a ground-up new implementation. Experience with the first implementation suggested that this would take quite a while, and it was simply more than I could afford to take on without external support and funding. Programming language work is not easy to fund.

But the third answer may be of greatest interest, which is that I no longer believe that type classes "work" in their current form from the standpoint of language design. That's the only important science lesson here.

In the large, there were four sticking points for the current design:

  1. The compilation model.
  2. The insufficiency of the current type system w.r.t. by-reference and reference types.
  3. The absence of some form of inheritance.
  4. The instance coherence problem.

The first two issues are in my opinion solvable, though the second requires a nearly complete re-implementation of the compiler. The last (instance coherence) does not appear to admit any general solution, and it raises conceptual concerns about the use of type classes for method overloading in my mind. It's sufficiently important that I'm going to deal with the first three topics here and take up the last as a separate note.

Inheritance is something that people on the BitC list might (and sometimes have) argue about strongly. So a few brief words on the subject may be relevant.

Prefacing Comments on Objects, Inheritance, and Purity

BitC was initially designed as an [imperative] functional language because of our focus on software verification. Specification of the typing and semantics of functional languages is an area that has a lot of people working on it. We (as a field) kind of know how to do it, and it was an area where our group at Hopkins didn't know very much when we started. Software verification is a known-hard problem, doing it over an imperative language was already a challenge, and this didn't seem like a good place for a group of novice language researchers to buck the current trends in the field. Better, it seemed, to choose our battles. We knew that there were interactions between inheritance and inference, and it appeared that type classes with clever compilation could achieve much of the same operational results. I therefore decided early not to include inheritance in the language.

To me, as a programmer, the removal of inheritance and objects was a very reluctant decision, because it sacrificed any possibility of transcoding the large body of existing C++ code into a safer language. And as it turns out, you can't really remove the underlying semantic challenges from a successful systems language. A systems language requires some mechanism for existential encapsulation. The mechanism which embodies that encapsulation isn't really the issue; once you introduce that sort of encapsulation, you bring into play most of the verification issues that objects with subtyping bring into play, and once you do that, you might as well gain the benefit of objects. The remaining issue, in essence, is the modeling of the Self type, and for a range of reasons it's fairly essential to have a Self type in a systems language once you introduce encapsulation. So you end up pushed in to an object type system at some point in any case. With the benefit of eight years of hindsight, I can now say that this is perfectly obvious!

I'm strongly of the opinion that multiple inheritance is a mess. The argument pro or con about single inheritance still seems to me to be largely a matter of religion. Inheritance and virtual methods certainly aren't the only way to do encapsulation, and they may or may not be the best primitive mechanism. I have always been more interested in getting a large body of software into a safe, high-performance language than I am in innovating in this area of language design. If transcoding current code is any sort of goal, we need something very similar to inheritance.

The last reason we left objects out of BitC initially was purity. I wanted to preserve a powerful, pure subset language - again to ease verification. The object languages that I knew about at the time were heavily stateful, and I couldn't envision how to do a non-imperative object-oriented language. Actually, I'm still not sure I can see how to do that practically for the kinds of applications that are of interest for BitC. But as our faith in the value of verification declined, my personal willingness to remain restricted by purity for the sake of verification decayed quickly.

The other argument for a pure subset language has to do with advancing concurrency, but as I really started to dig into concurrency support in BitC, I came increasingly to the view that this approach to concurrency isn't a good match for the type of concurrent problems that people are actually trying to solve, and that the needs and uses for non-mutable state in practice are a lot more nuanced than the pure programming approach can address. Pure subprograms clearly play an important role, but they aren't enough.

And I still don't believe in monads. :-)

Compilation Model

One of the objectives for BitC was to obtain acceptable performance under a conventional, static separate compilation scheme. It may be short-sighted on my part, but complex optimizations at run time make me very nervous from the standpoint of robustness and assurance. I understand that bytecode virtual machines today do very aggressive optimizations with considerable success, but there are a number of concerns with this approach.

To be clear, I'm not opposed to continuous compilation. I actually think it's a good idea, and I think that there are some fairly compelling use-cases. I do think that the run-time optimizer should be implemented in a strongly typed, safe language. I also think that it took an awfully long time for the HotSpot technology to stabilize, and that needs to be taken as a cautionary tale. It's also likely that many of the problems/concerns that I have enumerated can be solved - but probably not *soon*. For the applications that are most important to me, the concerns about assurance are primary. So from a language design standpoint, I'm delighted to exploit continuous compilation, but I don't want to design a language that requires continuous compilation in order to achieve reasonable baseline performance.

The optimizer complexity issue, of course, can be raised just as seriously for conventional compilers. You are going to optimize somewhere. But my experience with dynamic translation tells me that it's a lot easier to do (and to reason about) one thing at a time. Once we have a high-confidence optimizer in a safe language, then it may make sense to talk about integrating it into the run-time in a high-confidence system. Until then, separation of concerns should be the watch-word of the day.

Now strictly speaking, it should be said that run-time compilation actually isn't necessary for BitC, or for any other bytecode language. Run-time compilation doesn't become necessary until you combine run-time loading with compiler-abstracted representations (see below) and allow types having abstracted representation to appear in the signatures of run-time loaded libraries. Until then it is possible to maintain a proper phase separation between code generation and execution. Read on - I'll explain some of that below.

In any case, I knew going in that strongly abstracted types would raise concerns on this issue, and I initially adopted the view that link-time template expansion would be enough to produce acceptable baseline performance.

It took several years for me to realize that the template expansion idea wasn't going to produce acceptable baseline performance. The problem lies in the interaction between abstract types, operator overloading, and inlining.

Compiler-Abstracted Representations vs. Optimization

Types have representations. This sometimes seems to make certain members of the PL community a bit uncomfortable. A thing to be held at arm's length. Very much like a zip-lock bag full of dog poo (insert cartoon here). From the perspective of a systems person, I regret to report that where the bits are placed, how big they are, and their assemblage actually does matter. If you happen to be a dog owner, you'll note that the "bits as dog poo" analogy is holding up well here. It seems to be the lot of us systems people to wade daily through the plumbing of computational systems, so perhaps that shouldn't be a surprise. Ahem.

In any case, the PL community set representation issues aside in order to study type issues first. I don't think that pragmatics was forgotten, but I think it's fair to say that representation issues are not a focus in current, mainstream PL research. There is even a school of thought that views representation as a fairly yucky matter that should be handled in the compiler "by magic", and that imperative operations should be handled that way too. For systems code that approach doesn't work, because a lot of the representations and layouts we need to deal with are dictated to us by the hardware.

In any case, types do have representations, and knowledge of those representations is utterly essential for even the simplest compiler optimizations. So we need to be a bit careful not to abstract types *too* successfully, lest we manage to break the compilation model.

In C, the "+" operator is primitive, and the compiler can always select the appropriate opcode directly. Similarly for other "core" arithmetic operations. Now try a thought experiment: suppose we take every use of such core operations in a program and replace each one with a functionally equivalent procedure call to a runtime-implemented intrinsic. You only have to do this for user operations - addition introduced by the compiler to perform things like address arithmetic is always done on concrete types, so those can still be generated efficiently. But even though it is only done for user operations, this would clearly harm the performance of the program quite a lot. You can recover that performance with a run-time optimizer, but it's complicated.

In C++, the "+" operator can be overloaded. But (1) the bindings for primitive types cannot be replaced, (2) we know, statically, what the bindings and representations are for the other types, and (3) we can control, by means of inlining, which of those operations entail a procedure call at run time. I'm not trying to suggest that we want to be forced to control that manually. The key point is that the compiler has enough visibility into the implementation of the operation that it is possible to inline the primitive operators (and many others) at static compile time.

Why is this possible in C++, but not in BitC?

In C++, the instantiation of an abstract type (a template) occurs in an environment where complete knowledge of the representations involved is visible to the compiler. That information may not all be in scope to the programmer, but the compiler can chase across the scopes, find all of the pieces, assemble them together, and understand their shapes. This is what induces the "explicit instantiation" model of C++. It also causes a lot of "internal" type declarations and implementation code to migrate into header files, which tends to constrain the use of templates and increase the number of header file lines processed for each compilation unit - we measured this at one point on a very early (pre-templates) C++ product and found that we processed more than 150 header lines for each "source" line. The ratio has grown since then by at least a factor of ten, and (because of templates) quite likely 20.
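A small sketch of what that looks like in practice (the file and type names are mine): the template's representation and implementation both have to sit in the header, visible to every translation unit that instantiates it.

  // vec3.hpp - everything a client needs to instantiate Vec3<T> and scale<T>()
  // must be visible here; none of it can hide behind a separately compiled module.
  #pragma once

  template <typename T>
  struct Vec3 {
      T x, y, z;                      // representation exposed to every user
  };

  template <typename T>
  Vec3<T> scale(Vec3<T> v, T k) {     // implementation exposed to every user
      return {v.x * k, v.y * k, v.z * k};
  }

Every .cpp file that instantiates scale<float>() re-processes all of this, which is where the header-line blowup comes from.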

It's all rather a pain in the ass, but it's what makes static-compile-time template expansion possible. From the compiler perspective, the types involved (and more importantly, the representations) aren't abstracted at all. In BitC, both of these things are abstracted at static compile time. It isn't until link time that all of the representations are in hand.

Now as I said above, we can imagine extending the linkage model to deal with this. All of that header file information is supplied to deal with *representation* issues, not type checking. Representation, in the end, comes down to sizes, alignments, and offsets. Even if we don't know the concrete values, we do know that all of those are compile-time constants, and that the results we need to compute at compile time are entirely formed by sums and multiples of these constants. We could imagine dealing with these as opaque constants at static compile time, and filling in the blanks at link time. Which is more or less what I had in mind by link-time template expansion. Conceptually: leave all the offsets and sizes "blank", and rely on the linker to fill them in, much in the way that it handles relocation.

The problem with this approach is that it removes key information that is needed for optimization and registerization, and it doesn't support inlining. In BitC, we can and do extend this kind of instantiation all the way down to the primitive operators! And perhaps more importantly, to primitive accessors and mutators. The reason is that we want to be able to write expressions like "a + b" and say "that expression is well-typed provided there is an appropriate resolution for +:('a,'a)->'a". Which is a fine way to type the operation, but it leaves the representation of 'a fully abstracted. Which means that we cannot see when they are primitive types. Which means that we are exactly (or all too often, in any case) left in the position of generating all user-originated "+" operations as procedure calls. Now surprisingly, that's actually not the end of the world. We can imagine inventing some form of "high-level assembler" that our static code generator knows how to translate into machine code. If the static code generator does this, the run-time loader can be handed responsibility for emitting procedure calls, and can substitute intrinsic calls at appropriate points. Which would cause us to lose code sharing, but that might be tolerable on non-embedded targets.

Unfortunately, this kind of high-level assembler has some fairly nasty implications for optimization: First, we no longer have any idea what the *cost* of the "+" operator is for optimization purposes. We don't know how many cycles that particular use of + will take, but more importantly, we don't know how many bytes of code it will emit. And without that information there is a very long list of optimization decisions that we can no longer make at static compile time. Second, we no longer have enough information at static code generation time to perform a long list of basic register and storage optimizations, because we don't know which procedure calls are actually going to use registers.

That creaking and groaning noise that you are hearing is the run-time code generator gaining weight and losing reliability as it grows. While the impact of this mechanism actually wouldn't be as bad as I am sketching - because a lot of user types aren't abstract - the complexity of the mechanism really is as bad as I am proposing. In effect we end up deferring code generation and optimization to link time. That's an idea that goes back (at least) to David Wall's work on link time register optimization in the mid-1980s. It's been explored in many variants since then. It's a compelling idea, but it has pros and cons.

What is going on here is that types in BitC are too successfully abstracted for static compilation. The result is a rather large bag of poo, so perhaps the PL people are on to something. :-)

Two Solutions

The design point that you don't want to cross here is dynamic loading where the loaded interface carries a type with an abstracted representation. At that point you are effectively committing yourself to run-time code generation, though I do have some ideas on how to mitigate that.

Conclusion Concerning Compilation Model

If static, separate compilation is a requirement, it becomes necessary for the compiler to see into the source code across module boundaries whenever an abstract type is used. That is: any procedure having abstract type must have an exposed source-level implementation.

The practical alternative is a high-level intermediate form coupled with install-time or run-time code generation. That is certainly feasible, but it's more than I felt I could undertake.

That's all manageable and doable. Unfortunately, it isn't the path we had taken, so it basically meant starting over.

Insufficiency of the Type System

At a certain point we had enough of BitC working to start building library code. It may not surprise you that the first thing we set out to do in the library was IO. We found that we couldn't handle typed input within the type system. Why not?

Even if you are prepared to do dynamic allocation within the IO library, there is a level of abstraction at which you need to implement an operation that amounts to "inputStream.read(someObject: ByRef mutable 'a)". There are a couple of variations on this, but the point is that you want the ability at some point to move the incoming bytes into previously allocated storage. So far so good.
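A rough C++ rendering of that shape (the class and method names here are illustrative, not BitC's actual library): the caller owns the storage, and read() fills it in place through a reference parameter.

  #include <cstddef>
  #include <cstdint>
  #include <cstring>
  #include <utility>
  #include <vector>

  class InputStream {
  public:
      explicit InputStream(std::vector<uint8_t> bytes) : buf_(std::move(bytes)) {}

      // Read one T's worth of bytes into storage the caller already allocated.
      // 'out' plays the role of the "ByRef mutable 'a" parameter.
      template <typename T>
      bool read(T& out) {
          if (pos_ + sizeof(T) > buf_.size()) return false;
          std::memcpy(&out, buf_.data() + pos_, sizeof(T));
          pos_ += sizeof(T);
          return true;
      }

  private:
      std::vector<uint8_t> buf_;
      std::size_t pos_ = 0;
  };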

Unfortunately, in an effort to limit creeping featurism in the type system, I had declared (unwisely, as it turned out) that the only place we needed to deal with ByRef types was at parameters. Swaroop took this statement a bit more literally than I intended. He noticed that if this is really the only place where ByRef needs to be handled, then you can internally treat "ByRef 'a" as 'a, merely keeping a marker on the parameter's identifier record to indicate that an extra dereference is required at code generation time. Which is actually quite clever, except that it doesn't extend well to signature matching between type classes and their instances. Since the argument type of read is ByRef 'a, InputStream is exactly such a type class.

So now we were faced with a couple of issues. The first was that we needed to make ByRef 'a a first-class type within the compiler so that we could unify it, and the second was that we needed to deal with the implicit coercion issues that this would entail. That is: conversion back and forth between ByRef 'a and 'a at copy boundaries. The coercion part wasn't so bad; ByRef is never inferred, and the type coercions associated with ByRef happen in exactly the same places that const/mutable coercions happen. We already had a cleanly isolated place in the type checker to deal with that.

But even if ByRef isn't inferred, it can propagate through the code by unification. And that causes safety violations! The fact that ByRef was syntactically restricted to appear only at parameters had the (intentional) consequence of ensuring that safety restrictions associated with the lifespan of references into the stack were honored - that was why I had originally imposed the restriction that ByRef could appear only at parameters. Once the ByRef type can unify, the syntactic restriction no longer guarantees the enforcement of the lifespan restriction. To see why, consider what happens in:

  define byrefID(x:ByRef 'a) { return x; }

Something that is supposed to be a downward-only reference ends up getting returned up the stack. Swaroop's solution was clever, in part, because it silently prevented this propagation problem. In some sense, his implementation doesn't really treat ByRef as a type, so it can't propagate. But *because* he didn't treat it as a type, we also couldn't do the necessary matching check between instances and type classes.
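A rough C++ analogue of the hazard (the BitC types don't map exactly, but the lifetime problem is the same): an identity function lets a reference to short-lived storage escape back up the stack.

  const int& byref_id(const int& x) { return x; }

  int main() {
      const int& r = byref_id(42);  // the temporary holding 42 dies at the end
                                    // of this full expression; lifetime extension
                                    // does not reach through the function call
      return r;                     // undefined behavior: r is dangling
  }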

It turns out that being able to do this is useful. The essential requirement of an abstract mutable "property" (in the C# sense) is that we have the ability within the language to construct a function that returns the location of the thing to be mutated. That location will often be on the stack, so returning the location is exactly like the example above. The "ByRef only at parameters" restriction is actually very conservative, and we knew that it was preventing certain kinds of things that we eventually wanted to do. We had a vague notion that we would come back and fix that at a later time by introducing region types.

As it turned out, "later" had to be "now", because region types are the right way to re-instate lifetime safety when ByRef types become first class. But adding region types presented two problems (which is why we had hoped to defer them):

Region polymorphism with region subtyping had certainly been done before, but we were looking at subtyping in another case too (below). That was pushing us toward a kinding system and a different type system.

So to fix the ByRef problem, we very nearly needed to re-design both the type system and the compiler from scratch. Given the accumulation of cruft in the compiler, that might have been a good thing in any case, but Swaroop was now full-time at Microsoft, and I didn't have the time or the resources to tackle this by myself.

Conclusion Concerning the Type System

In retrospect, it's hard to imagine a strongly typed imperative language that doesn't type locations in a first-class way. If the language simultaneously supports explicit unboxing, it is effectively forced to deal with location lifespan and escape issues, which makes memory region typing of some form almost unavoidable.

For this reason alone, even if for no other, the type system of an imperative language with unboxing must incorporate some form of subtyping. To ensure termination, this places some constraints on the use of type inference. On the bright side, once you introduce subtyping you are able to do quite a number of useful things in the language that are hard to do without it.

Inheritance and Encapsulation

Our first run-in with inheritance actually showed up in the compiler itself. In spite of our best efforts, the C++ implementation of the BitC compiler had not entirely avoided inheritance, so it didn't have a direct translation into BitC. And even if we changed the code of the compiler, there are a large number of third-party libraries that we would like to be able to transcode. A good many of those rely on [single] inheritance. Without having at least some form of interface (type) inheritance, we can't really even do a good job interfacing to those libraries as foreign objects.

The compiler aside, we also needed a mechanism for encapsulation. I had been playing with "capsules", but it soon became clear that capsules were really a degenerate form of subclassing, and that trying to duck that issue wasn't going to get me anywhere.

I could nearly imagine getting what I needed by adding "ThisType" and inherited interfaces. But the combination of those two features introduces subtyping. In fact, the combination is equivalent (from a type system perspective) to single-inheritance subclassing.

And the more I stared at interfaces, the more I started to ask myself why an interface wasn't just a type class. That brought me up against the instance coherence problem, which was already making my head hurt, from a new direction. It also brought me to the realization that interfaces work, in part, because they are always parameterized over a single type (the ThisType) - once you know that one, the bindings for all of the others are determined by type constructors or by explicit specification.

And introducing SelfType was an even bigger issue than introducing subtypes. It means moving out of System F<: entirely, and into the object type system of Cardelli et al. That wasn't just a matter of re-implementing the type checker to support a variant of the type system we already had. It meant re-formalizing the type system entirely, and learning how to think in a different model.

Doable, but not within the framework or the compiler that we had built. At this point, I decided that I needed to start over. We had learned a lot from the various parts of the BitC effort, but sometimes you have to take a step back before you can take more steps forward.

Instance Coherence and Operator Overloading

BitC largely borrows its type classes from Haskell. Type classes aren't just a basis for type qualifiers; they provide the mechanism for *ad hoc* polymorphism. A feature which, language purists notwithstanding, real languages actually do need.

The problem is that there can be multiple type class instances for a given type class at a given type. So it is possible to end up with a function like:

define f(x : 'x) {
  ...
  a:int32 + b  // typing fully resolved at static compile time
  return x + x  // typing not resolvable until instantiation
}

Problem: we don't know which instance of "+" to use when 'x instantiates to int32. In order for "+" to be meaningful in a+b, we need a static-compile-time resolution for +:(int32, int32)->int32. And we get that from Arith(int32). So far so good. But if 'x is instantiated to int32, we will get a type class instance supplied by the caller. The problem is that there is no way to guarantee that this is the same instance of Arith(int32) that we saw before.

The solution in Haskell is to impose the ad hoc rule that you can only instantiate a type class once for each unique type tuple in a given application. This is similar to what is done in C++: you can only have one overload of a given global operator at a particular type. If there is more than one overload at that type, you get a link-time failure. This restriction is tolerable in C++ largely because operator overloading is so limited:

  1. The set of overloadable operators is small and non-extensible.
  2. Most of them can be handled satisfactorily as methods, which makes their resolution unambiguous.
  3. Most of the ones that can't be handled as methods are arithmetic operations, and there are practical limits to how much people want to extend those.
  4. The remaining highly overloaded global operators are associated with I/O. These could be methods in a suitably polymorphic language.
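A minimal sketch of the C++ restriction mentioned above, shown as one listing with the file boundaries marked in comments (the file and type names are invented): two non-inline definitions of the same global operator at the same type violate the one-definition rule, which typically surfaces as a duplicate-symbol error at link time.

  // money.hpp
  #pragma once
  struct Money { long long cents; };
  Money operator+(Money a, Money b);   // single shared declaration

  // vendor_a.cpp (includes money.hpp)
  Money operator+(Money a, Money b) { return Money{a.cents + b.cents}; }

  // vendor_b.cpp (includes money.hpp) - a second definition of the same
  // overload at the same type. Linking vendor_a.o with vendor_b.o fails with
  // a duplicate-symbol error, which is what keeps the meaning of "+" on Money
  // coherent program-wide.
  Money operator+(Money a, Money b) { return Money{b.cents + a.cents}; }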

In languages (like BitC) that enable richer use of operator overloading, it seems unlikely that these properties would suffice.

But in Haskell and BitC, overloading is extended to type properties as well. For example, there is a type class "Ord 'a", which states whether a type 'a admits an ordering. Problem: most types that admit ordering admit more than one! The fact that we know an ordering exists really isn't enough to tell us which ordering to use. And we can't introduce two orderings for 'a in Haskell or BitC without creating an instance coherence problem. And in the end, the instance coherence problem exists because the language design performs method resolution in what amounts to a non-scoped way.
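A small C++ illustration of the "more than one ordering" point: std::string admits at least a lexicographic, a reversed, and a by-length ordering, and the container has to be told which one to use, because knowing that an ordering exists doesn't pick one.

  #include <functional>
  #include <set>
  #include <string>

  int main() {
      // Three perfectly reasonable orderings for the same element type.
      std::set<std::string> lexicographic = {"pear", "fig", "apple"};

      std::set<std::string, std::greater<std::string>> reversed = {"pear", "fig", "apple"};

      auto by_length = [](const std::string& a, const std::string& b) {
          return a.size() != b.size() ? a.size() < b.size() : a < b;
      };
      std::set<std::string, decltype(by_length)> shortest_first(by_length);
      shortest_first.insert({"pear", "fig", "apple"});
      return 0;
  }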

But if nothing else, you can hopefully see that the heavier use of overloading in BitC and Haskell places much higher pressure on the "single instance" rule. Enough so, in my opinion, to make that rule untenable. And coming from the capability world, I have a strong allergy to things that smell like ambient authority.

Now we can get past this issue, up to a point, by imposing an arbitrary restriction on where (which compilation unit) an instance can legally be defined. But as with the "excessively abstract types" issue, we seemed to keep tripping on type class issues. There are other problems as well when multi-variable type classes get into the picture.

At the end of the day, type classes just don't seem to work out very well as a mechanism for overload resolution without some other form of support.

A second problem with type classes is that you can't resolve operators at static compile time. And if instances are explicitly named, references to instances have a way of turning into first-class values. At that point the operator reference can no longer be statically resolved at all, and we have effectively re-invented operator methods!
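To see why, here's a minimal C++ sketch (names invented) of an instance reduced to a first-class value: once "+" is fetched out of a value like this, every use is an indirect call, i.e., we are back to operator methods rather than static resolution.

  // An "instance" reduced to a record of operations, passed around at run time.
  template <typename T>
  struct ArithInstance {
      T (*add)(T, T);
      T (*mul)(T, T);
  };

  template <typename T>
  T sum_of_squares(const ArithInstance<T>& arith, T a, T b) {
      // Every operator use is an indirect call through the instance value,
      // so nothing here can be resolved (or inlined) statically.
      return arith.add(arith.mul(a, a), arith.mul(b, b));
  }

  int add_int(int a, int b) { return a + b; }
  int mul_int(int a, int b) { return a * b; }

  int main() {
      // Nothing stops a caller from passing a *different* instance for int,
      // which is exactly the coherence worry discussed earlier.
      ArithInstance<int> int_arith{add_int, mul_int};
      return sum_of_squares(int_arith, 3, 4) == 25 ? 0 : 1;
  }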

Conclusion about Type Classes and Overloading

The type class notion (more precisely: qualified types) is seductive, but absent a reasonable approach for instance coherence and lexical resolution it provides an unsatisfactory basis for operator overloading. There is a disturbingly close relationship between type class instances and object instances that needs further exploration by the PL community. The important distinction may be pragmatic rather than conceptual: type class instances are compile-time constants while object instances are run-time values. This has no major consequences for typing, but it leads to significant differences w.r.t. naming, binding, and [human] conceptualization.

There are unresolved formal issues that remain with multi-parameter type classes. Many of these appear to have natural practical solutions in a polymorphic object type system, but concerns of implementation motivate kinding distinctions between boxed and unboxed types that are fairly unsatisfactory.

Wrapping Up

The current outcome is extremely frustrating. While the blind spots here were real, we were driven by the requirements of the academic research community to spend nearly three years finding a way to do complete inference over mutability. That was an enormous effort, and it delayed our recognition that we were sitting on the wrong kind of underlying type system entirely. While I continue to think that there is some value in mutability inference, I think it's a shame that a fairly insignificant wart in the original inference mechanism managed to prevent larger-scale success in the overall project for what amount to political reasons. If not for that distraction, I think we would probably have learned enough about the I/O and the instance coherency issues to have moved to a different type system while we still had a group to do it with, and we would have a working and useful language today.

The distractions of academia aside, it is fair to ask why we weren't building small "concept test" programs as a sanity check of our design. There are a number of answers, none very satisfactory.

I think we did make some interesting contributions. We now know how to do (that is: to implement) polymorphism over unboxed types with significant code sharing, and we understand how to deal with inferred mutability. Both of those are going to be very useful down the road. We have also learned a great deal about advanced type systems.

In any case, BitC in its current form clearly needs to be set aside and re-worked. I have a fairly clear notion about how I would approach continuing this work, but that's going to have to wait until someone is willing to pay for all this.

Diseconomies of scale in fraud, spam, support, and moderation

2024-02-18 08:00:00

If I ask myself a question like "I'd like to buy an SD card; who do I trust to sell me a real SD card and not some fake, Amazon or my local Best Buy?", of course the answer is that I trust my local Best Buy1 more than Amazon, which is notorious for selling counterfeit SD cards. And if I ask who I trust more, Best Buy or my local reputable electronics shop (Memory Express, B&H Photo, etc.), I trust my local reputable electronics shop more. Not only are they less likely to sell me a counterfeit than Best Buy, in the event that they do sell me a counterfeit, the service is likely to be better.

Similarly, let's say I ask myself a question like, "on which platform do I get a higher rate of scams, spam, fraudulent content, etc., [smaller platform] or [larger platform]"? Generally the answer is [larger platform]. Of course, there are more total small platforms out there and they're higher variance, so I could deliberately use a smaller platform that's worse, but if I'm choosing good options rather than bad ones in every size class, the smaller platform is generally better. For example, with Signal vs. WhatsApp, I've literally never received a spam Signal message, whereas I get spam WhatsApp messages somewhat regularly. Or if I compare places I might read tech content, tiny forums no one's heard of vs. lobste.rs, lobste.rs has a very slightly higher rate (rate as in fraction of messages I see, not absolute message volume) of bad content, because it's zero on the private forums and very low but non-zero on lobste.rs. And then if I compare lobste.rs to a somewhat larger platform, like Hacker News or mastodon.social, those have (again very slightly) higher rates of scam/spam/fraudulent content. And then if I compare that to mid-sized social media platforms, like reddit, reddit has a significantly higher and noticeable rate of bad content. And then if I compare reddit to the huge platforms like YouTube, Facebook, and Google search results, these larger platforms have an even higher rate of scams/spam/fraudulent content. And, as with the SD card example, the odds of getting decent support go down as the platform size goes up as well. In the event of an incorrect suspension or ban from the platform, the odds of an account getting reinstated get worse as the platform gets larger.

I don't think it's controversial to say that in general, a lot of things get worse as platforms get bigger. For example, when I ran a Twitter poll to see what people I'm loosely connected to think, only 2.6% thought that huge company platforms have the best moderation and spam/fraud filtering. For reference, in one poll, 9% of Americans said that vaccines implant a microchip and 12% said the moon landing was fake. These are different populations, but it seems random Americans are more likely to say that the moon landing was faked than tech people are to say that the largest companies have the best anti-fraud/anti-spam/moderation.

However, over the past five years, I've noticed an increasingly large number of people make the opposite claim, that only large companies can do decent moderation, spam filtering, fraud (and counterfeit) detection, etc. We looked at one example of this when we examined search results, where a Google engineer said

Somebody tried argue that if the search space were more competitive, with lots of little providers instead of like three big ones, then somehow it would be *more* resistant to ML-based SEO abuse.

And... look, if *google* can't currently keep up with it, how will Little Mr. 5% Market Share do it?

And a thought leader responded

like 95% of the time, when someone claims that some small, independent company can do something hard better than the market leader can, it’s just cope. economies of scale work pretty well!

But when we looked at the actual results, it turned out that, of the search engines we looked at, Mr 0.0001% Market Share was the most resistant to SEO abuse (and fairly good), Mr 0.001% was a bit resistant to SEO abuse, and Google and Bing were just flooded with SEO abuse, frequently funneling people directly to various kinds of scams. Something similar happens with email, where I commonly hear that it's impossible to manage your own email due to the spam burden, but people do it all the time and often have similar or better results than Gmail, with the main problem being interacting with big company mail servers which incorrectly ban their little email server.

I started seeing a lot of comments claiming that you need scale to do moderation, anti-spam, anti-fraud, etc., around the time Zuckerberg, in response to Elizabeth Warren calling for the breakup of big tech companies, claimed that breaking up tech companies would make content moderation issues substantially worse, saying:

It’s just that breaking up these companies, whether it’s Facebook or Google or Amazon, is not actually going to solve the issues,” Zuckerberg said “And, you know, it doesn’t make election interference less likely. It makes it more likely because now the companies can’t coordinate and work together. It doesn’t make any of the hate speech or issues like that less likely. It makes it more likely because now ... all the processes that we’re putting in place and investing in, now we’re more fragmented

It’s why Twitter can’t do as good of a job as we can. I mean, they face, qualitatively, the same types of issues. But they can’t put in the investment. Our investment on safety is bigger than the whole revenue of their company. [laughter] And yeah, we’re operating on a bigger scale, but it’s not like they face qualitatively different questions. They have all the same types of issues that we do."

The argument is that you need a lot of resources to do good moderation, and that smaller, Twitter-sized companies (worth ~$30B at the time) can't marshal the necessary resources to do good moderation. I found this statement quite funny at the time because, pre-Twitter-acquisition, I saw a much higher rate of obvious scam content on Facebook than on Twitter. For example, when I clicked through Facebook ads during holiday shopping season, most were scams and, while Twitter had its share of scam ads, it wasn't really in the same league as Facebook. And it's not just me: Arturo Bejar, who designed an early version of Facebook's reporting system and headed up some major trust and safety efforts, noticed something similar (see footnote for details)2.

Zuckerberg seems to like the line of reasoning mentioned above, though, as he's made similar arguments elsewhere, such as here, in a statement the same year that Meta's internal docs made the case that they were exposing 100k minors a day to sexual abuse imagery:

To some degree when I was getting started in my dorm room, we obviously couldn’t have had 10,000 people or 40,000 people doing content moderation then and the AI capacity at that point just didn’t exist to go proactively find a lot of harmful content. At some point along the way, it started to become possible to do more of that as we became a bigger business

The rhetorical sleight of hand here is the assumption that Facebook needed 10k or 40k people doing content moderation when Facebook was getting started in Zuckerberg's dorm room. Services that are larger than dorm-room-Facebook can and do have better moderation than Facebook today with a single moderator, often one who works part time. But as people talk more about pursuing real antitrust action against big tech companies, big tech founders and execs have ramped up the anti-antitrust rhetoric, making claims about all sorts of disasters that will befall humanity if the biggest companies are broken up into the size of the biggest tech companies of 2015 or 2010. This kind of reasoning seems to be catching on a bit, as I've seen more and more big company employees repeat very similar arguments. We've come a long way since the 1979 IBM training manual which read

A COMPUTER CAN NEVER BE HELD ACCOUNTABLE

THEREFORE A COMPUTER MUST NEVER MAKE A MANAGEMENT DECISION

The argument now is that, for many critical decisions, only computers can make most of the decisions, and the lack of accountability seems to ultimately be a feature, not a bug.

But unfortunately for Zuckerberg's argument3, there are at least three major issues in play here where diseconomies of scale dominate. One is that, given material that nearly everyone can agree is bad (such as bitcoin scams, spam for fake pharmaceutical products, fake weather forecasts, or adults sending photos of their genitals to children), large platforms do worse than small ones. The second is that, for the user, errors are much more costly and less fixable as companies get bigger because support generally becomes worse. The third is that, as platforms scale up, a larger fraction of users will strongly disagree about what should be allowed on the platform.

With respect to the first, while it's true that big companies have more resources, the cocktail party idea that they'll have the best moderation because they have the most resources is countered by the equally simplistic idea that they'll have the worst moderation because they're the juiciest targets, or because they'll have the worst fragmentation due to the standard diseconomies of scale that occur when you scale up organizations and problem domains. Whether having more resources or these other factors dominate is too complex to resolve theoretically, but we can observe the result empirically. At least at the level of resources that big companies choose to devote to moderation, spam, etc., having the larger target and the other problems associated with scale dominate.

While it's true that these companies are wildly profitable and could devote enough resources to significantly reduce this problem, they have chosen not to do this. For example, in the year before I wrote this sentence, Meta's profit before tax (through December 2023) was $47B. If Meta had a version of the internal vision statement of a power company a friend of mine worked for ("Reliable energy, at low cost, for generations.") and operated like that power company did, trying to create a good experience for the user instead of maximizing profit plus creating the metaverse, they could've spent the $50B they spent on the metaverse on moderation platforms and technology and then spent $30k/yr (which would result in a very good income in most countries where moderators are hired today, allowing them to have their pick of who to hire) on 1.6 million additional full-time staffers for things like escalations and support, on the order of one additional moderator or support staffer per few thousand users (and of course diseconomies of scale apply to managing this many people). I'm not saying that Meta or Google should do this, just that whenever someone at a big tech company says something like "these systems have to be fully automated because no one could afford to operate manual systems at our scale", what's really being said is more along the lines of "we would not be able to generate as many billions a year in profit if we hired enough competent people to manually review cases our system should flag as ambiguous, so we settle for what we can get without compromising profits".4 One can defend that choice, but it is a choice.

And likewise for claims about advantages of economies of scale. There are areas where economies of scale legitimately make the experience better for users. For example, when we looked at why it's so hard to buy things that work well, we noted that Amazon's economies of scale have enabled them to build out their own package delivery service that is, while flawed, still more reliable than is otherwise available (and this has only improved since they added the ability for users to rate each delivery, which no other major package delivery service has). Similarly, Apple's scale and vertical integration has allowed them to build one of the all-time great performance teams (as measured by normalized performance relative to competitors of the same era), not only wiping the floor with the competition on benchmarks, but also providing a better experience in ways that no one really measured until recently, like device latency. For a more mundane example of economies of scale, crackers and other food that ships well are cheaper on Amazon than in my local grocery store. It's easy to name ways in which economies of scale benefit the user, but this doesn't mean that we should assume that economies of scale dominate diseconomies of scale in all areas. Although it's beyond the scope of this post, if we're going to talk about whether or not users are better off if companies are larger or smaller, we should look at what gets better when companies get bigger and what gets worse, not just assume that everything will get better just because some things get better (or vice versa).

Coming back to the argument that huge companies have the most resources to spend on moderation, spam, anti-fraud, etc., vs. the reality that they choose to spend those resources elsewhere, like dropping $50B on the Metaverse and not hiring 1.6 million moderators and support staff that they could afford to hire, it makes sense to look at how much effort is being expended. Meta's involvement in Myanmar makes for a nice case study because Erin Kissane wrote up a fairly detailed 40,000 word account of what happened. The entirety of what happened is a large and complicated issue (see appendix for more discussion) but, for the main topic of this post, the key components are that there was an issue that most people can generally agree should be among the highest priority moderation and support issues and that, despite repeated, extremely severe and urgent, warnings to Meta staff at various levels (engineers, directors, VPs, execs, etc.), almost no resources were dedicated to the issue while internal documents indicate that only a small fraction of agreed-upon bad content was caught by their systems (on the order of a few percent). I don't think this is unique to Meta and this matches my experience with other large tech companies, both as a user of their products and as an employee.

To pick a smaller scale example, an acquaintance of mine had their Facebook account compromised and it's now being used for bitcoin scams. The person's name is Samantha K. and some scammer is doing enough scamming that they didn't even bother reading her name properly and have been generating very obviously faked photos where someone holds up a sign and explains how "Kamantha" has helped them make tens or hundreds of thousands of dollars. This is a fairly common move for "hackers" to make and someone else I'm connected to on FB reported that this happened to their account and they haven't been able to recover the old account or even get it banned despite the constant stream of obvious scams being posted by the account.

By comparison, on lobste.rs, I've never seen a scam like this and Peter Bhat Harkins, the head mod, says that they've never had one that he knows of. On Mastodon, I think I might've seen one once in my feed, replies, or mentions. Of course, Mastodon is big enough that you can find some scams if you go looking for them, but the per-message and per-user rates are low enough that you shouldn't encounter them as a normal user. On Twitter (before the acquisition) or reddit, I'd see them moderately frequently, perhaps an average of once every few weeks in my normal feed. On Facebook, I see things like this all the time; I get obvious scam consumer goods sites every shopping season, and the bitcoin scams, both from ads as well as account takeovers, are year-round. Many people have noted that they don't bother reporting these kinds of scams anymore because they've observed that Facebook doesn't take action on their reports. Meanwhile, Reuven Lerner was banned from running Facebook ads on his courses about Python and Pandas, seemingly because Facebook systems "thought" that Reuven was advertising something to do with animal trading (as opposed to programming). This is the fidelity of moderation and spam control that Zuckerberg says cannot be matched by any smaller company. By the way, I don't mean to pick on Meta in particular; if you'd like examples with a slightly different flavor, you can see the appendix of Google examples for a hundred examples of automated systems going awry at Google.

A reason this comes back to being an empirical question is that all of this talk about how economies of scale allow huge companies to bring more resources to bear on the problem only matters if the company chooses to deploy those resources. There's no theoretical force that makes companies deploy resources in these areas, so we can't reason theoretically. But we can observe that the resources deployed aren't sufficient to match the problems, even in cases where people would generally agree that the problem should very obviously be high priority, such as with Meta in Myanmar. Of course, when it comes to issues where the priority is less obvious, resources are also not deployed there.

On the second issue, support, it's a meme among tech folks that the only way to get support as a user of one of the big platforms is to make a viral social media post or know someone on the inside. This compounds the issue of bad moderation, scam detection, anti-fraud, etc., since those issues could be mitigated if support was good.

Normal support channels are a joke, where you either get a generic form letter rejection, or a kafkaesque nightmare followed by a form letter rejection. For example, when Adrian Black was banned from YouTube for impersonating Adrian Black (to be clear, he was banned for impersonating himself, not someone else with the same name), after appealing, he got a response that read

unfortunately, there's not more we can do on our end. your account suspension & appeal were very carefully reviewed & the decision is final

In another Google support story, Simon Weber got the runaround from Google support when he was trying to get information he needed to pay his taxes

accounting data exports for extensions have been broken for me (and I think all extension merchants?) since April 2018 [this was written on Sept 2020]. I had to get the NY attorney general to write them a letter before they would actually respond to my support requests so that I could properly file my taxes

There was also the time YouTube kept demonetizing PointCrow's video of eating water with chopsticks (he repeatedly dips chopsticks into water and then drinks the water, very slowly eating a bowl of water).

Despite responding with things like

we're so sorry about that mistake & the back and fourth [sic], we've talked to the team to ensure it doesn't happen again

He would get demonetized again and appeals would start with the standard support response strategy of saying that they took great care in examining the violation under discussion but, unfortunately, the user clearly violated the policy and therefore nothing can be done:

We have reviewed your appeal ... We reviewed your content carefully, and have confirmed that it violates our violent or graphic content policy ... it's our job to make sure that YouTube is a safe place for all

These are high-profile examples, but of course having a low profile doesn't stop you from getting banned and getting basically the same canned response, like this HN user who was banned for selling a vacuum on FB marketplace. After a number of appeals, he was told

Unfortunately, your account cannot be reinstated due to violating community guidelines. The review is final

When paid support is optional, people often say you won't have these problems if you pay for support, but people who use Google One paid support or Facebook and Instagram's paid creator support generally report that the paid support is no better than the free support. Products that effectively have paid support built-in aren't necessarily better, either. I know people who've gotten the same kind of runaround you get from free Google support with Google Cloud, even when they're working for companies that have 8 or 9 figure a year Google Cloud spend. In one of many examples, the user was seeing that Google must've been dropping packets and Google support kept insisting that the drops were happening in the customer's datacenter despite packet traces showing that this could not possibly be the case. The last I heard, they gave up on that one, but sometimes when an issue is a total showstopper, someone will call up a buddy of theirs at Google to get support because the standard support is often completely ineffective. And this isn't unique to Google — at another cloud vendor, a former colleague of mine was in the room for a conversation where a very senior engineer was asked to look into an issue where a customer was complaining that they were seeing 100% of packets get dropped for a few seconds at a time, multiple times an hour. The engineer responded with something like "it's the cloud, they should deal with it", before being told they couldn't ignore the issue as usual because the issue was coming from [VIP customer] and it was interrupting [one of the world's largest televised sporting events]. That one got fixed, but, odds are, you aren't that important, even if you're paying hundreds of millions a year.

And of course this kind of support isn't unique to cloud vendors. For example, there was this time Stripe held $400k from a customer for over a month without explanation, and every request to support got a response that was as ridiculous as the ones we just looked at. The user availed themself of the only reliable Stripe support mechanism, posting to HN and hoping to hit #1 on the front page, which worked, although many commenters made the usual comments like "Flagged because we are seeing a lot of these on HN, and they seem to be attempts to fraudulently manipulate customer support, rather than genuine stories", with multiple people suggesting or insinuating that the user was doing something illicit or fraudulent, but it turned out that it was an error on Stripe's end, compounded by Stripe's big company support. At one point, the user notes

While I was writing my HN post I was also on chat with Stripe for over an hour. No new information. They were basically trying to shut down the chat with me until I sent them the HN story and showed that it was getting some traction. Then they started working on my issue again and trying to communicate with more people

And then the issue was fixed the next day.

Although, in principle, companies could leverage their economies of scale to deliver more efficient support as they become larger, they instead tend to use their economies of scale to deliver worse, but cheaper and more profitable, support. For example, on Google Play store approval support, a Google employee notes:

a lot of that was outsourced to overseas which resulted in much slower response time. Here stateside we had a lot of metrics in place to fast response. Typically your app would get reviewed the same day. Not sure what it's like now but the managers were incompetent back then even so

And a former FB support person notes:

The big problem here is the division of labor. Those who spend the most time in the queues have the least input as to policy. Analysts are able to raise issues to QAs who can then raise them to Facebook FTEs. It can take months for issues to be addressed, if they are addressed at all. The worst part is that doing the common sense thing and implementing the spirit of the policy, rather than the letter, can have a negative effect on your quality score. I often think about how there were several months during my tenure when most photographs of mutilated animals were allowed on a platform without a warning screen due to a carelessly worded policy "clarification" and there was nothing we could do about it.

If you've ever wondered why your support person is responding nonsensically, sometimes it's the obvious reason that support has been outsourced to someone making $1/hr (when I looked up the standard rates for one country that a lot of support is outsourced to, a fairly standard rate works out to about $1/hr) who doesn't really speak your language and is reading from a flowchart without understanding anything about the system they're giving support for, but another, less obvious, reason is that the support person may be penalized and eventually fired if they take actions that make sense instead of following the nonsensical flowchart that's in front of them.

Coming back to the "they seem to be attempts to fraudulently manipulate customer support, rather than genuine stories" comment, this is a sentiment I've commonly seen expressed by engineers at companies that mete out arbitrary and capricious bans. I'm sympathetic to how people get here. As I noted before I joined Twitter, commenting on public information

Turns out twitter is removing ~1M bots/day. Twitter only has ~300M MAU, making the error tolerance v. low. This seems like a really hard problem ... Gmail's spam filter gives me maybe 1 false positive per 1k correctly classified ham ... Regularly wiping the same fraction of real users in a service would be [bad].

It is actually true that, if you, an engineer, dig into the support queue at some giant company and look at people appealing bans, almost all of the appeals should be denied. But, my experience from having talked to engineers working on things like anti-fraud systems is that many, and perhaps most, round "almost all" to "all", which is both quantitatively and qualitatively different. Having engineers who work on these systems believe that "all" and not "almost all" of their decisions are correct results in bad experiences for users.

For example, there's a social media company that's famous for incorrectly banning users (at least 10% of people I know have lost an account due to incorrect bans and, if I search for a random person I don't know, there's a good chance I get multiple accounts for them, with some recent one that has a profile that reads "used to be @[some old account]", with no forward from the old account to the new one because they're now banned). When I ran into a senior engineer from the team that works on this stuff, I asked him why so many legitimate users get banned and he told me something like "that's not a problem, the real problem is that we don't ban enough accounts. Everyone who's banned deserves it, it's not worth listening to appeals or thinking about them". Of course it's true that most content on every public platform is bad content, spam, etc., so if you have any sort of signal at all on whether or not something is bad content, when you look at it, it's likely to be bad content. But this doesn't mean the converse, that almost no users are banned incorrectly, is true. And if senior people on the team that classifies which content is bad have the attitude that we shouldn't worry about false positives because almost all flagged content is bad, we'll end up with a system that has a large number of false positives. I later asked around to see what had ever been done to reduce false positives in the fraud detection systems and found out that there was no systematic attempt at tracking false positives at all, no way to count cases where employees filed internal tickets to override bad bans, etc. At the meta level, there was some mechanism to decrease the false negative rate (e.g., someone sees bad content that isn't being caught and then adds something to catch more bad content) but, without any sort of tracking of false positives, there was effectively no mechanism to decrease the false positive rate. It's no surprise that this meta system resulted in over 10% of people I know getting incorrect suspensions or bans. And, as Patrick McKenzie says, the optimal rate of false positives isn't zero. But when you have engineers who have the attitude that they've done enough legwork that false positives are impossible, it's basically guaranteed that the false positive rate is higher than optimal. When you combine this with normal big company levels of support, it's a recipe for kafkaesque user experiences.

Another time, I commented on how an announced change in Uber's moderation policy seemed likely to result in false positive bans. An Uber TL immediately took me to task, saying that I was making unwarranted assumptions on how banning works, that Uber engineers go to great lengths to make sure that there are no false positive bans, there's extensive review to make sure that bans are valid and, in fact, the false positive banning I was concerned about could never happen. And then I got effectively banned due to a false positive in a fraud detection system. I was reminded of that incident when Uber incorrectly banned a driver who had to take them to court to even get information on why he was banned, at which point Uber finally actually looked into it (instead of just responding to appeals with fake messages claiming they'd looked into it). Afterwards, Uber responded to a press inquiry with

We are disappointed that the court did not recognize the robust processes we have in place, including meaningful human review, when making a decision to deactivate a driver’s account due to suspected fraud

Of course, in that driver's case, there was no robust process for review, nor was there a robust appeals process for my case. When I contacted support, they didn't really read my message and made some change that broke my account even worse than before. Luckily, I have enough Twitter followers that some Uber engineers saw my tweet about the issue and got me unbanned, but that's not an option that's available to most people, leading to weird stuff like this Facebook ad targeted at Google employees, from someone desperately seeking help with their Google account.

And even when you know someone on the inside, it's not always easy to get the issue fixed because even if the company's effectiveness doesn't increase as the company gets bigger, the complexity of the systems does increase. A nice example of this is Gergely Orosz's story about when the manager of the payments team left Uber and then got banned from Uber due to an inscrutable ML anti-fraud algorithm deciding that the former manager of the payments team was committing payments fraud. It took six months of trying to get the problem fixed to mitigate the issue. And, by the way, they never managed to understand what happened and fix the underlying issue; instead, they added the former manager of the payments team to a special whitelist, not fixing the issue for any other user and, presumably, severely reducing or perhaps even entirely removing payment fraud protections for the former manager's account.

No doubt they would've fixed the underlying issue if it were easy to, but as companies scale up, they produce both technical and non-technical bureaucracy that makes systems opaque even to employees.

Another example of this: at a company that has a ranked social feed, the idea that you could eliminate stuff you didn't want in your ranked feed by adding filters for things like timeline_injection:false, interstitial_ad_op_out, etc., went viral. The first time this happened, a number of engineers looked into it and thought that the viral tricks didn't work. They weren't 100% sure and were relying on reasoning like "no one can recall a system that would do something like this ever being implemented", "if you search the codebase for these strings, they don't appear", and "we looked at the systems we think might do this and they don't appear to do this". There was moderate confidence that the trick didn't work, but no one would state with certainty that it didn't because, as at all large companies, the aggregate behavior of the system is beyond human understanding, and even the parts that could be understood often aren't because there are other priorities.

A few months later, the trick went viral again and people who asked if it was real were generally referred to the last investigation, except that one person actually tried the trick and reported that it worked. They wrote a Slack message about how the trick did work for them, but almost no one noticed. Later, when the trick would go viral yet again, people would point to the discussions about how people thought the trick didn't work, and the message noting that it appears to work (almost certainly not by the mechanism that users think, and instead just because having a long list of filters causes something to time out, or something similar) basically got lost because there's too much information to read all of it.

In my social circles, many people have read James Scott's Seeing Like a State, which is subtitled How Certain Schemes to Improve the Human Condition Have Failed. A key concept from the book is "legibility", what a state can see, and how this distorts what states do. One could easily write a highly analogous book, Seeing Like a Tech Company, about what's illegible to companies that scale up, at least as companies are run today. A simple example of this is that, in many video games, including ones made by game studios that are part of a $3T company, it's easy to get someone suspended or banned by having a bunch of people report the account for bad behavior. What's legible to the game company is the rate of reports; what's not legible is the player's actual behavior (it could be legible, but the company chooses not to have enough people, or skilled enough people, examine actual behavior), and many people have reported similar bannings at social media companies. When it comes to things like anti-fraud systems, what's legible to the company tends to be fairly illegible to humans, even the humans working on the anti-fraud systems themselves.

Although he wasn't specifically talking about an anti-fraud system, in a Special Master's Hearing, Eugene Zarashaw, a director at Facebook, made this comment, which illustrates the illegibility of Facebook's own systems:

It would take multiple teams on the ad side to track down exactly the — where the data flows. I would be surprised if there’s even a single person that can answer that narrow question conclusively

Facebook was unfairly and mostly ignorantly raked over the coals for this statement (we'll discuss that in an appendix), but it is generally true that it's difficult to understand how a system the size of Facebook works.

In principle, companies could augment the legibility of their inscrutable systems by having decently paid support people look into things that might be edge-case issues with severe consequences, where the system is "misunderstanding" what's happening but, in practice, companies pay these support people extremely poorly and hire people who really don't understand what's going on, and then give them instructions which ensure that they generally do not succeed at resolving legibility issues.

One thing that helps the forces of illegibility win at scale is that, as a highly-paid employee of one of these huge companies, it's easy to look at the millions or billions of people (and bots) out there and think of them all as numbers. As the saying goes, "the death of one man is a tragedy; the death of a million is a statistic" and, as we noted, engineers often turn thoughts like "almost all X is fraud" into "all X is fraud, so we might as well just ban everyone who does X and not look at appeals". The culture modern tech companies have of looking for scalable solutions at all costs makes this worse than in other industries even at the same scale, and tech companies also have unprecedented scale.

For example, in response to someone noting that FB Ad Manager claims you can run an ad with a potential reach of 101M people in the U.S. aged 18-34 when the U.S. census had the total population of people aged 18-34 as 76M, the former PM of the ads targeting team responded with

Think at FB scale

And explained that you can't expect slice & dice queries to work for something like the 18-34 demographic in the U.S. at "FB scale". There's a meme at Google that's used ironically in cases like this, where people will say "I can't count that low". Here's the former PM of FB ads saying, non-ironically, "FB can't count that low" for numbers like 100M. Not only does FB not care about any individual user (unless they're famous), this PM claims they can't be bothered to care that groups of 100M people are tracked accurately.

Coming back to the consequences of poor support, a common response to hearing about people getting incorrectly banned from one of these huge services is "Good! Why would you want to use Uber/Amazon/whatever anyway? They're terrible and no one should use them". I disagree with this line of reasoning. For one thing, why should you decide for that person whether or not they should use a service or what's good for them? For another (and this is a large enough topic that it should be its own post, so I'll just mention it briefly and link to this lengthier comment from @whitequark), most services that people write off as unnecessary conveniences that you should just do without are actually a serious accessibility issue for quite a few people (in absolute, if not necessarily percentage, terms). When we're talking about small businesses, those people can often switch to another business, but with things like Uber and Amazon, there are sometimes zero or one alternatives that offer similar convenience and, when there's one, getting banned due to some random system misfiring can happen with the other service as well. For example, in response to many people commenting on how you should just issue a chargeback and accept getting banned from DoorDash when they don't deliver, a disabled user responds:

I'm disabled. Don't have a driver's license or a car. There isn't a bus stop near my apartment, I actually take paratransit to get to work, but I have to plan that a day ahead. Uber pulls the same shit, so I have to cycle through Uber, Door dash, and GrubHub based on who has coupons and hasn't stolen my money lately. Not everyone can just go pick something up.

Also, when talking about this class of issue, involvement is often not voluntary, such as in the case of this Fujitsu bug that incorrectly put people in prison.

On the third issue, the impossibility of getting people to agree on what constitutes spam, fraud, and other disallowed content, we discussed that in detail here. We saw that, even in a trivial case with a single, uncontroversial, simple, rule, people can't agree on what's allowed. And, as you add more rules or add topics that are controversial or scale up the number of people, it becomes even harder to agree on what should be allowed.

To recap, we looked at three areas where diseconomies of scale make moderation, support, anti-fraud, and anti-spam worse as companies get bigger. The first was that, even in cases where there's broad agreement that something is bad, such as fraud/scam/phishing websites in search results, the largest companies with the most sophisticated machine learning can't actually keep up with a single (albeit very skilled) person working on a small search engine. The returns to scammers are much higher if they take on the biggest platforms, which makes the anti-spam/anti-fraud/etc. problem extremely non-linearly hard.

To get an idea of the difference in scale, HN "hellbans" spammers and people who post certain kinds of vitriolic comments. Most spammers don't seem to realize they're hellbanned and will keep posting for a while, so if you browse the "newest" (submissions) page while logged in, you'll see a steady stream of automatically killed stories from these hellbanned users. While there are quite a few of them, the percentage is generally well under half. When we looked at a "mid-sized" big tech company like Twitter circa 2017, based on the public numbers, if spam bots were hellbanned instead of removed, spam is so much more prevalent that it would be nearly all you'd see if you were able to see it. And, as big companies go, 2017-Twitter isn't that big. As we also noted, the former PM of FB ads targeting explained that numbers as low as 100M are in the "I can't count that low" range, too small to care about; to him, basically a rounding error. The non-linear difference in difficulty is much worse for a company like FB or Google. The non-linearity of the difficulty of these problems is, apparently, more than a match for whatever ML or AI techniques Zuckerberg and other tech execs want to brag about.

In testimony in front of Congress, you'll see execs defend the effectiveness of these systems at scale with comments like "we can identify X with 95% accuracy", a statement that may technically be correct, but seems designed to mislead an audience that's presumed to be innumerate. If you use, as a frame of reference, things at a personal scale, 95% might sound quite good. Even for something like HN's scale, 95% accurate spam detection that results in an immediate ban might be sort of alright; even if it's not great, people who get incorrectly banned can just email Dan Gackle, who will unban them. As we noted when we looked at the numbers, 95% accurate detection at Twitter's scale would be horrible (and, indeed, the majority of DMs I get are obvious spam). Either you have to back off and only ban users in cases where you're extremely confident, or you ban all your users before too long and, given how companies like to handle support, appealing means that you'll get a response saying "your case was carefully reviewed and we have determined that you've violated our policies. This is final", even in cases where any sort of cursory review would cause a reversal of the ban, like when you ban a user for impersonating themselves. And then at FB's scale, it's even worse and you'd ban all of your users even more quickly, so you back off further and we end up with things like 100k minors a day being exposed to "photos of adult genitalia or other sexually abusive content".
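As a sketch of the arithmetic (the daily volumes and spam rate below are assumptions picked to show how the scaling works, not real figures for any of these platforms):

    # Illustrative only: why "95% accurate" is fine at forum scale and a
    # disaster at FB scale. All inputs are assumed.
    def daily_false_positives(items_per_day, spam_fraction=0.3, accuracy=0.95):
        # Legitimate items incorrectly flagged per day, assuming the classifier
        # is wrong on (1 - accuracy) of the legitimate items it sees.
        legitimate = items_per_day * (1 - spam_fraction)
        return legitimate * (1 - accuracy)

    for name, volume in [("HN-sized forum", 10_000),
                         ("2017-Twitter-sized platform", 500_000_000),
                         ("FB-sized platform", 5_000_000_000)]:
        print(f"{name}: {daily_false_positives(volume):,.0f} bad flags/day")
    # HN-sized forum:              350         -- one person can review appeals
    # 2017-Twitter-sized platform: 17,500,000
    # FB-sized platform:           175,000,000

The classifier's accuracy doesn't change across the rows, but the absolute number of wrongly flagged items goes from something one moderator can clean up by hand to something no support org is staffed to even look at.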

The second area we looked at was support, which tends to get worse as companies get larger. At a high level, it's fair to say that companies don't care to provide decent support (with Amazon being somewhat of an exception here, especially with AWS, but even on the consumer side). Inside the system, there are individuals who care, but if you look at the fraction of resources expended on support vs. growth or even fun/prestige projects, support is an afterthought. Back when DeepMind was training a StarCraft AI, it's plausible that Alphabet was spending more money playing StarCraft than on support agents (and, if not, just throw in one or two more big AI training projects and you'll get there, especially if you include the amortized cost of developing custom hardware, etc.).

It's easy to see how little big companies care. All you have to do is contact support and get connected to someone who's paid $1/hr to respond to you in a language they barely know, attempting to help solve a problem they don't understand by walking through some flowchart, or appeal an issue and get told "after careful review, we have determined that you have [done the opposite of what you actually did]". In some cases, you don't even need to get that far, like when following Instagram's support instructions results in an infinite loop that takes you back where you started and the "click here if this wasn't you" link returns a 404. I've run into an infinite loop like this once, with Verizon, and it persisted for at least six months. I didn't check after that, but I'd bet on it persisting for years. If you had an onboarding or sign-up page with an issue like this, that would be considered a serious bug that people should prioritize because it impacts growth. But for something like account loss due to scammers taking over accounts, that might get fixed after months or years. Or maybe not.

If you ever talk to people who work in support at a company that really cares about support, it's immediately obvious that they operate completely differently from typical big tech company support, in terms of process as well as culture. Another way you can tell that big companies don't care about support is how often big company employees and execs who've never looked into how support is done, or could be done, will tell you that it's impossible to do better.

When you talk to people who work on support at companies that do actually care about this, it's apparent that it can be done much better. While I was writing this post, I actually did support at a company that does support decently well (for a tech company, adjusted for size, I'd say they're well above 99%-ile), including going through the training and onboarding process for support folks. Executing anything well at scale is non-trivial, so I don't mean to downplay how good their support org is, but the most striking thing to me was how much of the effectiveness of the org naturally followed from caring about providing a good support experience for the user. A full discussion of what that means is too long to include here, so we'll look at this in more detail another time, but one example is that, when we look at how big company support responds, it's often designed to discourage the user from responding ("this review is final") or to justify, putatively to the user, that the company is doing an adequate job ("this was not a purely automated process and each appeal was reviewed by humans in a robust process that ... "). This company's training instructs you to do the opposite of the standard big company "please go away"-style and "we did a great job and have a robust process, therefore complaints are invalid"-style responses. For every anti-pattern you commonly see in support, the training tells you to do the opposite and discusses why the anti-pattern results in a bad user experience. Moreover, the culture has deeply absorbed these ideas (or rather, these ideas come out of the culture) and there are processes for ensuring that people really know what it means to provide good support and follow through on it, support folks have ways to directly talk to the developers who are implementing the product, etc.

If people cared about doing good support, they could talk to people who work in support orgs that are good at helping users or even try working in one before explaining how it's impossible to do better, but this generally isn't done. Their company's support org leadership could do this as well, or do what I did and actually directly work in a support role in an effective support org, but this doesn't happen. If you're a cynic, this all makes sense. In the same way that cynics advise junior employees "big company HR isn't there to help you; their job is to protect the company", a cynic can credibly argue "big company support isn't there to help the user; their job is to protect the company", so of course big companies don't try to understand how companies that are good at supporting users do support because that's not what big company support is for.

The third area we looked at was how it's impossible for people to agree on how a platform should operate and how people's biases mean that they don't understand how difficult a problem this is. For Americans, a prominent case of this is the left- and right-wing conspiracy theories that pop up every time some bug pseudo-randomly causes any kind of service disruption or banning.

In a tweet, Ryan Greeberg joked:

Come work at Twitter, where your bugs TODAY can become conspiracy theories of TOMORROW!

In my social circles, people like to make fun of all of the absurd right-wing conspiracy theories that get passed around after some bug causes people to incorrectly get banned, causes the site not to load, etc., or even when some new ML feature correctly takes down a huge network of scam/spam bots, which also happens to reduce the follower count of some users. But of course this isn't unique to the right, and left-wing thought leaders and politicians come up with their own conspiracy theories as well.

Putting all three of these together (worse detection of issues, worse support, and a harder time reaching agreement on policies), we end up with the situation we noted at the start where, in a poll of my Twitter followers, people who mostly work in tech and are generally fairly technically savvy, only 2.6% of people thought that the biggest companies were the best at moderation and spam/fraud filtering, so it might seem a bit silly to spend so much time belaboring the point. When you sample the U.S. population at large, a larger fraction of people say they believe in conspiracy theories like vaccines putting a microchip in you or that we never landed on the moon, and I don't spend my time explaining why vaccines do not actually put a microchip in you or why it's reasonable to think that we landed on the moon. One reason it's perhaps not so silly is that I've been watching the "only big companies can handle these issues" rhetoric with concern as it catches on among non-technical people, like regulators, lawmakers, and high-ranking government advisors, who often listen to and then regurgitate nonsense. Maybe next time you run into a lay person who tells you that only the largest companies could possibly handle these issues, you can politely point out that there's very strong consensus the other way among tech folks5.

If you're a founder or early-stage startup looking for an auth solution, PropelAuth is targeting your use case. Although they can handle other use cases, they're currently specifically trying to make life easier for pre-launch startups that haven't invested in an auth solution yet. Disclaimer: I'm an investor

Thanks to Gary Bernhardt, Peter Bhat Harkins, Laurence Tratt, Dan Gackle, Sophia Wisdom, David Turner, Yossi Kreinin, Justin Blank, Ben Cox, Horace He, @borzhemsky, Kevin Burke, Bert Muthalaly, Sasuke, anonymous, Zach Manson, Joachim Schipper, Tony D'Souza, and @GL1zdA for comments/corrections/discussion.

Appendix: techniques that only work at small scale

This post has focused on the disadvantages of bigness, but we can also flip this around and look at the advantages of smallness.

As mentioned, the best experiences I've had on platforms are a side effect of doing things that don't scale. One thing that can work well is to have a single person, with a single vision, handling the entire site or, when that's too big, a key feature of the site.

I'm on a number of small discords that have good discussion and essentially zero scams, spam, etc. The strategy for this is simple; the owner of the channel reads every message and bans any scammers or spammers who show up. When you get to a bigger site, like lobste.rs, or even bigger like HN, that's too large for someone to read every message (well, this could be done for lobste.rs but, considering that it's a spare-time pursuit for the owner and the volume of messages, it's not reasonable to expect them to read every message in a short timeframe), but there's still a single person who provides the vision for what should happen, even if the sites are large enough that it's not reasonable to literally read every message. The "no vehicles in the park" problem doesn't apply here because a person decides what the policies should be. You might not like those policies, but you're welcome to find another small forum or start your own (and this is actually how lobste.rs got started: under HN's previous moderation regime, which was known for banning people who disagreed with them, Joshua Stein was banned for publicly disagreeing with an HN policy, so Joshua created lobste.rs and then eventually handed it off to Peter Bhat Harkins).

There's also this story about craigslist in the early days, as it was just getting big enough to have a serious scam and spam problem:

... we were stuck at SFO for something like four hours and getting to spend half a workday sitting next to Craig Newmark was pretty awesome.

I'd heard Craig say in interviews that he was basically just "head of customer service" for Craigslist but I always thought that was a throwaway self-deprecating joke. Like if you ran into Larry Page at Google and he claimed to just be the janitor or guy that picks out the free cereal at Google instead of the cofounder. But sitting next to him, I got a whole new appreciation for what he does. He was going through emails in his inbox, then responding to questions in the craigslist forums, and hopping onto his cellphone about once every ten minutes. Calls were quick and to the point "Hi, this is Craig Newmark from craigslist.org. We are having problems with a customer of your ISP and would like to discuss how we can remedy their bad behavior in our real estate forums". He was literally chasing down forum spammers one by one, sometimes taking five minutes per problem, sometimes it seemed to take half an hour to get spammers dealt with. He was totally engrossed in his work, looking up IP addresses, answering questions best he could, and doing the kind of thankless work I'd never seen anyone else do with so much enthusiasm. By the time we got on our flight he had to shut down and it felt like his giant pile of work got slightly smaller but he was looking forward to attacking it again when we landed.

At some point, if sites grow, they get big enough that a person can't really own every feature and every moderation action on the site, but sites can still get significant value out of having a single person own something that people would normally think is automated. A famous example of this is how the Digg "algorithm" was basically one person:

What made Digg work really was one guy who was a machine. He would vet all the stories, infiltrate all the SEO networks, and basically keep subverting them to keep the Digg front-page usable. Digg had an algorithm, but it was basically just a simple algorithm that helped this one dude 10x his productivity and keep the quality up.

Google came to buy Digg, but figured out that really it's just a dude who works 22 hours a day that keeps the quality up, and all that talk of an algorithm was smoke and mirrors to trick the SEO guys into thinking it was something they could game (they could not, which is why front page was so high quality for so many years). Google walked.

Then the founders realised if they ever wanted to get any serious money out of this thing, they had to fix that. So they developed "real algorithms" that independently attempted to do what this one dude was doing, to surface good/interesting content.

...

It was a total shit-show ... The algorithm to figure out what's cool and what isn't wasn't as good as the dude who worked 22 hours a day, and without his very heavy input, it just basically rehashed all the shit that was popular somewhere else a few days earlier ... Instead of taking this massive slap to the face constructively, the founders doubled-down. And now here we are.

...

Who I am referring to was named Amar (his name is common enough I don't think I'm outing him). He was the SEO whisperer and "algorithm." He was literally like a spy. He would infiltrate the awful groups trying to game the front page and trick them into giving him enough info that he could identify their campaigns early, and kill them. All the while pretending to be an SEO loser like them.

Etsy supposedly used the same strategy as well.

Another class of advantage that small sites have over large ones is that the small site usually doesn't care about being large and can do things that you wouldn't do if you wanted to grow. For example, consider these two comments made in the midst of a large flamewar on HN

My wife spent years on Twitter embroiled in a very long running and bitter political / rights issue. She was always thoughtful, insightful etc. She'd spend 10 minutes rewording a single tweet to make sure it got the real point across in a way that wasn't inflammatory, and that had a good chance of being persuasive. With 5k followers, I think her most popular tweets might get a few hundred likes. The one time she got drunk and angry, she got thousands of supportive reactions, and her followers increased by a large % overnight. And that scared her. She saw the way "the crowd" was pushing her. Rewarding her for the smell of blood in the water.

I've turned off both the flags and flamewar detector on this article now, in keeping with the first rule of HN moderation, which is (I'm repeating myself but it's probably worth repeating) that we moderate HN less, not more, when YC or a YC-funded startup is part of a story ... Normally we would never let a ragestorm like this stay on the front page—there's zero intellectual curiosity here, as the comments demonstrate. This kind of thing is obviously off topic for HN: https://news.ycombinator.com/newsguidelines.html. If it weren't, the site would consist of little else. Equally obvious is that this is why HN users are flagging the story. They're not doing anything different than they normally would.

For a social media site, low-quality high-engagement flamebait is one of the main pillars that drives growth. HN, which cares more about discussion quality than growth, tries to detect and suppress it (with exceptions like criticism of HN itself, of YC companies like Stripe, etc., to ensure a lack of bias). Any social media site that aims to grow does the opposite; it implements a ranked feed that puts the most enraging and most engaging content in front of the people its algorithms predict will be the most enraged and engaged by it. For example, let's say you're in a country with very high racial/religious/factional tensions, with regular calls for violence, etc. What's the most engaging content? Well, that would be content calling for the death of your enemies, so you get things like a livestream of someone calling for the death of the other faction and then grabbing someone and beating them shown to a lot of people. After all, what's more engaging than a beatdown of your sworn enemy? A theme of Broken Code is that someone will find some harmful content they want to suppress, but then get overruled because suppressing it would reduce engagement and growth. HN has no such goal, so it has no problem suppressing or eliminating content that it deems harmful.
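As a purely illustrative sketch (this is not anyone's actual ranking code; the field names, weights, and flamewar heuristic are all invented), the difference in goals comes down to what the scoring function rewards: a growth-oriented feed boosts predicted engagement, which flamebait maximizes, while an HN-style approach demotes an item when a flamewar signal trips.

    # Invented example of the two ranking philosophies described above.
    def engagement_score(item):
        # Growth-oriented: whatever gets the most reactions wins,
        # and enraging content reliably gets reactions.
        return (item["predicted_clicks"]
                + 3 * item["predicted_replies"]
                + 5 * item["predicted_reshares"])

    def curated_score(item):
        # HN-like: upvotes with time decay, demoted when a crude flamewar
        # heuristic (far more comments than upvotes) fires.
        score = item["upvotes"] / (item["age_hours"] + 2) ** 1.8
        if item["comments"] > 2 * item["upvotes"]:
            score *= 0.2
        return score

The point isn't the particular weights; it's that the first function has no term that penalizes a flamewar, because a flamewar is exactly what it's optimizing for.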

Another thing you can do if growth isn't your primary goal is to deliberately make user signups high friction. HN does a little bit of this by having a "login" link but not a "sign up" link, and sites like lobste.rs and metafilter do even more of this.

Appendix: Theory vs. practice

In the main doc, we noted that big company employees often say that it's impossible to provide better support for theoretical reason X, without ever actually looking into how one provides support or what companies that provide good support do. When the now-$1T companies were the size at which many companies do provide good support, they also didn't provide good support, so this doesn't seem to come from size; these huge companies didn't even attempt to provide good support, then or now. This theoretical, plausible-sounding reason doesn't really hold up in practice.

This is generally the case for theoretical discussions on diseconomies of scale at large tech companies. Another example is an idea mentioned at the start of this doc, that being a larger target has a larger impact than having more sophisticated ML. A standard extension of this idea that I frequently hear is that big companies actually do have the best anti-spam and anti-fraud, but they're also subject to the most sophisticated attacks. I've seen this used as a justification for why big companies seem to have worse anti-spam and anti-fraud than a forum like HN. While it's likely true that big companies are subject to the most sophisticated attacks, if this whole idea held and their systems really were that good, it would be harder, in absolute terms, to spam or scam people on reddit and Facebook than on HN, but that's not the case at all.

If you actually try to spam, it's extremely easy to do so on large platforms and the most obvious things you might try will often work. As an experiment, I made a new reddit account and tried to get nonsense onto the front page and found this completely trivial. Similarly, it's completely trivial to take over someone's Facebook account and post obvious scams for months to years, with extremely obvious markers that they're scams and many people replying in concern that the account has been taken over and is running scams (unlike working in support and spamming reddit, I didn't try taking over people's Facebook accounts but, given people's password practices, it's very easy to take over an account and, given how Facebook responds to these takeovers when a friend's account is taken over, we can see that attacks that do the most naive thing possible, with zero sophistication, are not defeated). In absolute terms, it's actually more difficult to get spammy or scammy content in front of eyeballs on HN than it is on reddit or Facebook.

The theoretical reason here is one that would be significant if large companies were even remotely close to doing the kind of job they could do with the resources they have, but we're not even close to being there.

To avoid belaboring the point in this already very long document, I've only listed a couple of examples here, but I find this pattern to hold true of almost every counterargument I've heard on this topic. If you actually look into it a bit, these theoretical arguments are classic cocktail party ideas that have little to no connection to reality.

A meta point here is that you absolutely cannot trust vaguely plausible-sounding arguments from people on this topic since virtually all of them fall apart when examined in practice. It seems quite reasonable to think that a business the size of reddit would have more sophisticated anti-spam systems than HN, which has a single person who both writes the code for the anti-spam systems and does the moderation. But the most naive and simplistic tricks you might use to put content on the front page work on reddit and don't work on HN. I'm not saying you can't defeat HN's system, but doing so would take a little bit of thought, which is not the case for reddit and Facebook. And likewise for support, where, once you start talking to people about how to run a support org that's good for users, you immediately see that the most obvious things have not been seriously tried by big tech companies.

Appendix: How much should we trust journalists' summaries of leaked documents?

Overall, very little. As we discussed when we looked at the Cruise pedestrian accident report, almost every time I read a journalist's take on something (with rare exceptions like Zeynep), the journalist has a spin they're trying to put on the story and the impression you get from reading the story is quite different from the impression you get if you look at the raw source; it's fairly common that there's so much spin that the story says the opposite of what the source docs say. That's one issue.

The full topic here is big enough that it deserves its own document, so we'll just look at two examples. The first is one we briefly looked at, when Eugene Zarashaw, a director at Facebook, testified in a Special Master’s Hearing. He said

It would take multiple teams on the ad side to track down exactly the — where the data flows. I would be surprised if there’s even a single person that can answer that narrow question conclusively

Eugene's testimony resulted in headlines like "Facebook Has No Idea What Is Going on With Your Data", "Facebook engineers admit there’s no way to track all the data it collects on you" (with a stock photo of an overwhelmed person in a nest of cables, grabbing their head), "Facebook Engineers: We Have No Idea Where We Keep All Your Personal Data", etc.

Even without any technical knowledge, any unbiased person can plainly see that these headlines are inaccurate. There's a big difference between it taking work to figure out exactly where all data, direct and derived, for each user exists, and having no idea where the data is. If I Google "Eugene Zarashaw facebook testimony" while logged out with no cookies, every single above-the-fold result I get is misleading, false clickbait like the above.

For most people with relevant technical knowledge, who understand the kind of systems being discussed, Eugene Zarashaw's quote is not only not egregious, it's mundane, expected, and reasonable.

Despite this lengthy disclaimer, there are a few reasons I feel comfortable citing Jeff Horwitz's Broken Code as well as a few stories that cover similar ground. The first is that, if you delete all of the references to these accounts, the points in this doc don't really change, just like they wouldn't change if you deleted 50% of the user stories mentioned here. The second is that, at least for me, the key part is the attitudes on display and not the specific numbers. I've seen similar attitudes in companies I've worked for and heard about them inside companies where I'm well connected via my friends, and I could substitute similar stories from my friends, but it's nice to be able to use already-public sources instead of anonymized stories from my friends, so the quotes about attitude are really just a stand-in for other stories which I can verify. The third reason is a bit too subtle to describe here, so we'll look at that when I expand this disclaimer into a standalone document.

If you're looking for work, Freshpaint is hiring (US remote) in engineering, sales, and recruiting. Disclaimer: I may be biased since I'm an investor, but they seem to have found product-market fit and are rapidly growing.

Appendix: Erin Kissane on Meta in Myanmar

Erin starts with

But once I started to really dig in, what I learned was so much gnarlier and grosser and more devastating than what I’d assumed. The harms Meta passively and actively fueled destroyed or ended hundreds of thousands of lives that might have been yours or mine, but for accidents of birth. I say “hundreds of thousands” because “millions” sounds unbelievable, but by the end of my research I came to believe that the actual number is very, very large.

To make sense of it, I had to try to go back, reset my assumptions, and try build up a detailed, factual understanding of what happened in this one tiny slice of the world’s experience with Meta. The risks and harms in Myanmar—and their connection to Meta’s platform—are meticulously documented. And if you’re willing to spend time in the documents, it’s not that hard to piece together what happened. Even if you never read any further, know this: Facebook played what the lead investigator on the UN Human Rights Council’s Independent International Fact-Finding Mission on Myanmar (hereafter just “the UN Mission”) called a “determining role” in the bloody emergence of what would become the genocide of the Rohingya people in Myanmar.2

From far away, I think Meta’s role in the Rohingya crisis can feel blurry and debatable—it was content moderation fuckups, right? In a country they weren’t paying much attention to? Unethical and probably negligent, but come on, what tech company isn’t, at some point?

As discussed above, I have not looked into the details enough to determine if the claim that Facebook played a "determining role" in genocide is correct, but at a meta-level (no pun intended), it seems plausible. Every comment I've seen that aims to be a direct refutation of Erin's position is actually pre-refuted by Erin in Erin's text, so it appears that very few of the public commenters who disagree with Erin read the articles before commenting (or they've read them and failed to understand what Erin is saying) and, instead, are disagreeing based on something other than the actual content. It reminds me a bit of the responses to David Jackson's proof of the four color theorem. Some people thought it was, finally, a proof, and others thought it wasn't. Something I found interesting at the time was that the people who thought it wasn't a proof had read the paper and thought it seemed flawed, whereas the people who thought it was a proof were going off of signals like David's track record or the prestige of his institution. At the time, without having read the paper myself, I guessed (with low confidence) that the proof was incorrect based on the meta-heuristic that thoughts from people who read the paper were stronger evidence than things like prestige. Similarly, I would guess that Erin's summary is at least roughly accurate and that Erin's endorsement of the UN HRC fact-finding mission is correct, although I have lower confidence in this than in my guess about the proof because making a positive claim like this is harder than finding a flaw and the area is one where evaluating a claim is significantly trickier.

Unlike with Broken Code, the source documents are available here and it would be possible to retrace Erin's steps but, since there's quite a bit of source material, the claims would need additional reading and analysis to be really convincing, and those claims don't play a determining role in the correctness of this document, I'll leave that for somebody else.

On the topic itself, Erin noted that some people at Facebook, when presented with evidence that something bad was happening, laughed it off because they simply couldn't believe that Facebook could be instrumental in something that bad. Ironically, this is fairly similar in tone and content to a lot of the "refutations" of Erin's articles, whose authors appear not to have actually read the articles.

The most substantive objections I've seen are around the edges, such as

The article claims that "Arturo Bejar" was "head of engineering at Facebook", which is simply false. He appears to have been a Director, which is a manager title overseeing (typically) less than 100 people. That isn't remotely close to "head of engineering".

What Erin actually said was

... Arturo Bejar, one of Facebook’s heads of engineering

So the objection is technically incorrect in that it was not said that Arturo Bejar was the head of engineering. And, if you read the entire set of articles, you'll see references like "Susan Benesch, head of the Dangerous Speech Project" and "the head of Deloitte in Myanmar", so it appears that the reason Erin wrote "one of Facebook’s heads of engineering" is that Erin is using the term head colloquially here (note that it isn't capitalized, as a title might be), to mean that Arturo was in charge of something.

There is a form of the above objection that's technically correct — for an engineer at a big tech company, the term Head of Engineering will generally call to mind an executive who all engineers transitively report into (or, in cases where there are large pillars, perhaps one of a few such people). Someone who's fluent in internal tech company lingo would probably not use this phrasing, even when writing for lay people, but this isn't strong evidence of factual errors in the article even if, in an ideal world, journalists would be fluent in the domain-specific connotations of every phrase.

The person's objection continues with

I point this out because I think it calls into question some of the accuracy of how clearly the problem was communicated to relevant people at Facebook.

It isn't enough for someone to tell random engineers or Communications VPs about a complex social problem.

On the topic of this post, diseconomies of scale, this objection, if correct, actually supports the post. According to Arturo's LinkedIn, he was "the leader for Integrity and Care Facebook", and the book Broken Code discusses his role at length, which is very closely related to the topic of Meta in Myanmar. Arturo is not, in fact, a "random engineers or Communications VP".

Anyway, Erin documents that Facebook was repeatedly warned about what was happening, for years. These warnings went well beyond the standard reporting of bad content and fake accounts (although those were also done), and included direct conversations with directors, VPs, and other leaders. These warnings were dismissed and it seems that people thought that their existing content moderation systems were good enough, even in the face of fairly strong evidence that this was not the case.

Reuters notes that one of the examples Schissler gives Meta was a Burmese Facebook Page called, “We will genocide all of the Muslims and feed them to the dogs.” 48

None of this seems to get through to the Meta employees on the line, who are interested in…cyberbullying. Frenkel and Kang write that the Meta employees on the call “believed that the same set of tools they used to stop a high school senior from intimidating an incoming freshman could be used to stop Buddhist monks in Myanmar.”49

Aela Callan later tells Wired that hate speech seemed to be a “low priority” for Facebook, and that the situation in Myanmar, “was seen as a connectivity opportunity rather than a big pressing problem.”50

The details make this sound worse than a small excerpt can convey, so I recommend reading the entire thing but, with respect to the discussion about resources, a key issue is that even after Meta decided to take some kind of action, the result was:

As the Burmese civil society people in the private Facebook group finally learn, Facebook has a single Burmese-speaking moderator—a contractor based in Dublin—to review everything that comes in. The Burmese-language reporting tool is, as Htaike Htaike Aung and Victoire Rio put it in their timeline, “a road to nowhere."

Since this was 2014, it's not fair to say that Meta could've spent the $50B metaverse dollars and hired 1.6 million moderators but, in 2014, Meta was still the 4th largest tech company in the world, worth $217B, with a net profit of $3B/yr; it would've "only" been able to afford something like 100k moderators and support staff if they were paid at a globally very generous loaded cost of $30k/yr (e.g., Jacobin notes that Meta's Kenyan moderators are paid $2/hr and don't get benefits). Myanmar's share of the global population was 0.7% and, even if you consider a developing genocide to be low priority, don't think additional resources should be deployed to prevent or stop it, and only want to allocate a standard population-proportional moderation share, that still leaves capacity for 700 generously paid moderation and support staff for Myanmar.
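Spelling out the arithmetic in the paragraph above (2014 figures as cited there; the loaded cost is the deliberately generous assumption from the text):

    # Back-of-the-envelope numbers from the paragraph above.
    net_profit    = 3_000_000_000   # Meta's 2014 net profit, USD/year
    loaded_cost   = 30_000          # assumed generous cost per moderator, USD/year
    max_staff     = net_profit / loaded_cost       # ~100,000 moderators
    myanmar_share = 0.007                          # ~0.7% of world population
    myanmar_staff = max_staff * myanmar_share      # ~700 moderators
    print(int(max_staff), int(myanmar_staff))      # 100000 700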

On the other side of the fence, there actually were 700 people:

in the years before the coup, it already had an internal adversary in the military that ran a professionalized, Russia-trained online propaganda and deception operation that maxed out at about 700 people, working in shifts to manipulate the online landscape and shout down opposing points of view. It’s hard to imagine that this force has lessened now that the genocidaires are running the country.

These folks didn't have the vaunted technology that Zuckerberg says that smaller companies can't match, but it turns out you don't need billions of dollars of technology when it's 700 on 1 and the 1 is using tools that were developed for a different purpose.

As you'd expect if you've ever interacted with the reporting system for a huge tech company, from the outside, nothing people tried worked:

They report posts and never hear anything. They report posts that clearly call for violence and eventually hear back that they’re not against Facebook’s Community Standards. This is also true of the Rohingya refugees Amnesty International interviews in Bangladesh

In the 40,000 word summary, Erin also digs through whistleblower reports to find things like

…we’re deleting less than 5% of all of the hate speech posted to Facebook. This is actually an optimistic estimate—previous (and more rigorous) iterations of this estimation exercise have put it closer to 3%, and on V&I [violence and incitement] we’re deleting somewhere around 0.6%…we miss 95% of violating hate speech.

and

[W]e do not … have a model that captures even a majority of integrity harms, particularly in sensitive areas … We only take action against approximately 2% of the hate speech on the platform. Recent estimates suggest that unless there is a major change in strategy, it will be very difficult to improve this beyond 10-20% in the short-medium term

and

While Hate Speech is consistently ranked as one of the top abuse categories in the Afghanistan market, the action rate for Hate Speech is worryingly low at 0.23 per cent.

To be clear, I'm not saying that Facebook has a significantly worse rate of catching bad content than other platforms of similar or larger size. As we noted above, large tech companies often have fairly high false positive and false negative rates and have employees who dismiss concerns about this, saying that things are fine.

Appendix: elsewhere

Appendix: Moderation and filtering fails

Since I saw Zuck's statement about how only large companies (and the larger the better) can possibly do good moderation, anti-fraud, anti-spam, etc., I've been collecting links I run across during normal day-to-day browsing of failures by large companies. If I deliberately looked for failures, I'd have a lot more. And, for some reason, some companies don't really trigger my radar for this so, for example, even though I see stories about AirBnB issues all the time, it didn't occur to me to collect them until I started writing this post, so there are only a few AirBnB fails here, even though they'd be up there with Uber in failure count if I'd actually recorded the links I saw.

These are so frequent that at least two out of my eight draft readers ran into an issue while reading the draft of this doc. Peter Bhat Harkins reported:

Well, I received a keychron keyboard a few days ago. I ordered a used K1 v5 (Keychron does small, infrequent production runs so it was out of stock everywhere). I placed the order on KeyChron's official Amazon store, fulfilled by Amazon. After some examination, I've received a v4. It's the previous gen mechanical switch instead of the current optical switch. Someone apparently peeled off the sticker with the model and serial number and one key stabilizer is broken from wear, which strongly implies someone bought a v5 and returned a v4 they already owned. Apparently this is a common scam on Amazon now.

In the other case, an anonymous reader created a Gmail account to use as a shared account for them and their partner, so they could get shared emails from local services. I know a number of people who've done this and it usually works fine, but in their case, after they used this email to set up a few services, Google decided that their account was suspicious:

Verify your identity

We’ve detected unusual activity on the account you’re trying to access. To continue, please follow the instructions below.

Provide a phone number to continue. We’ll send a verification code you can use to sign in.

Providing the phone number they used to sign up for the account resulted in

This phone number has already been used too many times for verification.

For whatever reason, even though this number was provided at account creation, using this apparently illegal number didn't result in the account being banned until it had been used for a while and the email address had been used to sign up for some services. Luckily, these were local services by small companies, so this issue could be fixed by calling them up. I've seen something similar happen with services that don't require you to provide a phone number on sign-up, but then lock and effectively ban the account unless you provide a phone number later, but I've never seen a case where the provided phone number turned out to not work after a day or two. The message above can be read two ways, the other way being that the phone number was allowed but had just recently been used to receive too many verification codes but, in recent history, the phone number had only once been used to receive a code, and that was the verification code necessary to attach a (required) phone number to the account in the first place.

I also had a quality control failure from Amazon, when I ordered a 10 pack of Amazon Basics power strips and the first one I pulled out had its cable covered in solder. I wonder what sort of process could leave solder, likely lead-based solder (although I didn't test it), all over the outside of one of these, and whether I need to wash every Amazon Basics electronics item I get if I don't want lead dust getting all over my apartment. And, of course, since this is constant, I had many spam emails get through Gmail's spam filter and hit my inbox, and multiple ham emails get filtered into spam, including the classic case where I emailed someone and their reply to me went to spam; from having talked to them about it previously, I have no doubt that most of my draft readers who use Gmail also had something similar happen to them and that this is so common they didn't even find it worth remarking on.

Anyway, below, in a few cases, I've mentioned when commenters blame the user even though the issue is clearly not the user's fault. I haven't done this even close to exhaustively, so the lack of such a comment from me shouldn't be read as the lack of the standard "the user must be at fault" response from people.

Google

Facebook (Meta)

Amazon

Microsoft

This includes GitHub, LinkedIn, Activision, etc.

Stripe

Uber

Cloudflare

Shopify

Twitter (X)

I dropped most of the Twitter stories since there are so many after the acquisition that it seems silly to list them, but I've kept a few random ones.

Apple

DoorDash

Walmart

Airbnb

I've seen a ton of these but, for some reason, it didn't occur to me to add them to my list, so I don't have a lot of examples even though I've probably seen three times as many of these as I've seen Uber horror stories.

Appendix: Jeff Horwitz's Broken Code

Below are a few relevant excerpts. This is intended to be analogous to Zvi Mowshowitz's Quotes from Moral Mazes, which gives you an idea of what's in the book but is definitely not a replacement for reading the book. If these quotes are interesting, I recommend reading the book!

The former employees who agreed to speak to me said troubling things from the get-go. Facebook’s automated enforcement systems were flatly incapable of performing as billed. Efforts to engineer growth had inadvertently rewarded political zealotry. And the company knew far more about the negative effects of social media usage than it let on.


as the election progressed, the company started receiving reports of mass fake accounts, bald-faced lies on campaign-controlled pages, and coordinated threats of violence against Duterte critics. After years in politics, Harbath wasn’t naive about dirty tricks. But when Duterte won, it was impossible to deny that Facebook’s platform had rewarded his combative and sometimes underhanded brand of politics. The president-elect banned independent media from his inauguration—but livestreamed the event on Facebook. His promised extrajudicial killings began soon after.

A month after Duterte’s May 2016 victory came the United Kingdom’s referendum to leave the European Union. The Brexit campaign had been heavy on anti-immigrant sentiment and outright lies. As in the Philippines, the insurgent tactics seemed to thrive on Facebook—supporters of the “Leave” camp had obliterated “Remain” supporters on the platform. ... Harbath found all that to be gross, but there was no denying that Trump was successfully using Facebook and Twitter to short-circuit traditional campaign coverage, garnering attention in ways no campaign ever had. “I mean, he just has to go and do a short video on Facebook or Instagram and then the media covers it,” Harbath had marveled during a talk in Europe that spring. She wasn’t wrong: political reporters reported not just the content of Trump’s posts but their like counts.

Did Facebook need to consider making some effort to fact-check lies spread on its platform? Harbath broached the subject with Adam Mosseri, then Facebook’s head of News Feed.

“How on earth would we determine what’s true?” Mosseri responded. Depending on how you looked at it, it was an epistemic or a technological conundrum. Either way, the company chose to punt when it came to lies on its platform.


Zuckerberg believed math was on Facebook’s side. Yes, there had been misinformation on the platform—but it certainly wasn’t the majority of content. Numerically, falsehoods accounted for just a fraction of all news viewed on Facebook, and news itself was just a fraction of the platform’s overall content. That such a fraction of a fraction could have thrown the election was downright illogical, Zuckerberg insisted. ... But Zuckerberg was the boss. Ignoring Kornblut’s advice, he made his case the following day during a live interview at Techonomy, a conference held at the Ritz-Carlton in Half Moon Bay. Calling fake news a “very small” component of the platform, he declared the possibility that it had swung the election “a crazy idea.” ... A favorite saying at Facebook is that “Data Wins Arguments.” But when it came to Zuckerberg’s argument that fake news wasn’t a major problem on Facebook, the company didn’t have any data. As convinced as the CEO was that Facebook was blameless, he had no evidence of how “fake news” came to be, how it spread across the platform, and whether the Trump campaign had made use of it in their Facebook ad campaigns. ... One week after the election, BuzzFeed News reporter Craig Silverman published an analysis showing that, in the final months of the election, fake news had been the most viral election-related content on Facebook. A story falsely claiming that the pope had endorsed Trump had gotten more than 900,000 likes, reshares, and comments—more engagement than even the most widely shared stories from CNN, the New York Times, or the Washington Post. The most popular falsehoods, the story showed, had been in support of Trump.

It was a bombshell. Interest in the term “fake news” spiked on Google the day the story was published—and it stayed high for years, first as Trump’s critics cited it as an explanation for the president-elect’s victory, and then as Trump co-opted the term to denigrate the media at large. ... even as the company’s Communications staff had quibbled with Silverman’s methodology, executives had demanded that News Feed’s data scientists replicate it. Was it really true that lies were the platform’s top election-related content?

A day later, the staffers came back with an answer: almost.

A quick and dirty review suggested that the data BuzzFeed was using had been slightly off, but the claim that partisan hoaxes were trouncing real news in Facebook’s News Feed was unquestionably correct. Bullshit peddlers had a big advantage over legitimate publications—their material was invariably compelling and exclusive. While scores of mainstream news outlets had written rival stories about Clinton’s leaked emails, for instance, none of them could compete with the headline “WikiLeaks CONFIRMS Hillary Sold Weapons to ISIS.”


The engineers weren’t incompetent—just applying often-cited company wisdom that “Done Is Better Than Perfect.” Rather than slowing down, Maurer said, Facebook preferred to build new systems capable of minimizing the damage of sloppy work, creating firewalls to prevent failures from cascading, discarding neglected data before it piled up in server-crashing queues, and redesigning infrastructure so that it could be readily restored after inevitable blowups.

The same culture applied to product design, where bonuses and promotions were doled out to employees based on how many features they “shipped”—programming jargon for incorporating new code into an app. Conducted semiannually, these “Performance Summary Cycle” reviews incented employees to complete products within six months, even if it meant the finished product was only minimally viable and poorly documented. Engineers and data scientists described living with perpetual uncertainty about where user data was being collected and stored—a poorly labeled data table could be a redundant file or a critical component of an important product. Brian Boland, a longtime vice president in Facebook’s Advertising and Partnerships divisions, recalled that a major data-sharing deal with Amazon once collapsed because Facebook couldn’t meet the retailing giant’s demand that it not mix Amazon’s data with its own.

“Building things is way more fun than making things secure and safe,” he said of the company’s attitude. “Until there’s a regulatory or press fire, you don’t deal with it.”


Nowhere in the system was there much place for quality control. Instead of trying to restrict problem content, Facebook generally preferred to personalize users’ feeds with whatever it thought they would want to see. Though taking a light touch on moderation had practical advantages—selling ads against content you don’t review is a great business—Facebook came to treat it as a moral virtue, too. The company wasn’t failing to supervise what users did—it was neutral.

Though the company had come to accept that it would need to do some policing, executives continued to suggest that the platform would largely regulate itself. In 2016, with the company facing pressure to moderate terrorism recruitment more aggressively, Sheryl Sandberg had told the World Economic Forum that the platform did what it could, but that the lasting solution to hate on Facebook was to drown it in positive messages.

“The best antidote to bad speech is good speech,” she declared, telling the audience how German activists had rebuked a Neo-Nazi political party’s Facebook page with “like attacks,” swarming it with messages of tolerance.

Definitionally, the “counterspeech” Sandberg was describing didn’t work on Facebook. However inspiring the concept, interacting with vile content would have triggered the platform to distribute the objectionable material to a wider audience.


... in an internal memo by Andrew “Boz” Bosworth, who had gone from being one of Mark Zuckerberg’s TAs at Harvard to one of his most trusted deputies and confidants at Facebook. Bosworth wrote the memo, titled “The Ugly,” in June 2016, two days after the murder of a Chicago man was inadvertently livestreamed on Facebook. Facing calls for the company to rethink its products, Bosworth was rallying the troops.

“We talk about the good and the bad of our work often. I want to talk about the ugly,” the memo began. Connecting people created obvious good, he said—but doing so at Facebook’s scale would produce harm, whether it was users bullying a peer to the point of suicide or using the platform to organize a terror attack.

That Facebook would inevitably lead to such tragedies was unfortunate, but it wasn’t the Ugly. The Ugly, Boz wrote, was that the company believed in its mission of connecting people so deeply that it would sacrifice anything to carry it out.

“That’s why all the work we do in growth is justified. All the questionable contact importing practices. All the subtle language that helps people stay searchable by friends. All of the work we do to bring more communication in. The work we will likely have to do in China some day. All of it,” Bosworth wrote.


Every team responsible for ranking or recommending content rushed to overhaul their systems as fast as they could, setting off an explosion in the complexity of Facebook’s product. Employees found that the biggest gains often came not from deliberate initiatives but from simple futzing around. Rather than redesigning algorithms, which was slow, engineers were scoring big with quick and dirty machine learning experiments that amounted to throwing hundreds of variants of existing algorithms at the wall and seeing which versions stuck—which performed best with users. They wouldn’t necessarily know why a variable mattered or how one algorithm outperformed another at, say, predicting the likelihood of commenting. But they could keep fiddling until the machine learning model produced an algorithm that statistically outperformed the existing one, and that was good enough.
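
The “throw variants at the wall” workflow described above can be pictured with a toy sketch (the weights, feature names, and engagement simulator below are all invented; nothing here is Facebook’s actual ranking stack): perturb the weights of an existing scoring function many times, measure each variant against an engagement metric, and keep whichever wins, with no requirement to understand why it won.

```python
# Toy sketch of the variant-sweep approach described above: generate many
# perturbed copies of an existing ranking function and keep the statistical
# winner. All weights, feature names, and the engagement "readout" are invented.
import random

random.seed(0)
BASE_WEIGHTS = {"predicted_like": 1.0, "predicted_comment": 2.0, "predicted_reshare": 3.0}

def make_variant(base: dict) -> dict:
    """Randomly jitter each weight by up to +/-30%."""
    return {k: v * random.uniform(0.7, 1.3) for k, v in base.items()}

def simulated_engagement(weights: dict) -> float:
    """Stand-in for an A/B test readout; in reality this would be live traffic."""
    return sum(weights.values()) * random.uniform(0.9, 1.1)

variants = [make_variant(BASE_WEIGHTS) for _ in range(300)]
best = max(variants, key=simulated_engagement)
print("winning weights:", {k: round(v, 2) for k, v in best.items()})
# Nobody needs to know *why* these weights beat the baseline, only that they did.
```
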
... in Facebook’s efforts to deploy a classifier to detect pornography, Arturo Bejar recalled, the system routinely tried to cull images of beds. Rather than learning to identify people screwing, the model had instead taught itself to recognize the furniture on which they most often did ... Similarly fundamental errors kept occurring, even as the company came to rely on far more advanced AI techniques to make far weightier and more complex decisions than “porn/not porn.” The company was going all in on AI, both to determine what people should see and to solve any problems that might arise.
Willner happened to read an NGO report documenting the use of Facebook to groom and arrange meetings with dozens of young girls who were then kidnapped and sold into sex slavery in Indonesia. Zuckerberg was working on his public speaking skills at the time and had asked employees to give him tough questions. So, at an all-hands meeting, Willner asked him why the company had allocated money for its first-ever TV commercial—a recently released ninety-second spot likening Facebook to chairs and other helpful structures—but no budget for a staffer to address its platform’s known role in the abduction, rape, and occasional murder of Indonesian children.

Zuckerberg looked physically ill. He told Willner that he would need to look into the matter ... Willner said, the company was hopelessly behind in the markets where she believed Facebook had the highest likelihood of being misused. When she left Facebook in 2013, she had concluded that the company would never catch up.


Within a few months, Facebook laid off the entire Trending Topics team, sending a security guard to escort them out of the building. A newsroom announcement said that the company had always hoped to make Trending Topics fully automated, and henceforth it would be. If a story topped Facebook’s metrics for viral news, it would top Trending Topics.

The effects of the switch were not subtle. Freed from the shackles of human judgment, Facebook’s code began recommending users check out the commemoration of “National Go Topless Day,” a false story alleging that Megyn Kelly had been sacked by Fox News, and an only-too-accurate story titled “Man Films Himself Having Sex with a McChicken Sandwich.”

Setting aside the feelings of McDonald’s social media team, there were reasons to doubt that the engagement on that final story reflected the public’s genuine interest in sandwich-screwing: much of the engagement was apparently coming from people wishing they’d never seen such accursed content. Still, Zuckerberg preferred it this way. Perceptions of Facebook’s neutrality were paramount; dubious and distasteful was better than biased.

“Zuckerberg said anything that had a human in the loop we had to get rid of as much as possible,” the member of the early polarization team recalled.

Among the early victims of this approach was the company’s only tool to combat hoaxes. For more than a decade, Facebook had avoided removing even the most obvious bullshit, which was less a principled stance and more the only possible option for the startup. “We were a bunch of college students in a room,” said Dave Willner, Charlotte Willner’s husband and the guy who wrote Facebook’s first content standards. “We were radically unequipped and unqualified to decide the correct history of the world.”

But as the company started churning out billions of dollars in annual profit, there were, at least, resources to consider the problem of fake information. In early 2015, the company had announced that it had found a way to combat hoaxes without doing fact-checking—that is, without judging truthfulness itself. It would simply suppress content that users disproportionately reported as false.

Nobody was so naive as to think that this couldn’t get contentious, or that the feature wouldn’t be abused. In a conversation with Adam Mosseri, one engineer asked how the company would deal, for example, with hoax “debunkings” of manmade global warming, which were popular on the American right. Mosseri acknowledged that climate change would be tricky but said that was not cause to stop: “You’re choosing the hardest case—most of them won’t be that hard.”

Facebook publicly revealed its anti-hoax work to little fanfare in an announcement that accurately noted that users reliably reported false news. What it omitted was that users also reported as false any news story they didn’t like, regardless of its accuracy.

To stem a flood of false positives, Facebook engineers devised a workaround: a “whitelist” of trusted publishers. Such safe lists are common in digital advertising, allowing jewelers to buy preauthorized ads on a host of reputable bridal websites, for example, while excluding domains like www.wedddings.com. Facebook’s whitelisting was pretty much the same: they compiled a generously large list of recognized news sites whose stories would be treated as above reproach.
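
A minimal sketch of the demotion logic described above, under stated assumptions: the function name, threshold, and whitelist entries are invented for illustration and are not Facebook’s actual implementation. The idea is simply that links users disproportionately report as false get suppressed, unless the domain is on the trusted-publisher list.

```python
# Hedged sketch of report-driven hoax demotion gated by a publisher whitelist.
# Domains, threshold, and function name are illustrative assumptions.
TRUSTED_DOMAINS = {"nytimes.com", "washingtonpost.com", "cnn.com"}
FALSE_REPORT_THRESHOLD = 0.02  # false reports per view, chosen arbitrarily

def should_demote(domain: str, views: int, false_reports: int) -> bool:
    """Suppress links users disproportionately report as false,
    unless the publisher is whitelisted (treated as above reproach)."""
    if views == 0 or domain in TRUSTED_DOMAINS:
        return False
    return false_reports / views > FALSE_REPORT_THRESHOLD

# A disliked-but-accurate story reported en masse: whitelisted, so untouched.
print(should_demote("nytimes.com", views=100_000, false_reports=5_000))        # False
# An unknown domain with the same report rate gets demoted.
print(should_demote("wedddings.example", views=100_000, false_reports=5_000))  # True
```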

The solution was inelegant, and it could disadvantage obscure publishers specializing in factual but controversial reporting. Nonetheless, it effectively diminished the success of false viral news on Facebook. That is, until the company faced accusations of bias surrounding Trending Topics. Then Facebook preemptively turned it off.

The disabling of Facebook’s defense against hoaxes was part of the reason fake news surged in the fall of 2016.


Gomez-Uribe’s team hadn’t been tasked with working on Russian interference, but one of his subordinates noted something unusual: some of the most hyperactive accounts seemed to go entirely dark on certain days of the year. Their downtime, it turned out, corresponded with a list of public holidays in the Russian Federation.

“They respect holidays in Russia?” he recalled thinking. “Are we all this fucking stupid?”

But users didn’t have to be foreign trolls to promote problem posts. An analysis by Gomez-Uribe’s team showed that a class of Facebook power users tended to favor edgier content, and they were more prone to extreme partisanship. They were also, hour to hour, more prolific—they liked, commented, and reshared vastly more content than the average user. These accounts were outliers, but because Facebook recommended content based on aggregate engagement signals, they had an outsized effect on recommendations. If Facebook was a democracy, it was one in which everyone could vote whenever they liked and as frequently as they wished. ... hyperactive users tended to be more partisan and more inclined to share misinformation, hate speech, and clickbait.


At Facebook, he realized, nobody was responsible for looking under the hood. “They’d trust the metrics without diving into the individual cases,” McNally said. “It was part of the ‘Move Fast’ thing. You’d have hundreds of launches every year that were only driven by bottom-line metrics.”

Something else worried McNally. Facebook’s goal metrics tended to be calculated in averages.

“It is a common phenomenon in statistics that the average is volatile, so certain pathologies could fall straight out of the geometry of the goal metrics,” McNally said. In his own reserved, mathematically minded way, he was calling Facebook’s most hallowed metrics crap. Making decisions based on metrics alone, without carefully studying the effects on actual humans, was reckless. But doing it based on average metrics was flat-out stupid. An average could rise because you did something that was broadly good for users, or it could go up because normal people were using the platform a tiny bit less and a small number of trolls were using Facebook way more.

Everyone at Facebook understood this concept—it’s the difference between median and mean, a topic that is generally taught in middle school. But, in the interest of expediency, Facebook’s core metrics were all based on aggregate usage. It was as if a biologist was measuring the strength of an ecosystem based on raw biomass, failing to distinguish between healthy growth and a toxic algae bloom.
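
To make the averages problem concrete, here is a toy example with invented numbers (not Facebook data): a goal metric defined as the mean can rise even while the typical user, measured by the median, is pulling back, because a handful of hyperactive accounts ramp up their activity.

```python
# Toy illustration: the mean rises while the median user declines.
# All numbers are invented for illustration.
from statistics import mean, median

# Daily engagement events per user: 99 ordinary users and 1 hyperactive account.
before = [10] * 99 + [200]
after = [9] * 99 + [800]  # ordinary users dip slightly; the outlier goes wild

print(f"mean before:   {mean(before):.1f}   mean after:   {mean(after):.1f}")
print(f"median before: {median(before):.1f}   median after: {median(after):.1f}")
# The average climbs from 11.9 to 16.9 even though 99 of 100 users are less
# engaged; a goal metric based on the average would call this a win.
```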


One distinguishing feature was the shamelessness of fake news publishers’ efforts to draw attention. Along with bad information, their pages invariably featured clickbait (sensationalist headlines) and engagement bait (direct appeals for users to interact with content, thereby spreading it further).

Facebook already frowned on those hype techniques as a little spammy, but truth be told it didn’t really do much about them. How much damage could a viral “Share this if you support the troops” post cause?


Facebook’s mandate to respect users’ preferences posed another challenge. According to the metrics the platform used, misinformation was what people wanted. Every metric that Facebook used showed that people liked and shared stories with sensationalistic and misleading headlines.

McNally suspected the metrics were obscuring the reality of the situation. His team set out to demonstrate that this wasn’t actually true. What they found was that, even though users routinely engaged with bait content, they agreed in surveys that such material was of low value to them. When informed that they had shared false content, they experienced regret. And they generally considered fact-checks to contain useful information.


every time a well-intentioned proposal of that sort blew up in the company’s face, the people working on misinformation lost a bit of ground. In the absence of a coherent, consistent set of demands from the outside world, Facebook would always fall back on the logic of maximizing its own usage metrics.

“If something is not going to play well when it hits mainstream media, they might hesitate when doing it,” McNally said. “Other times we were told to take smaller steps and see if anybody notices. The errors were always on the side of doing less.” ... “For people who wanted to fix Facebook, polarization was the poster child of ‘Let’s do some good in the world,’ ” McNally said. “The verdict came back that Facebook’s goal was not to do that work.”


When the ranking team had begun its work, there had been no question that Facebook was feeding its users overtly false information at a rate that vastly outstripped any other form of media. This was no longer the case (even though the company would be raked over the coals for spreading “fake news” for years to come).

Ironically, Facebook was in a poor position to boast about that success. With Zuckerberg having insisted throughout that fake news accounted for only a trivial portion of content, Facebook couldn’t celebrate that it might be on the path to making the claim true.


multiple members of both teams recalled having had the same response when they first learned of MSI’s new engagement weightings: it was going to make people fight. Facebook’s good intent may have been genuine, but the idea that turbocharging comments, reshares, and emojis would have unpleasant effects was pretty obvious to people who had, for instance, worked on Macedonian troll farms, sensationalism, and hateful content.

Hyperbolic headlines and outrage bait were already well-recognized digital publishing tactics, on and off Facebook. They traveled well, getting reshared in long chains. Giving a boost to content that galvanized reshares was going to add an exponential component to the already-healthy rate at which such problem content spread. At a time when the company was trying to address purveyors of misinformation, hyperpartisanship, and hate speech, it had just made their tactics more effective.

Multiple leaders inside Facebook’s Integrity team raised concerns about MSI with Hegeman, who acknowledged the problem and committed to trying to fine-tune MSI later. But adopting MSI was a done deal, he said—Zuckerberg’s orders.

Even non-Integrity staffers recognized the risk. When a Growth team product manager asked if the change meant News Feed would favor more controversial content, the manager of the team responsible for the work acknowledged it very well could.


The effect was more than simply provoking arguments among friends and relatives. As a Civic Integrity researcher would later report back to colleagues, Facebook’s adoption of MSI appeared to have gone so far as to alter European politics. “Engagement on positive and policy posts has been severely reduced, leaving parties increasingly reliant on inflammatory posts and direct attacks on their competitors,” a Facebook social scientist wrote after interviewing political strategists about how they used the platform. In Poland, the parties described online political discourse as “a social-civil war.” One party’s social media management team estimated that they had shifted the proportion of their posts from 50/50 positive/negative to 80 percent negative and 20 percent positive, explicitly as a function of the change to the algorithm. Major parties blamed social media for deepening political polarization, describing the situation as “unsustainable.”

The same was true of parties in Spain. “They have learnt that harsh attacks on their opponents net the highest engagement,” the researcher wrote. “From their perspective, they are trapped in an inescapable cycle of negative campaigning by the incentive structures of the platform.”

If Facebook was making politics more combative, not everyone was upset about it. Extremist parties proudly told the researcher that they were running “provocation strategies” in which they would “create conflictual engagement on divisive issues, such as immigration and nationalism.”

To compete, moderate parties weren’t just talking more confrontationally. They were adopting more extreme policy positions, too. It was a matter of survival. “While they acknowledge they are contributing to polarization, they feel like they have little choice and are asking for help,” the researcher wrote.


Facebook’s most successful publishers of political content were foreign content farms posting absolute trash, stuff that made About.com’s old SEO chum look like it belonged in the New Yorker.

Allen wasn’t the first staffer to notice the quality problem. The pages were an outgrowth of the fake news publishers that Facebook had battled in the wake of the 2016 election. While fact-checks and other crackdown efforts had made it far harder for outright hoaxes to go viral, the publishers had regrouped. Some of the same entities that BuzzFeed had written about in 2016—teenagers from a small Macedonian mountain town called Veles—were back in the game. How had Facebook’s news distribution system been manipulated by kids in a country with a per capita GDP of $5,800?


When reviewing troll farm pages, he noticed something—their posts usually went viral. This was odd. Competition for space in users’ News Feeds meant that most pages couldn’t reliably get their posts in front of even those people who deliberately chose to follow them. But with the help of reshares and the News Feed algorithms, the Macedonian troll farms were routinely reaching huge audiences. If having a post go viral was hitting the attention jackpot, then the Macedonians were winning every time they put a buck into Facebook’s slot machine.

The reason the Macedonians’ content was so good was that it wasn’t theirs. Virtually every post was either aggregated or stolen from somewhere else on the internet. Usually such material came from Reddit or Twitter, but the Macedonians were just ripping off content from other Facebook pages, too, and reposting it to their far larger audiences. This worked because, on Facebook, originality wasn’t an asset; it was a liability. Even for talented content creators, most posts turned out to be duds. But things that had already gone viral nearly always would do so again.


Allen began a note about the problem from the summer of 2018 with a reminder. “The mission of Facebook is to empower people to build community. This is a good mission,” he wrote, before arguing that the behavior he was describing exploited attempts to do that. As an example, Allen compared a real community—a group known as the National Congress of American Indians. The group had clear leaders, produced original programming, and held offline events for Native Americans. But, despite NCAI’s earnest efforts, it had far fewer fans than a page titled “Native American Proub” [sic] that was run out of Vietnam. The page’s unknown administrators were using recycled content to promote a website that sold T-shirts.

“They are exploiting the Native American Community,” Allen wrote, arguing that, even if users liked the content, they would never choose to follow a Native American pride page that was secretly run out of Vietnam. As proof, he included an appendix of reactions from users who had wised up. “If you’d like to read 300 reviews from real users who are very upset about pages that exploit the Native American community, here is a collection of 1 star reviews on Native American ‘Community’ and ‘Media’ pages,” he concluded.

This wasn’t a niche problem. It was increasingly the default state of pages in every community. Six of the top ten Black-themed pages—including the number one page, “My Baby Daddy Ain’t Shit”—were troll farms. The top fourteen English-language Christian- and Muslim-themed pages were illegitimate. A cluster of troll farms peddling evangelical content had a combined audience twenty times larger than the biggest authentic page.

“This is not normal. This is not healthy. We have empowered inauthentic actors to accumulate huge followings for largely unknown purposes,” Allen wrote in a later note. “Mostly, they seem to want to skim a quick buck off of their audience. But there are signs they have been in contact with the IRA.”

So how bad was the problem? A sampling of Facebook publishers with significant audiences found that a full 40 percent relied on content that was either stolen, aggregated, or “spun”—meaning altered in a trivial fashion. The same thing was true of Facebook video content. One of Allen’s colleagues found that 60 percent of video views went to aggregators.

The tactics were so well-known that, on YouTube, people were putting together instructional how-to videos explaining how to become a top Facebook publisher in a matter of weeks. “This is where I’m snagging videos from YouTube and I’ll re-upload them to Facebook,” said one guy in a video Allen documented, noting that it wasn’t strictly necessary to do the work yourself. “You can pay 20 dollars on Fiverr for a compilation—‘Hey, just find me funny videos on dogs, and chain them together into a compilation video.’ ”

Holy shit, Allen thought. Facebook was losing in the later innings of a game it didn’t even understand it was playing. He branded the set of winning tactics “manufactured virality.”

“What’s the easiest (lowest effort) way to make a big Facebook Page?” Allen wrote in an internal slide presentation. “Step 1: Find an existing, engaged community on [Facebook]. Step 2: Scrape/Aggregate content popular in that community. Step 3: Repost the most popular content on your Page.”


Allen’s research kicked off a discussion. That a top page for American Vietnam veterans was being run from overseas—from Vietnam, no less—was just flat-out embarrassing. And unlike killing off Page Like ads, which had been a nonstarter for the way it alienated certain internal constituencies, if Allen and his colleagues could work up ways to systematically suppress trash content farms—material that was hardly exalted by any Facebook team—getting leadership to approve them might be a real possibility.

This was where Allen ran up against that key Facebook tenet, “Assume Good Intent.” The principle had been applied to colleagues, but it was meant to be just as applicable to Facebook’s billions of users. In addition to being a nice thought, it was generally correct. The overwhelming majority of people who use Facebook do so in the name of connection, entertainment, and distraction, and not to deceive or defraud. But, as Allen knew from experience, the motto was hardly a comprehensive guide to living, especially when money was involved.


With the help of another data scientist, Allen documented the inherent traits of crap publishers. They aggregated content. They went viral too consistently. They frequently posted engagement bait. And they relied on reshares from random users, rather than cultivating a dedicated long-term audience.

None of these traits warranted severe punishment by itself. But together they added up to something damning. A 2019 screening for these features found 33,000 entities—a scant 0.175 percent of all pages—that were receiving a full 25 percent of all Facebook page views. Virtually none of them were “managed,” meaning controlled by entities that Facebook’s Partnerships team considered credible media professionals, and they accounted for just 0.14 percent of Facebook revenue.
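
Allen’s screening can be pictured as stacking weak signals. A rough sketch with invented traits, cutoffs, and example numbers (none of these are the actual 2019 criteria) shows how four individually forgivable behaviors can combine into a damning flag.

```python
# Illustrative sketch of screening pages for the traits described above.
# The fields, cutoffs, and example numbers are assumptions for demonstration.
from dataclasses import dataclass

@dataclass
class PageStats:
    name: str
    aggregated_share: float       # fraction of posts scraped/reposted from elsewhere
    viral_hit_rate: float         # fraction of posts that go viral
    engagement_bait_rate: float   # fraction of posts with "share if you agree" asks
    reshare_traffic_share: float  # views from reshares vs. a dedicated audience

def trait_count(p: PageStats) -> int:
    """Count how many of the four suspect traits a page exhibits."""
    return sum([
        p.aggregated_share > 0.5,
        p.viral_hit_rate > 0.3,
        p.engagement_bait_rate > 0.2,
        p.reshare_traffic_share > 0.6,
    ])

pages = [
    PageStats("authentic community page", 0.05, 0.02, 0.01, 0.10),
    PageStats("content-farm page", 0.95, 0.60, 0.40, 0.85),
]
for p in pages:
    flagged = trait_count(p) >= 3  # no single trait is damning; together they are
    print(f"{p.name}: traits={trait_count(p)}, flagged={flagged}")
```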


After it was bought, CrowdTangle was no longer a company but a product, available to media companies at no cost. However angry publishers were with Facebook, they loved Silverman’s product. The only mandate Facebook gave him was for his team to keep building things that made publishers happy. Savvy reporters looking for viral story fodder loved it, too. CrowdTangle could surface, for instance, an up-and-coming post about a dog that saved its owner’s life, material that was guaranteed to do huge numbers on social media because it was already heading in that direction.

CrowdTangle invited its formerly paying media customers to a party in New York to celebrate the deal. One of the media executives there asked Silverman whether Facebook would be using CrowdTangle internally as an investigative tool, a question that struck Silverman as absurd. Yes, it had offered social media platforms an early window into their own usage. But Facebook’s staff now outnumbered his own by several thousand to one. “I was like, ‘That’s ridiculous—I’m sure whatever they have is infinitely more powerful than what we have!’ ”

It took Silverman more than a year to reconsider that answer.


It was only as CrowdTangle started building tools to do this that the team realized just how little Facebook knew about its own platform. When Media Matters, a liberal media watchdog, published a report showing that MSI had been a boon for Breitbart, Facebook executives were genuinely surprised, sending around the article asking if it was true. As any CrowdTangle user would have known, it was.

Silverman thought the blindness unfortunate, because it prevented the company from recognizing the extent of its quality problem. It was the same point that Jeff Allen and a number of other Facebook employees had been hammering on. As it turned out, the person to drive it home wouldn’t come from inside the company. It would be Jonah Peretti, the CEO of BuzzFeed.

BuzzFeed had pioneered the viral publishing model. While “listicles” earned the publication a reputation for silly fluff in its early days, Peretti’s staff operated at a level of social media sophistication far above most media outlets, stockpiling content ahead of snowstorms and using CrowdTangle to find quick-hit stories that drew giant audiences.

In the fall of 2018, Peretti emailed Cox with a grievance: Facebook’s Meaningful Social Interactions ranking change was pressuring his staff to produce scuzzier content. BuzzFeed could roll with the punches, Peretti wrote, but nobody on his staff would be happy about it. Distinguishing himself from publishers who just whined about lost traffic, Peretti cited one of his platform’s recent successes: a compilation of tweets titled “21 Things That Almost All White People Are Guilty of Saying.” The list—which included “whoopsie daisy,” “get these chips away from me,” and “guilty as charged”—had performed fantastically on Facebook. What bothered Peretti was the apparent reason why. Thousands of users were brawling in the comments section over whether the item itself was racist.

“When we create meaningful content, it doesn’t get rewarded,” Peretti told Cox. Instead, Facebook was promoting “fad/junky science,” “extremely disturbing news,” “gross images,” and content that exploited racial divisions, according to a summary of Peretti’s email that circulated among Integrity staffers. Nobody at BuzzFeed liked producing that junk, Peretti wrote, but that was what Facebook was demanding. (In an illustration of BuzzFeed’s willingness to play the game, a few months later it ran another compilation titled “33 Things That Almost All White People Are Guilty of Doing.”)


As users’ News Feeds became dominated by reshares, group posts, and videos, the “organic reach” of celebrity pages began tanking. “My artists built up a fan base and now they can’t reach them unless they buy ads,” groused Travis Laurendine, a New Orleans–based music promoter and technologist, in a 2019 interview. A page with 10,000 followers would be lucky to reach more than a tiny percent of them.

Explaining why a celebrity’s Facebook reach was dropping even as they gained followers was hell for Partnerships, the team tasked with providing VIP service to notable users and selling them on the value of maintaining an active presence on Facebook. The job boiled down to convincing famous people, or their social media handlers, that if they followed a set of company-approved best practices, they would reach their audience. The problem was that those practices, such as regularly posting original content and avoiding engagement bait, didn’t actually work. Actresses who were the center of attention on the Oscars’ red carpet would have their posts beaten out by a compilation video of dirt bike crashes stolen from YouTube. ... Over time, celebrities and influencers began drifting off the platform, generally to sister company Instagram. “I don’t think people ever connected the dots,” Boland said.


“Sixty-four percent of all extremist group joins are due to our recommendation tools,” the researcher wrote in a note summarizing her findings. “Our recommendation systems grow the problem.”

This sort of thing was decidedly not supposed to be Civic’s concern. The team existed to promote civic participation, not police it. Still, a longstanding company motto was that “Nothing Is Someone Else’s Problem.” Chakrabarti and the researcher team took the findings to the company’s Protect and Care team, which worked on things like suicide prevention and bullying and was, at that point, the closest thing Facebook had to a team focused on societal problems.

Protect and Care told Civic there was nothing it could do. The accounts creating the content were real people, and Facebook intentionally had no rules mandating truth, balance, or good faith. This wasn’t someone else’s problem—it was nobody’s problem.


Even if the problem seemed large and urgent, exploring possible defenses against bad-faith viral discourse was going to be new territory for Civic, and the team wanted to start off slow. Cox clearly supported the team’s involvement, but studying the platform’s defenses against manipulation would still represent moonlighting from Civic’s main job, which was building useful features for public discussion online.

A few months after the 2016 election, Chakrabarti made a request of Zuckerberg. To build tools to study political misinformation on Facebook, he wanted two additional engineers on top of the eight he already had working on boosting political participation.

“How many engineers do you have on your team right now?” Zuckerberg asked. Chakrabarti told him. “If you want to do it, you’re going to have to come up with the resources yourself,” the CEO said, according to members of Civic. Facebook had more than 20,000 engineers—and Zuckerberg wasn’t willing to give the Civic team two of them to study what had happened during the election.


While acknowledging the possibility that social media might not be a force for universal good was a step forward for Facebook, discussing the flaws of the existing platform remained difficult even internally, recalled product manager Elise Liu.

“People don’t like being told they’re wrong, and they especially don’t like being told that they’re morally wrong,” she said. “Every meeting I went to, the most important thing to get in was ‘It’s not your fault. It happened. How can you be part of the solution? Because you’re amazing.’ ”


“We do not and possibly never will have a model that captures even a majority of integrity harms, particularly in sensitive areas,” one engineer would write, noting that the company’s classifiers could identify only 2 percent of prohibited hate speech with enough precision to remove it.

Inaction on the overwhelming majority of content violations was unfortunate, Rosen said, but not a reason to change course. Facebook’s bar for removing content was akin to the standard of guilt beyond a reasonable doubt applied in criminal cases. Even limiting a post’s distribution should require a preponderance of evidence. The combination of inaccurate systems and a high burden of proof would inherently mean that Facebook generally didn’t enforce its own rules against hate, Rosen acknowledged, but that was by design.

“Mark personally values free expression first and foremost and would say this is a feature, not a bug,” he wrote.

Publicly, the company declared that it had zero tolerance for hate speech. In practice, however, the company’s failure to meaningfully combat it was viewed as unfortunate—but highly tolerable.


Myanmar, ruled by a military junta that exercised near-complete control until 2011, was the sort of place where Facebook was rapidly filling in for the civil society that the government had never allowed to develop. The app offered telecommunications services, real-time news, and opportunities for activism to a society unaccustomed to them.

In 2012, ethnic violence between the country’s dominant Buddhist majority and its Rohingya Muslim minority left around two hundred people dead and prompted tens of thousands of people to flee their homes. To many, the dangers posed by Facebook in the situation seemed obvious, including to Aela Callan, a journalist and documentary filmmaker who brought them to the attention of Elliot Schrage in Facebook’s Public Policy division in 2013. All the like-minded Myanmar Cassandras received a polite audience in Menlo Park, and little more. Their argument that Myanmar was a tinderbox was validated in 2014, when a hardline Buddhist monk posted a false claim on Facebook that a Rohingya man had raped a Buddhist woman, a provocation that produced clashes, killing two people. But with the exception of Bejar’s Compassion Research team and Cox—who was personally interested in Myanmar, privately funding independent news media there as a philanthropic endeavor—nobody at Facebook paid a great deal of attention.

Later accounts of the ignored warnings led many of the company’s critics to attribute Facebook’s inaction to pure callousness, though interviews with those involved in the cleanup suggest that the root problem was incomprehension. Human rights advocates were telling Facebook not just that its platform would be used to kill people but that it already had. At a time when the company assumed that users would suss out and shut down misinformation without help, however, the information proved difficult to absorb. The version of Facebook that the company’s upper ranks knew—a patchwork of their friends, coworkers, family, and interests—couldn’t possibly be used as a tool of genocide.

Facebook eventually hired its first Burmese-language content reviewer to cover whatever issues arose in the country of more than 50 million in 2015, and released a packet of flower-themed, peace-promoting digital stickers for Burmese users to slap on hateful posts. (The company would later note that the stickers had emerged from discussions with nonprofits and were “widely celebrated by civil society groups at the time.”) At the same time, it cut deals with telecommunications providers to provide Burmese users with Facebook access free of charge.

The first wave of ethnic cleansing began later that same year, with leaders of the country’s military announcing on Facebook that they would be “solving the problem” of the country’s Muslim minority. A second wave of violence followed and, in the end, 25,000 people were killed by the military and Buddhist vigilante groups, 700,000 were forced to flee their homes, and thousands more were raped and injured. The UN branded the violence a genocide.

Facebook still wasn’t responding. On its own authority, Gomez-Uribe’s News Feed Integrity team began collecting examples of the platform giving massive distribution to statements inciting violence. Even without Burmese-language skills, it wasn’t difficult. The torrent of anti-Rohingya hate and falsehoods from the Burmese military, government shills, and firebrand monks was not just overwhelming but overwhelmingly successful.

This was exploratory work, not on the Integrity Ranking team’s half-year roadmap. When Gomez-Uribe, along with McNally and others, pushed to reassign staff to better grasp the scope of Facebook’s problem in Myanmar, they were shot down.

“We were told no,” Gomez-Uribe recalled. “It was clear that leadership didn’t want to understand it more deeply.”

That changed, as it so often did, when Facebook’s role in the problem became public. A couple of weeks after the worst violence broke out, an international human rights organization condemned Facebook for inaction. Within seventy-two hours, Gomez-Uribe’s team was urgently asked to figure out what was going on.

When it was all over, Facebook’s negligence was clear. A UN report declared that “the response of Facebook has been slow and ineffective,” and an external human rights consultant that Facebook hired eventually concluded that the platform “has become a means for those seeking to spread hate and cause harm.”

In a series of apologies, the company acknowledged that it had been asleep at the wheel and pledged to hire more staffers capable of speaking Burmese. Left unsaid was why the company screwed up. The truth was that it had no idea what was happening on its platform in most countries.


Barnes was put in charge of “meme busting”—that is, combating the spread of viral hoaxes about Facebook, on Facebook. No, the company was not going to claim permanent rights to all your photos unless you reshared a post warning of the threat. And no, Zuckerberg was not giving away money to the people who reshared a post saying so. Suppressing these digital chain letters had an obvious payoff; they tarred Facebook’s reputation and served no purpose.

Unfortunately, restricting the distribution of this junk via News Feed wasn’t enough to sink it. The posts also spread via Messenger, in large part because the messaging platform was prodding recipients of the messages to forward them on to a list of their friends.

The Advocacy team that Barnes had worked on sat within Facebook’s Growth division, and Barnes knew the guy who oversaw Messenger forwarding. Armed with data showing that the current forwarding feature was flooding the platform with anti-Facebook crap, he arranged a meeting.

Barnes’s colleague heard him out, then raised an objection.

“It’s really helping us with our goals,” the man said of the forwarding feature, which allowed users to reshare a message to a list of their friends with just a single tap. Messenger’s Growth staff had been tasked with boosting the number of “sends” that occurred each day. They had designed the forwarding feature to encourage precisely the impulsive sharing that Barnes’s team was trying to stop.

Barnes hadn’t so much lost a fight over Messenger forwarding as failed to even start one. At a time when the company was trying to control damage to its reputation, it was also being intentionally agnostic about whether its own users were slandering it. What was important was that they shared their slander via a Facebook product.

“The goal was in itself a sacred thing that couldn’t be questioned,” Barnes said. “They’d specifically created this flow to maximize the number of times that people would send messages. It was a Ferrari, a machine designed for one thing: infinite scroll.”


Entities like Liftable Media, a digital media company run by longtime Republican operative Floyd Brown, had built an empire on pages that began by spewing upbeat clickbait, then pivoted to supporting Trump ahead of the 2016 election. To compound its growth, Liftable began buying up other spammy political Facebook pages with names like “Trump Truck,” “Patriot Update,” and “Conservative Byte,” running its content through them.

In the old world of media, the strategy of managing loads of interchangeable websites and Facebook pages wouldn’t make sense. For both economies of scale and to build a brand, print and video publishers targeted each audience through a single channel. (The publisher of Cat Fancy might expand into Bird Fancy, but was unlikely to cannibalize its audience by creating a near-duplicate magazine called Cat Enthusiast.)

That was old media, though. On Facebook, flooding the zone with competing pages made sense because of some algorithmic quirks. First, the algorithm favored variety. To prevent a single popular and prolific content producer from dominating users’ feeds, Facebook blocked any publisher from appearing too frequently. Running dozens of near-duplicate pages sidestepped that, giving the same content more bites at the apple.

Coordinating a network of pages provided a second, greater benefit. It fooled a News Feed feature that promoted virality. News Feed had been designed to favor content that appeared to be emerging organically in many places. If multiple entities you followed were all talking about something, the odds were that you would be interested, so Facebook would give that content a big boost.

The feature played right into the hands of motivated publishers. By recommending that users who followed one page like its near doppelgängers, a publisher could create overlapping audiences, using a dozen or more pages to synthetically mimic a hot story popping up everywhere at once. ... Zhang, working on the issue in 2020, found that the tactic was being used to benefit publishers (Business Insider, Daily Wire, a site named iHeartDogs), as well as political figures and just about anyone interested in gaming Facebook content distribution (Dairy Queen franchises in Thailand). Outsmarting Facebook didn’t require subterfuge. You could win a boost for your content by running it on ten different pages that were all administered by the same account.
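
A toy model of why the tactic works, under the simplifying assumption (mine, not the book’s) that the virality signal just counts how many distinct followed pages posted a link: a dozen near-duplicate pages run by one operator look identical, to that signal, to a story genuinely popping up everywhere at once.

```python
# Toy model of the "emerging organically in many places" boost described above.
# The scoring rule is an assumed simplification for illustration only.
from collections import defaultdict

def organic_emergence_score(posts: list[tuple[str, str]]) -> dict[str, int]:
    """Score each URL by the number of distinct pages that posted it."""
    pages_per_url = defaultdict(set)
    for page, url in posts:
        pages_per_url[url].add(page)
    return {url: len(pages) for url, pages in pages_per_url.items()}

# One genuinely popular story posted by twelve unrelated pages...
organic = [(f"unrelated_page_{i}", "https://example.com/real-story") for i in range(12)]
# ...versus one operator pushing the same link through twelve near-duplicate pages.
coordinated = [(f"clone_page_{i}", "https://example.com/spun-story") for i in range(12)]

print(organic_emergence_score(organic + coordinated))
# Both URLs score 12: the signal cannot distinguish a dozen independent pages
# from a dozen pages administered by the same account.
```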

It would be difficult to overstate the size of the blind spot that Zhang exposed when she found it ... Liftable was an archetype of that malleability. The company had begun as a vaguely Christian publisher of the low-calorie inspirational content that once thrived on Facebook. But News Feed was a fickle master, and by 2015 Facebook had changed its recommendations in ways that stopped rewarding things like “You Won’t Believe Your Eyes When You See This Phenomenally Festive Christmas Light Show.”

The algorithm changes sent an entire class of rival publishers like Upworthy and ViralNova into a terminal tailspin, but Liftable was a survivor. In addition to shifting toward stories with headlines like “Parents Furious: WATCH What Teacher Did to Autistic Son on Stage in Front of EVERYONE,” Liftable acquired WesternJournal.com and every large political Facebook page it could get its hands on.

This approach was hardly a secret. Despite Facebook rules prohibiting the sale of pages, Liftable issued press releases about its acquisition of “new assets”—Facebook pages with millions of followers. Once brought into the fold, the network of pages would blast out the same content.

Nobody inside or outside Facebook paid much attention to the craven amplification tactics and dubious content that publishers such as Liftable were adopting. Headlines like “The Sodomites Are Aiming for Your Kids” seemed more ridiculous than problematic. But Floyd and the publishers of such content knew what they were doing, and they capitalized on Facebook’s inattention and indifference.


The early work trying to figure out how to police publishers’ tactics had come from staffers attached to News Feed, but that team was broken up during the consolidation of integrity work under Guy Rosen ... “The News Feed integrity staffers were told not to work on this, that it wasn’t worth their time,” recalled product manager Elise Liu ... Facebook’s policies certainly made it seem like removing networks of fake accounts shouldn’t have been a big deal: the platform required users to go by their real names in the interests of accountability and safety. In practice, however, the rule that users were allowed a single account bearing their legal name generally went unenforced.
In the spring of 2018, the Civic team began agitating to address dozens of other networks of recalcitrant pages, including one tied to a site called “Right Wing News.” The network was run by Brian Kolfage, a U.S. veteran who had lost both legs and a hand to a missile in Iraq.

Harbath’s first reaction to Civic’s efforts to take down a prominent disabled veteran’s political media business was a flat no. She couldn’t dispute the details of his misbehavior—Kolfage was using fake or borrowed accounts to spam Facebook with links to vitriolic, sometimes false content. But she also wasn’t ready to shut him down for doing things that the platform had tacitly allowed.

“Facebook had let this guy build up a business using shady-ass tactics and scammy behavior, so there was some reluctance to basically say, like, ‘Sorry, the things that you’ve done every day for the last several years are no longer acceptable,’ ” she said. ... Other than simply giving up on enforcing Facebook’s rules, there wasn’t much left to try. Facebook’s Public Policy team remained uncomfortable with taking down a major domestic publisher for inauthentic amplification, and it made the Civic team prove that Kolfage’s content, in addition to his tactics, was objectionable. This hurdle became a permanent but undisclosed change in policy: cheating to manipulate Facebook’s algorithm wasn’t enough to get you kicked off the platform—you had to be promoting something bad, too.


Tests showed that the takedowns cut the amount of American political spam content by 20 percent overnight. Chakrabarti later admitted to his subordinates that he had been surprised that they had succeeded in taking a major action on domestic attempts to manipulate the platform. He had privately been expecting Facebook’s leadership to shut the effort down.
A staffer had shown Cox that a Brazilian legislator who supported the populist Jair Bolsonaro had posted a fabricated video of a voting machine that had supposedly been rigged in favor of his opponent. The doctored footage had already been debunked by fact-checkers, which normally would have provided grounds to bring the distribution of the post to an abrupt halt. But Facebook’s Public Policy team had long ago determined, after a healthy amount of discussion regarding the rule’s application to President Donald Trump, that government officials’ posts were immune from fact-checks. Facebook was therefore allowing false material that undermined Brazilians’ trust in democracy to spread unimpeded.

... Despite Civic’s concerns, voting in Brazil went smoothly. The same couldn’t be said for Civic’s colleagues over at WhatsApp. In the final days of the Brazilian election, viral misinformation transmitted by unfettered forwarding had blown up.


Supporters of the victorious Bolsonaro, who shared their candidate’s hostility toward homosexuality, were celebrating on Facebook by posting memes of masked men holding guns and bats. The accompanying Portuguese text combined the phrase “We’re going hunting” with a gay slur, and some of the posts encouraged users to join WhatsApp groups supposedly for that violent purpose. Engagement was through the roof, prompting Facebook’s systems to spread them even further.

While the company’s hate classifiers had been good enough to detect the problem, they weren’t reliable enough to automatically remove the torrent of hate. Rather than celebrating the race’s conclusion, Civic War Room staff put out an after-hours call for help from Portuguese-speaking colleagues. One polymath data scientist, a non-Brazilian who spoke great Portuguese and happened to be gay, answered the call.

For Civic staffers, an incident like this wasn’t a good time, but it wasn’t extraordinary, either. They had come to accept that unfortunate things like this popped up on the platform sometimes, especially around election time.

It took a glance at the Portuguese-speaking data scientist to remind Barnes how strange it was that viral horrors had become so routine on Facebook. The volunteer was hard at work just like everyone else, but he was quietly sobbing as he worked. “That moment is embedded in my mind,” Barnes said. “He’s crying, and it’s going to take the Operations team ten hours to clear this.”


India was a huge target for Facebook, which had already been locked out of China, despite much effort by Zuckerberg. The CEO had jogged unmasked through Tiananmen Square as a sign that he wasn’t bothered by Beijing’s notorious air pollution. He had asked President Xi Jinping, unsuccessfully, to choose a Chinese name for his first child. The company had even worked on a secret tool that would have allowed Beijing to directly censor the posts of Chinese users. All of it was to little avail: Facebook wasn’t getting into China. By 2019, Zuckerberg had changed his tune, saying that the company didn’t want to be there—Facebook’s commitment to free expression was incompatible with state repression and censorship. Whatever solace Facebook derived from adopting this moral stance, succeeding in India became all the more vital: If Facebook wasn’t the dominant platform in either of the world’s two most populous countries, how could it be the world’s most important social network?
Civic’s work got off to an easy start because the misbehavior was obvious. Taking only perfunctory measures to cover their tracks, all major parties were running networks of inauthentic pages, a clear violation of Facebook rules.

The BJP’s IT cell seemed the most successful. The bulk of the coordinated posting could be traced to websites and pages created by Silver Touch, the company that had built Modi’s reelection campaign app. With cumulative follower counts in excess of 10 million, the network hit both of Facebook’s agreed-upon standards for removal: they were using banned tricks to boost engagement and violating Facebook content policies by running fabricated, inflammatory quotes that allegedly exposed Modi opponents’ affection for rapists and that denigrated Muslims.

With documentation of all parties’ bad behavior in hand by early spring, the Civic staffers overseeing the project arranged an hour-long meeting in Menlo Park with Das and Harbath to make the case for a mass takedown. Das showed up forty minutes late and pointedly let the team know that, despite the ample cafés, cafeterias, and snack rooms at the office, she had just gone out for coffee. As the Civic Team’s Liu and Ghosh tried to rush through several months of research showing how the major parties were relying on banned tactics, Das listened impassively, then told them she’d have to approve any action they wanted to take.

The team pushed ahead with preparing to remove the offending pages. Mindful as ever of optics, the team was careful to package a large group of abusive pages together, some from the BJP’s network and others from the INC’s far less successful effort. With the help of Nathaniel Gleicher’s security team, a modest collection of Facebook pages traced to the Pakistani military was thrown in for good measure.

Even with the attempt at balance, the effort soon got bogged down. Higher-ups’ enthusiasm for the takedowns was so lacking that Chakrabarti and Harbath had to lobby Kaplan directly before they got approval to move forward.

“I think they thought it was going to be simpler,” Harbath said of the Civic team’s efforts.

Still, Civic kept pushing. On April 1, less than two weeks before voting was set to begin, Facebook announced that it had taken down more than one thousand pages and groups in separate actions against inauthentic behavior. In a statement, the company named the guilty parties: the Pakistani military, the IT cell of the Indian National Congress, and “individuals associated with an Indian IT firm, Silver Touch.”

For anyone who knew what was truly going on, the announcement was suspicious. Of the three parties cited, the pro-BJP propaganda network was by far the largest—and yet the party wasn’t being called out like the others.

Harbath and another person familiar with the mass takedown insisted this had nothing to do with favoritism. It was, they said, simply a mess. Where the INC had abysmally failed at subterfuge, making the attribution unavoidable under Facebook’s rules, the pro-BJP effort had been run through a contractor. That fig leaf gave the party some measure of deniability, even if it might fall short of plausible.

If the announcement’s omission of the BJP wasn’t a sop to India’s ruling party, what Facebook did next certainly seemed to be. Even as it was publicly mocking the INC for getting caught, the BJP was privately demanding that Facebook reinstate the pages the party claimed it had no connection to. Within days of the takedown, Das and Kaplan’s team in Washington were lobbying hard to reinstate several BJP-connected entities that Civic had fought so hard to take down. They won, and some of the BJP pages got restored.

With Civic and Public Policy at odds, the whole messy incident got kicked up to Zuckerberg to hash out. Kaplan argued that applying American campaign standards to India and many other international markets was unwarranted. Besides, no matter what Facebook did, the BJP was overwhelmingly favored to return to power when the election ended in May, and Facebook was seriously pissing it off.

Zuckerberg concurred with Kaplan’s qualms. The company should absolutely continue to crack down hard on covert foreign efforts to influence politics, he said, but in domestic politics the line between persuasion and manipulation was far less clear. Perhaps Facebook needed to develop new rules—ones with Public Policy’s approval.

The result was a near moratorium on attacking domestically organized inauthentic behavior and political spam. Imminent plans to remove illicitly coordinated Indonesian networks of pages, groups, and accounts ahead of upcoming elections were shut down. Civic’s wings were getting clipped.


By 2019, Jin’s standing inside the company was slipping. He had made a conscious decision to stop working so much, offloading parts of his job onto others, something that did not conform to Facebook’s culture. More than that, Jin had a habit of framing what the company did in moral terms. Was this good for users? Was Facebook truly making its products better?

Other executives were careful when bringing decisions to Zuckerberg to not frame decisions in terms of right or wrong. Everyone was trying to work collaboratively, to make a better product, and whatever Zuckerberg decided was good. Jin’s proposals didn’t carry that tone. He was unfailingly respectful, but he was also clear on what he considered the range of acceptable positions. Alex Schultz, the company’s chief marketing officer, once remarked to a colleague that the problem with Jin was that he made Zuckerberg feel like shit.

In July 2019, Jin wrote a memo titled “Virality Reduction as an Integrity Strategy” and posted it in a 4,200-person Workplace group for employees working on integrity problems. “There’s a growing set of research showing that some viral channels are used for bad more than they are used for good,” the memo began. “What should our principles be around how we approach this?” Jin went on to list, with voluminous links to internal research, how Facebook’s products routinely garnered higher growth rates at the expense of content quality and user safety. Features that produced marginal usage increases were disproportionately responsible for spam on WhatsApp, the explosive growth of hate groups, and the spread of false news stories via reshares, he wrote.

None of the examples were new. Each of them had been previously cited by Product and Research teams as discrete problems that would require either a design fix or extra enforcement. But Jin was framing them differently. In his telling, they were the inexorable result of Facebook’s efforts to speed up and grow the platform.

The response from colleagues was enthusiastic. “Virality is the goal of tenacious bad actors distributing malicious content,” wrote one researcher. “Totally on board for this,” wrote another, who noted that virality helped inflame anti-Muslim sentiment in Sri Lanka after a terrorist attack. “This is 100% direction to go,” Brandon Silverman of CrowdTangle wrote.

After more than fifty overwhelmingly positive comments, Jin ran into an objection from Jon Hegeman, the executive at News Feed who by then had been promoted to head of the team. Yes, Jin was probably right that viral content was disproportionately worse than nonviral content, Hegeman wrote, but that didn’t mean that the stuff was bad on average. ... Hegeman was skeptical. If Jin was right, he responded, Facebook should probably be taking drastic steps like shutting down all reshares, and the company wasn’t in much of a mood to try. “If we remove a small percentage of reshares from people’s inventory,” Hegeman wrote, “they decide to come back to Facebook less.”


If Civic had thought Facebook’s leadership would be rattled by the discovery that the company’s growth efforts had been making Facebook’s integrity problems worse, they were wrong. Not only was Zuckerberg hostile to future anti-growth work; he was beginning to wonder whether some of the company’s past integrity efforts were misguided.

Empowered to veto not just new integrity proposals but work that had long ago been approved, the Public Policy team began declaring that some failed to meet the company’s standards for “legitimacy.” Sparing Sharing, the demotion of content pushed by hyperactive users—already dialed down by 80 percent at its adoption—was set to be dialed back completely. (It was ultimately spared but further watered down.)

“We cannot assume links shared by people who shared a lot are bad,” a writeup of plans to undo the change said. (In practice, the effect of rolling back Sparing Sharing, even in its weakened form, was unambiguous. Views of “ideologically extreme content for users of all ideologies” would immediately rise by a double-digit percentage, with the bulk of the gains going to the far right.)

“Informed Sharing”—an initiative that had demoted content shared by people who hadn’t clicked on the posts in question, and which had proved successful in diminishing the spread of fake news—was also slated for decommissioning.

“Being less likely to share content after reading it is not a good indicator of integrity,” stated a document justifying the planned discontinuation.

A company spokeswoman denied numerous Integrity staffers’ contention that the Public Policy team had the ability to veto or roll back integrity changes, saying that Kaplan’s team was just one voice among many internally. But, regardless of who was calling the shots, the company’s trajectory was clear. Facebook wasn’t just slow-walking integrity work anymore. It was actively planning to undo large chunks of it.


Facebook could be certain of meeting its goals for the 2020 election if it was willing to slow down viral features. This could include imposing limits on reshares, message forwarding, and aggressive algorithmic amplification—the kind of steps that the Integrity teams throughout Facebook had been pushing to adopt for more than a year. The moves would be simple and cheap. Best of all, the methods had been tested and guaranteed success in combating longstanding problems.

The correct choice was obvious, Jin suggested, but Facebook seemed strangely unwilling to take it. It would mean slowing down the platform’s growth, the one tenet that was inviolable.

“Today the bar to ship a pro-Integrity win (that may be negative to engagement) often is higher than the bar to ship pro-engagement win (that may be negative to Integrity),” Jin lamented. If the situation didn’t change, he warned, it risked a 2020 election disaster from “rampant harmful virality.”


Even including downranking, “we estimate that we may action as little as 3–5% of hate and 0.6% of [violence and incitement] on Facebook, despite being the best in the world at it,” one presentation noted. Jin knew these stats, according to people who worked with him, but was too polite to emphasize them.
Company researchers used multiple methods to demonstrate QAnon’s gravitational pull, but the simplest and most visceral proof came from setting up a test account and seeing where Facebook’s algorithms took it.

After setting up a dummy account for “Carol”—a hypothetical forty-one-year-old conservative woman in Wilmington, North Carolina, whose interests included the Trump family, Fox News, Christianity, and parenting—the researcher watched as Facebook guided Carol from those mainstream interests toward darker places.

Within a day, Facebook’s recommendations had “devolved toward polarizing content.” Within a week, Facebook was pushing a “barrage of extreme, conspiratorial, and graphic content.” ... The researcher’s write-up included a plea for action: if Facebook was going to push content this hard, the company needed to get a lot more discriminating about what it pushed.

Later write-ups would acknowledge that such warnings went unheeded.


As executives filed out, Zuckerberg pulled Integrity’s Guy Rosen aside. “Why did you show me this in front of so many people?” Zuckerberg asked Rosen, who as Chakrabarti’s boss bore responsibility for his subordinate’s presentation landing on that day’s agenda.

Zuckerberg had good reason to be unhappy that so many executives had watched him being told in plain terms that the forthcoming election was shaping up to be a disaster. In the course of investigating Cambridge Analytica, regulators around the world had already subpoenaed thousands of pages of documents from the company and had pushed for Zuckerberg’s personal communications going back for the better part of the decade. Facebook had paid $5 billion to the U.S. Federal Trade Commission to settle one of the most prominent inquiries, but the threat of subpoenas and depositions wasn’t going away. ... If there had been any doubt that Civic was the Integrity division’s problem child, lobbing such a damning document straight onto Zuckerberg’s desk settled it. As Chakrabarti later informed his deputies, Rosen told him that Civic would henceforth be required to run such material through other executives first—strictly for organizational reasons, of course.

Chakrabarti didn’t take the reining in well. A few months later, he wrote a scathing appraisal of Rosen’s leadership as part of the company’s semiannual performance review. Facebook’s top integrity official was, he wrote, “prioritizing PR risk over social harm.”


Facebook still hadn’t given Civic the green light to resume the fight against domestically coordinated political manipulation efforts. Its fact-checking program was too slow to effectively shut down the spread of misinformation during a crisis. And the company still hadn’t addressed the “perverse incentives” resulting from News Feed’s tendency to favor divisive posts. “Remains unclear if we have a societal responsibility to reduce exposure to this type of content,” an updated presentation from Civic tartly stated.

“Samidh was trying to push Mark into making those decisions, but he didn’t take the bait,” Harbath recalled.


Cutler remarked that she would have pushed for Chakrabarti’s ouster if she didn’t expect a substantial portion of his team would mutiny. (The company denies Cutler said this.)
a British study had found that Instagram had the worst effect of any social media app on the health and well-being of teens and young adults.
The second was the death of Molly Russell, a fourteen-year-old from North London. Though “apparently flourishing,” as a later coroner’s inquest found, Russell had died by suicide in late 2017. Her death was treated as an inexplicable local tragedy until the BBC ran a report on social media activity in 2019. Russell had followed a large group of accounts that romanticized depression, self-harm, and suicide, and she had engaged with more than 2,100 macabre posts, mostly on Instagram. Her final login had come at 12:45 on the morning she died.

“I have no doubt that Instagram helped kill my daughter,” her father told the BBC.

Later research—both inside and outside Instagram—would demonstrate that a class of commercially motivated accounts had seized on depression-related content for the same reason that others focused on car crashes or fighting: the stuff pulled high engagement. But serving pro-suicide content to a vulnerable kid was clearly indefensible, and the platform pledged to remove and restrict the recommendation of such material, along with hiding hashtags like #Selfharm. Beyond exposing an operational failure, the extensive coverage of Russell’s death associated Instagram with rising concerns about teen mental health.


Though much attention, both inside and outside the company, had been paid to bullying, the most serious risks weren’t the result of people mistreating each other. Instead, the researchers wrote, harm arose when a user’s existing insecurities combined with Instagram’s mechanics. “Those who are dissatisfied with their lives are more negatively affected by the app,” one presentation noted, with the effects most pronounced among girls unhappy with their bodies and social standing.

There was a logic here, one that teens themselves described to researchers. Instagram’s stream of content was a “highlight reel,” at once real life and unachievable. This was manageable for users who arrived in a good frame of mind, but it could be poisonous for those who showed up vulnerable. Seeing comments about how great an acquaintance looked in a photo would make a user who was unhappy about her weight feel bad—but it didn’t make her stop scrolling.

“They often feel ‘addicted’ and know that what they’re seeing is bad for their mental health but feel unable to stop themselves,” the “Teen Mental Health Deep Dive” presentation noted. Field research in the U.S. and U.K. found that more than 40 percent of Instagram users who felt “unattractive” traced that feeling to Instagram. Among American teens who said they had thought about dying by suicide in the past month, 6 percent said the feeling originated on the platform. In the U.K., the number was double that.

“Teens who struggle with mental health say Instagram makes it worse,” the presentation stated. “Young people know this, but they don’t adopt different patterns.”

These findings weren’t dispositive, but they were unpleasant, in no small part because they made sense. Teens said—and researchers appeared to accept—that certain features of Instagram could aggravate mental health issues in ways beyond its social media peers. Snapchat had a focus on silly filters and communication with friends, while TikTok was devoted to performance. Instagram, though? It revolved around bodies and lifestyle. The company disowned these findings after they were made public, calling the researchers’ apparent conclusion that Instagram could harm users with preexisting insecurities unreliable. The company would dispute allegations that it had buried negative research findings as “plain false.”


Facebook had deployed a comment-filtering system to prevent the heckling of public figures such as Zuckerberg during livestreams, burying not just curse words and complaints but also substantive discussion of any kind. The system had been tuned for sycophancy, and poorly at that. The irony of heavily censoring comments on a speech about free speech wasn’t hard to miss.
CrowdTangle’s rundown of that Tuesday’s top content had, it turned out, included a butthole. This wasn’t a borderline picture of someone’s ass. It was an unmistakable, up-close image of an anus. It hadn’t just gone big on Facebook—it had gone biggest. Holding the number one slot, it was the lead item that executives had seen when they opened Silverman’s email. “I hadn’t put Mark or Sheryl on it, but I basically put everyone else on there,” Silverman said.

The picture was a thumbnail outtake from a porn video that had escaped Facebook’s automated filters. Such errors were to be expected, but was Facebook’s familiarity with its platform so poor that it wouldn’t notice when its systems started spreading that content to millions of people?

Yes, it unquestionably was.


In May, a data scientist working on integrity posted a Workplace note titled “Facebook Creating a Big Echo Chamber for ‘the Government and Public Health Officials Are Lying to Us’ Narrative—Do We Care?”

Just a few months into the pandemic, groups devoted to opposing COVID lockdown measures had become some of the most widely viewed on the platform, pushing false claims about the pandemic under the guise of political activism. Beyond serving as an echo chamber for alternating claims that the virus was a Chinese plot and that the virus wasn’t real, the groups served as a staging area for platform-wide assaults on mainstream medical information. ... An analysis showed these groups had appeared abruptly, and while they had ties to well-established anti-vaccination communities, they weren’t arising organically. Many shared near-identical names and descriptions, and an analysis of their growth showed that “a relatively small number of people” were sending automated invitations to “hundreds or thousands of users per day.”

Most of this didn’t violate Facebook’s rules, the data scientist noted in his post. Claiming that COVID was a plot by Bill Gates to enrich himself from vaccines didn’t meet Facebook’s definition of “imminent harm.” But, he said, the company should think about whether it was merely reflecting a widespread skepticism of COVID or creating one.

“This is severely impacting public health attitudes,” a senior data scientist responded. “I have some upcoming survey data that suggests some baaaad results.”


President Trump was gearing up for reelection and he took to his platform of choice, Twitter, to launch what would become a monthslong attempt to undermine the legitimacy of the November 2020 election. “There is no way (ZERO!) that Mail-In Ballots will be anything less than substantially fraudulent,” Trump wrote. As was standard for Trump’s tweets, the message was cross-posted on Facebook.

Under the tweet, Twitter included a small alert that encouraged users to “Get the facts about mail-in ballots.” Anyone clicking on it was informed that Trump’s allegations of a “rigged” election were false and there was no evidence that mail-in ballots posed a risk of fraud.

Twitter had drawn its line. Facebook now had to choose where it stood. Monika Bickert, Facebook’s head of Content Policy, declared that Trump’s post was right on the edge of the sort of misinformation about “methods for voting” that the company had already pledged to take down.

Zuckerberg didn’t have a strong position, so he went with his gut and left it up. But then he went on Fox News to attack Twitter for doing the opposite. “I just believe strongly that Facebook shouldn’t be the arbiter of truth of everything that people say online,” he told host Dana Perino. “Private companies probably shouldn’t be, especially these platform companies, shouldn’t be in the position of doing that.”

The interview caused some tumult inside Facebook. Why would Zuckerberg encourage Trump’s testing of the platform’s boundaries by declaring its tolerance of the post a matter of principle? The perception that Zuckerberg was kowtowing to Trump was about to get a lot worse. On the day of his Fox News interview, protests over the recent killing of George Floyd by Minneapolis police officers had gone national, and the following day the president tweeted that “when the looting starts, the shooting starts”—a notoriously menacing phrase used by a white Miami police chief during the civil rights era.

Declaring that Trump had violated its rules against glorifying violence, Twitter took the rare step of limiting the public’s ability to see the tweet—users had to click through a warning to view it, and they were prevented from liking or retweeting it.

Over on Facebook, where the message had been cross-posted as usual, the company’s classifier for violence and incitement estimated it had just under a 90 percent probability of breaking the platform’s rules—just shy of the threshold that would get a regular user’s post automatically deleted.

Trump wasn’t a regular user, of course. As a public figure, arguably the world’s most public figure, his account and posts were protected by dozens of different layers of safeguards.


Facebook drew up a list of accounts that were immune to some or all immediate enforcement actions. If those accounts appeared to break Facebook’s rules, the issue would go up the chain of Facebook’s hierarchy and a decision would be made on whether to take action against the account or not. Every social media platform ended up creating similar lists—it didn’t make sense to adjudicate complaints about heads of state, famous athletes, or persecuted human rights advocates in the same way the companies did with run-of-the-mill users. The problem was that, like a lot of things at Facebook, the company’s process got particularly messy.

For Facebook, the risks that arose from shielding too few users were seen as far greater than the risks of shielding too many. Erroneously removing a bigshot’s content could unleash public hell—in Facebook parlance, a “media escalation” or, that most dreaded of events, a “PR fire.” Hours or days of coverage would follow when Facebook erroneously removed posts from breast cancer victims or activists of all stripes. When it took down a photo of a risqué French magazine cover posted to Instagram by the American singer Rihanna in 2014, it nearly caused an international incident. As internal reviews of the system later noted, the incentive was to shield as heavily as possible any account with enough clout to cause undue attention.

No one team oversaw XCheck, and the term didn’t even have a specific definition. There were endless varieties and gradations applied to advertisers, posts, pages, and politicians, with hundreds of engineers around the company coding different flavors of protections and tagging accounts as needed. Eventually, at least 6 million accounts and pages were enrolled into XCheck, with an internal guide stating that an entity should be “newsworthy,” “influential or popular,” or “PR risky” to qualify. On Instagram, XCheck even covered popular animal influencers, including Doug the Pug.

Any Facebook employee who knew the ropes could go into the system and flag accounts for special handling. XCheck was used by more than forty teams inside the company. Sometimes there were records of how they had deployed it and sometimes there were not. Later reviews would find that XCheck’s protections had been granted to “abusive accounts” and “persistent violators” of Facebook’s rules.

The job of giving a second review to violating content from high-profile users would require a sizable team of full-time employees. Facebook simply never staffed one. Flagged posts were put into a queue that no one ever considered, sweeping already once-validated complaints under the digital rug. “Because there was no governance or rigor, those queues might as well not have existed,” recalled someone who worked with the system. “The interest was in protecting the business, and that meant making sure we don’t take down a whale’s post.”

The stakes could be high. XCheck protected high-profile accounts, including in Myanmar, where public figures were using Facebook to incite genocide. It shielded the account of British far-right figure Tommy Robinson, an investigation by Britain’s Channel Four revealed in 2018.

One of the most explosive cases was that of Brazilian soccer star Neymar, whose 150 million Instagram followers placed him among the platform’s top twenty influencers. After a woman accused Neymar of rape in 2019, he accused the woman of extorting him and posted Facebook and Instagram videos defending himself—and showing viewers his WhatsApp correspondence with his accuser, which included her name and nude photos of her. Facebook’s procedure for handling the posting of “non-consensual intimate imagery” was simple: delete it. But Neymar was protected by XCheck. For more than a day, the system blocked Facebook’s moderators from removing the video. An internal review of the incident found that 56 million Facebook and Instagram users saw what Facebook described in a separate document as “revenge porn,” exposing the woman to what an employee referred to in the review as “ongoing abuse” from other users.

Facebook’s operational guidelines stipulate that not only should unauthorized nude photos be deleted, but people who post them should have their accounts deleted. Faced with the prospect of scrubbing one of the world’s most famous athletes from its platform, Facebook blinked.

“After escalating the case to leadership,” the review said, “we decided to leave Neymar’s accounts active, a departure from our usual ‘one strike’ profile disable policy.”

Facebook knew that providing preferential treatment to famous and powerful users was problematic at best and unacceptable at worst. “Unlike the rest of our community, these people can violate our standards without any consequences,” a 2019 review noted, calling the system “not publicly defensible.”

Nowhere did XCheck interventions occur more than in American politics, especially on the right.


When a high-enough-profile account was conclusively found to have broken Facebook’s rules, the company would delay taking action for twenty-four hours, during which it tried to convince the offending party to remove the offending post voluntarily. The program served as an invitation for privileged accounts to play at the edge of Facebook’s tolerance. If they crossed the line, they could simply take it back, having already gotten most of the traffic they would receive anyway. (Along with Diamond and Silk, every member of Congress ended up being granted the self-remediation window.)

Sometimes Kaplan himself got directly involved. According to documents first obtained by BuzzFeed, the global head of Public Policy was not above either pushing employees to lift penalties against high-profile conservatives for spreading false information or leaning on Facebook’s fact-checkers to alter their verdicts.

An understanding began to dawn among the politically powerful: if you mattered enough, Facebook would often cut you slack. Prominent entities rightly treated any significant punishment as a sign that Facebook didn’t consider them worthy of white-glove treatment. To prove the company wrong, they would scream as loudly as they could in response.

“Some of these people were real gems,” recalled Harbath. In Facebook’s Washington, DC, office, staffers would explicitly justify blocking penalties against “Activist Mommy,” a Midwestern Christian account with a penchant for anti-gay rhetoric, because she would immediately go to the conservative press.

Facebook’s fear of messing up with a major public figure was so great that some achieved a status beyond XCheck and were whitelisted altogether, rendering even their most vile content immune from penalties, downranking, and, in some cases, even internal review.


Other Civic colleagues and Integrity staffers piled into the comments section to concur. “If our goal, was say something like: have less hate, violence etc. on our platform to begin with instead of remove more hate, violence etc. our solutions and investments would probably look quite different,” one wrote.

Rosen was getting tired of dealing with Civic. Zuckerberg, who famously did not like to revisit decisions once they were made, had already dictated his preferred approach: automatically remove content if Facebook’s classifiers were highly confident that it broke the platform’s rules and take “soft” actions such as demotions when the systems predicted a violation was more likely than not. These were the marching orders and the only productive path forward was to diligently execute them.


The week before, the Wall Street Journal had published a story my colleague Newley Purnell and I cowrote about how Facebook had exempted a firebrand Hindu politician from its hate speech enforcement. There had been no question that Raja Singh, a member of the Telangana state parliament, was inciting violence. He gave speeches calling for Rohingya immigrants who fled genocide in Myanmar to be shot, branded all Indian Muslims traitors, and threatened to raze mosques. He did these things while building an audience of more than 400,000 followers on Facebook. Earlier that year, police in Hyderabad had placed him under house arrest to prevent him from leading supporters to the scene of recent religious violence.

That Facebook did nothing in the face of such rhetoric could have been due to negligence—there were a lot of firebrand politicians offering a lot of incitement in a lot of different languages around the world. But in this case, Facebook was well aware of Singh’s behavior. Indian civil rights groups had brought him to the attention of staff in both Delhi and Menlo Park as part of their efforts to pressure the company to act against hate speech in the country.

There was no question whether Singh qualified as a “dangerous individual,” someone who would normally be barred from having a presence on Facebook’s platforms. Despite the internal conclusion that Singh and several other Hindu nationalist figures were creating a risk of actual bloodshed, their designation as hate figures had been blocked by Ankhi Das, Facebook’s head of Indian Public Policy—the same executive who had lobbied years earlier to reinstate BJP-associated pages after Civic had fought to take them down.

Das, whose job included lobbying India’s government on Facebook’s behalf, didn’t bother trying to justify protecting Singh and other Hindu nationalists on technical or procedural grounds. She flatly said that designating them as hate figures would anger the government, and the ruling BJP, so the company would not be doing it. ... Following our story, Facebook India’s then–managing director Ajit Mohan assured the company’s Muslim employees that we had gotten it wrong. Facebook removed hate speech “as soon as it became aware of it” and would never compromise its community standards for political purposes. “While we know there is more to do, we are making progress every day,” he wrote.

It was after we published the story that Kiran (a pseudonym) reached out to me. They wanted to make clear that our story in the Journal had just scratched the surface. Das’s ties with the government were far tighter than we understood, they said, and Facebook India was protecting entities much more dangerous than Singh.


“Hindus, come out. Die or kill,” one prominent activist had declared during a Facebook livestream, according to a later report by retired Indian civil servants. The ensuing violence left fifty-three people dead and swaths of northeastern Delhi burned.
The researcher set up a dummy account while traveling. Because the platform factored a user’s geography into content recommendations, she and a colleague noted in a writeup of her findings, it was the only way to get a true read on what the platform was serving up to a new Indian user.

Ominously, her summary of what Facebook had recommended to their notional twenty-one-year-old Indian woman began with a trigger warning for graphic violence. While Facebook’s push of American test users toward conspiracy theories had been concerning, the Indian version was dystopian.

“In the 3 weeks since the account has been opened, by following just this recommended content, the test user’s News Feed has become a near constant barrage of polarizing nationalist content, misinformation, and violence and gore,” the note stated. The dummy account’s feed had turned especially dark after border skirmishes between Pakistan and India in early 2019. Amid a period of extreme military tensions, Facebook funneled the user toward groups filled with content promoting full-scale war and mocking images of corpses with laughing emojis.

This wasn’t a case of bad posts slipping past Facebook’s defenses, or one Indian user going down a nationalistic rabbit hole. What Facebook was recommending to the young woman had been bad from the start. The platform had pushed her to join groups clogged with images of corpses, watch purported footage of fictional air strikes, and congratulate nonexistent fighter pilots on their bravery.

“I’ve seen more images of dead people in the past three weeks than I’ve seen in my entire life, total,” the researcher wrote, noting that the platform had allowed falsehoods, dehumanizing rhetoric, and violence to “totally take over during a major crisis event.” Facebook needed to consider not only how its recommendation systems were affecting “users who are different from us,” she concluded, but rethink how it built its products for “non-US contexts.”

India was not an outlier. Outside of English-speaking countries and Western Europe, users routinely saw more cruelty, engagement bait, and falsehoods. Perhaps differing cultural senses of propriety explained some of the gap, but a lot clearly stemmed from differences in investment and concern.


This wasn’t supposed to be legal in the Gulf under the gray-market labor sponsorship system known as kafala, but the internet had removed the friction from buying people. Undercover reporters from BBC Arabic posed as a Kuwaiti couple and negotiated to buy a sixteen-year-old girl whose seller boasted about never allowing her to leave the house.

Everyone told the BBC they were horrified. Kuwaiti police rescued the girl and sent her home. Apple and Google pledged to root out the abuse, and the bartering apps cited in the story deleted their “domestic help” sections. Facebook pledged to take action and deleted a popular hashtag used to advertise maids for sale.

After that, the company largely dropped the matter. But Apple turned out to have a longer attention span. In October, after sending Facebook numerous examples of ongoing maid sales via Instagram, it threatened to remove Facebook’s products from its App Store.

Unlike human trafficking, this, to Facebook, was a real crisis.

“Removing our applications from Apple’s platforms would have had potentially severe consequences to the business, including depriving millions of users of access to IG & FB,” an internal report on the incident stated.

With alarm bells ringing at the highest levels, the company found and deleted an astonishing 133,000 posts, groups, and accounts related to the practice within days. It also performed a quick revamp of its policies, reversing a previous rule allowing the sale of maids through “brick and mortar” businesses. (To avoid upsetting the sensibilities of Gulf State “partners,” the company had previously permitted the advertising and sale of servants by businesses with a physical address.) Facebook also committed to “holistic enforcement against any and all content promoting domestic servitude,” according to the memo.

Apple lifted its threat, but again Facebook wouldn’t live up to its pledges. Two years later, in late 2021, an Integrity staffer would write up an investigation titled “Domestic Servitude: This Shouldn’t Happen on FB and How We Can Fix It.” Focused on the Philippines, the memo described how fly-by-night employment agencies were recruiting women with “unrealistic promises” and then selling them into debt bondage overseas. If Instagram was where domestic servants were sold, Facebook was where they were recruited.

Accessing the direct-messaging inboxes of the placing agencies, the staffer found Filipina domestic servants pleading for help. Some reported rape or sent pictures of bruises from being hit. Others hadn’t been paid in months. Still others reported being locked up and starved. The labor agencies didn’t help.

The passionately worded memo, and others like it, listed numerous things the company could do to prevent the abuse. There were improvements to classifiers, policy changes, and public service announcements to run. Using machine learning, Facebook could identify Filipinas who were looking for overseas work and then inform them of how to spot red flags in job postings. In Persian Gulf countries, Instagram could run PSAs about workers’ rights.

These things largely didn’t happen for a host of reasons. One memo noted a concern that, if worded too strongly, Arabic-language PSAs admonishing against the abuse of domestic servants might “alienate buyers” of them. But the main obstacle, according to people familiar with the team, was simply resources. The team devoted full-time to human trafficking—which included not just the smuggling of people for labor and sex but also the sale of human organs—amounted to a half-dozen people worldwide. The team simply wasn’t large enough to knock this stuff out.


“We’re largely blind to problems on our site,” Leach’s presentation said of Ethiopia.

Facebook employees produced a lot of internal work like this: declarations that the company had gotten in over its head, unable to provide even basic remediation to potentially horrific problems. Events on the platform could foreseeably lead to loss of life and almost certainly did, according to human rights groups monitoring Ethiopia. Meareg Amare, a university lecturer in Addis Ababa, was murdered outside his home one month after a post went viral, receiving 35,000 likes, listing his home address and calling for him to be attacked. Facebook failed to remove it. His family is now suing the company.

As it so often did, the company was choosing growth over quality. Efforts to expand service to poorer and more isolated places would not wait for user protections to catch up, and, even in countries at “dire” risk of mass atrocities, the At Risk Countries team needed approval to do things that harmed engagement.


Documents and transcripts of internal meetings among the company’s American staff show employees struggling to explain why Facebook wasn’t following its normal playbook when dealing with hate speech, the coordination of violence, and government manipulation in India. Employees in Menlo Park discussed the BJP’s promotion of the “Love Jihad” lie. They met with human rights organizations that documented the violence committed by the platform’s cow-protection vigilantes. And they tracked efforts by the Indian government and its allies to manipulate the platform via networks of accounts. Yet nothing changed.

“We have a lot of business in India, yeah. And we have connections with the government, I guess, so there are some sensitivities around doing a mitigation in India,” one employee told another about the company’s protracted failure to address abusive behavior by an Indian intelligence service.

During another meeting, a team working on what it called the problem of “politicized hate” informed colleagues that the BJP and its allies were coordinating both the “Love Jihad” slander and another hashtag, #CoronaJihad, premised on the idea that Muslims were infecting Hindus with COVID via halal food.

The Rashtriya Swayamsevak Sangh, or RSS—the umbrella Hindu nationalist movement of which the BJP is the political arm—was promoting these slanders through 6,000 or 7,000 different entities on the platform, with the goal of portraying Indian Muslims as subhuman, the presenter explained. Some of the posts said that the Quran encouraged Muslim men to rape their female family members.

“What they’re doing really permeates Indian society,” the presenter noted, calling it part of a “larger war.”

A colleague at the meeting asked the obvious question. Given the company’s conclusive knowledge of the coordinated hate campaign, why hadn’t the posts or accounts been taken down?

“Ummm, the answer that I’ve received for the past year and a half is that it’s too politically sensitive to take down RSS content as hate,” the presenter said.

Nothing needed to be said in response.

“I see your face,” the presenter said. “And I totally agree.”


One incident in particular, involving a local political candidate, stuck out. As Kiran recalled it, the guy was a little fish, a Hindu nationalist activist who hadn’t achieved Raja Singh’s six-digit follower count but was still a provocateur. The man’s truly abhorrent behavior had been repeatedly flagged by lower-level moderators, but somehow the company always seemed to give it a pass.

This time was different. The activist had streamed a video in which he and some accomplices kidnapped a man who, they informed the camera, had killed a cow. They took their captive to a construction site and assaulted him while Facebook users heartily cheered in the comments section.


Zuckerberg launched an internal campaign against social media overenforcement. Ordering the creation of a team dedicated to preventing wrongful content takedowns, Zuckerberg demanded regular briefings on its progress from senior employees. He also suggested that, instead of rigidly enforcing platform rules on content in Groups, Facebook should defer more to the sensibilities of the users in them. In response, a staffer proposed entirely exempting private groups from enforcement for “low-tier hate speech.”
The stuff was viscerally terrible—people clamoring for lynchings and civil war. One group was filled with “enthusiastic calls for violence every day.” Another top group claimed it was set up by Trump-supporting patriots but was actually run by “financially motivated Albanians” directing a million views daily to fake news stories and other provocative content.

The comments were often worse than the posts themselves, and even this was by design. The content of the posts would be incendiary but fall just shy of Facebook’s boundaries for removal—it would be bad enough, however, to harvest user anger, classic “hate bait.” The administrators were professionals, and they understood the platform’s weaknesses every bit as well as Civic did. In News Feed, anger would rise like a hot-air balloon, and such comments could take a group to the top.

Public Policy had previously refused to act on hate bait


“We have heavily overpromised regarding our ability to moderate content on the platform,” one data scientist wrote to Rosen in September. “We are breaking and will continue to break our recent promises.”
The longstanding conflicts between Civic and Facebook’s Product, Policy, and leadership teams had boiled over in the wake of the “looting/shooting” furor, and executives—minus Chakrabarti—had privately begun discussing how to address what was now unquestionably viewed as a rogue Integrity operation. Civic, with its dedicated engineering staff, hefty research operation, and self-chosen mission statement, was on the chopping block.
The group had grown to more than 360,000 members less than twenty-four hours later when Facebook took it down, citing “extraordinary measures.” Pushing false claims of election fraud to a mass audience at a time when armed men were calling for a halt to vote counting outside tabulation centers was an obvious problem, and one that the company knew was only going to get bigger. Stop the Steal had an additional 2.1 million users pending admission to the group when Facebook pulled the plug.

Facebook’s leadership would describe Stop the Steal’s growth as unprecedented, though Civic staffers could be forgiven for not sharing their sense of surprise.


Zuckerberg had accepted the deletion under emergency circumstances, but he didn’t want the Stop the Steal group’s removal to become a precedent for a backdoor ban on false election claims. During the run-up to Election Day, Facebook had removed only lies about the actual voting process—stuff like “Democrats vote on Wednesday” and “People with outstanding parking tickets can’t go to the polls.” Noting the thin distinction between the claim that votes wouldn’t be counted and that they wouldn’t be counted accurately, Chakrabarti had pushed to take at least some action against baseless election fraud claims.

Civic hadn’t won that fight, but with the Stop the Steal group spawning dozens of similarly named copycats—some of which also accrued six-figure memberships—the threat of further organized election delegitimization efforts was obvious.

Barred from shutting down the new entities, Civic assigned staff to at least study them. Staff also began tracking top delegitimization posts, which were earning tens of millions of views, for what one document described as “situational awareness.” A later analysis found that as much as 70 percent of Stop the Steal content was coming from known “low news ecosystem quality” pages, the commercially driven publishers that Facebook’s News Feed integrity staffers had been trying to fight for years.


Zuckerberg overruled both Facebook’s Civic team and its head of counterterrorism. Shortly after the Associated Press called the presidential election for Joe Biden on November 7—the traditional marker for the race being definitively over—Molly Cutler assembled roughly fifteen executives that had been responsible for the company’s election preparation. Citing orders from Zuckerberg, she said the election delegitimization monitoring was to immediately stop.
On December 17, a data scientist flagged that a system responsible for either deleting or restricting high-profile posts that violated Facebook’s rules had stopped doing so. Colleagues ignored it, assuming that the problem was just a “logging issue”—meaning the system still worked, it just wasn’t recording its actions. On the list of Facebook’s engineering priorities, fixing that didn’t rate.

In fact, the system truly had failed, in early November. Between then and when engineers realized their error in mid-January, the system had given a pass to 3,100 highly viral posts that should have been deleted or labeled “disturbing.”

Glitches like that happened all the time at Facebook. Unfortunately, this one produced an additional 8 billion “regrettable” views globally, instances in which Facebook had shown users content that it knew was trouble. The company would later say that only a small minority of the 8 billion “regrettable” content views touched on American politics, and that the mistake was immaterial to subsequent events. A later review of Facebook’s post-election work tartly described the flub as a “lowlight” of the platform’s 2020 election performance, though the company disputes that it had a meaningful impact. At least 7 billion of the bad content views were international, the company says, and of the American material only a portion dealt with politics. Overall, a spokeswoman said, the company remains proud of its pre- and post-election safety work.


Zuckerberg vehemently disagreed with people who said that the COVID vaccine was unsafe, but he supported their right to say it, including on Facebook. ... Under Facebook’s policy, health misinformation about COVID was to be removed only if it posed an imminent risk of harm, such as a post telling infected people to drink bleach ... A researcher randomly sampled English-language comments containing phrases related to COVID and vaccines. A full two-thirds were anti-vax. The researcher’s memo compared that figure to public polling showing the prevalence of anti-vaccine sentiment in the U.S.—it was a full 40 points lower.

Additional research found that a small number of “big whales” was behind a large portion of all anti-vaccine content on the platform. Of 150,000 posters in Facebook groups that were eventually disabled for COVID misinformation, just 5 percent were producing half of all posts. And just 1,400 users were responsible for inviting half of all members. “We found, like many problems at FB, this is a head-heavy problem with a relatively few number of actors creating a large percentage of the content and growth,” Facebook researchers would later note.

One of the anti-vax brigade’s favored tactics was to piggyback on posts from entities like UNICEF and the World Health Organization encouraging vaccination, which Facebook was promoting free of charge. Anti-vax activists would respond with misinformation or derision in the comments section of these posts, then boost one another’s hostile comments toward the top slot


Even as Facebook prepared for virally driven crises to become routine, the company’s leadership was becoming increasingly comfortable absolving its products of responsibility for feeding them. By the spring of 2021, it wasn’t just Boz arguing that January 6 was someone else’s problem. Sandberg suggested that January 6 was “largely organized on platforms that don’t have our abilities to stop hate.” Zuckerberg told Congress that they need not cast blame beyond Trump and the rioters themselves. “The country is deeply divided right now and that is not something that tech alone can fix,” he said.

In some instances, the company appears to have publicly cited research in what its own staff had warned were inappropriate ways. A June 2020 review of both internal and external research had warned that the company should avoid arguing that higher rates of polarization among the elderly—the demographic that used social media least—were proof that Facebook wasn’t causing polarization.

Though the argument was favorable to Facebook, researchers wrote, Nick Clegg should avoid citing it in an upcoming opinion piece because “internal research points to an opposite conclusion.” Facebook, it turned out, fed false information to senior citizens at such a massive rate that they consumed far more of it despite spending less time on the platform. Rather than vindicating Facebook, the researchers wrote, “the stronger growth of polarization for older users may be driven in part by Facebook use.”

All the researchers wanted was for executives to avoid parroting a claim that Facebook knew to be wrong, but they didn’t get their wish. The company says the argument never reached Clegg. When he published a March 31, 2021, Medium essay titled “You and the Algorithm: It Takes Two to Tango,” he cited the internally debunked claim among the “credible recent studies” disproving that “we have simply been manipulated by machines all along.” (The company would later say that the appropriate takeaway from Clegg’s essay on polarization was that “research on the topic is mixed.”)

Such bad-faith arguments sat poorly with researchers who had worked on polarization and analyses of Stop the Steal, but Clegg was a former politician hired to defend Facebook, after all. The real shock came from an internally published research review written by Chris Cox.

Titled “What We Know About Polarization,” the April 2021 Workplace memo noted that the subject remained “an albatross public narrative,” with Facebook accused of “driving societies into contexts where they can’t trust each other, can’t share common ground, can’t have conversations about issues, and can’t share a common view on reality.”

But Cox and his coauthor, Facebook Research head Pratiti Raychoudhury, were happy to report that a thorough review of the available evidence showed that this “media narrative” was unfounded. The evidence that social media played a contributing role in polarization, they wrote, was “mixed at best.” Though Facebook likely wasn’t at fault, Cox and Raychoudhury wrote, the company was still trying to help, in part by encouraging people to join Facebook groups. “We believe that groups are on balance a positive, depolarizing force,” the review stated.

The writeup was remarkable for its choice of sources. Cox’s note cited stories by New York Times columnists David Brooks and Ezra Klein alongside early publicly released Facebook research that the company’s own staff had concluded was no longer accurate. At the same time, it omitted the company’s past conclusions, affirmed in another literature review just ten months before, that Facebook’s recommendation systems encouraged bombastic rhetoric from publishers and politicians, as well as previous work finding that seeing vicious posts made users report “more anger towards people with different social, political, or cultural beliefs.” While nobody could reliably say how Facebook altered users’ off-platform behavior, how the company shaped their social media activity was accepted fact. “The more misinformation a person is exposed to on Instagram the more trust they have in the information they see on Instagram,” company researchers had concluded in late 2020.

In a statement, the company called the presentation “comprehensive” and noted that partisan divisions in society arose “long before platforms like Facebook even existed.” For staffers that Cox had once assigned to work on addressing known problems of polarization, his note was a punch to the gut.


In 2016, the New York Times had reported that Facebook was quietly working on a censorship tool in an effort to gain entry to the Chinese market. While the story was a monster, it didn’t come as a surprise to many people inside the company. Four months earlier, an engineer had discovered that another team had modified a spam-fighting tool in a way that would allow an outside party control over content moderation in specific geographic regions. In response, he had resigned, leaving behind a badge post correctly surmising that the code was meant to loop in Chinese censors.

With a literary mic drop, the post closed out with a quote on ethics from Charlotte Brontë’s Jane Eyre: “Laws and principles are not for the times when there is no temptation: they are for such moments as this, when body and soul rise in mutiny against their rigour; stringent are they; inviolate they shall be. If at my individual convenience I might break them, what would be their worth?”

Garnering 1,100 reactions, 132 comments, and 57 shares, the post took the program from top secret to open secret. Its author had just pioneered a new template: the hard-hitting Facebook farewell.

That particular farewell came during a time when Facebook’s employee satisfaction surveys were generally positive, before the time of endless crisis, when societal concerns became top of mind. In the intervening years, Facebook had hired a massive base of Integrity employees to work on those issues, and seriously pissed off a nontrivial portion of them.

Consequently, some badge posts began to take on a more mutinous tone. Staffers who had done groundbreaking work on radicalization, human trafficking, and misinformation would summarize both their accomplishments and where they believed the company had come up short on technical and moral grounds. Some broadsides against the company ended on a hopeful note, including detailed, jargon-light instructions for how, in the future, their successors could resurrect the work.

These posts were gold mines for Haugen, connecting product proposals, experimental results, and ideas in ways that would have been impossible for an outsider to re-create. She photographed not just the posts themselves but the material they linked to, following the threads to other topics and documents. A half dozen were truly incredible, unauthorized chronicles of Facebook’s dawning understanding of the way its design determined what its users consumed and shared. The authors of these documents hadn’t been trying to push Facebook toward social engineering—they had been warning that the company had already wandered into doing so and was now neck deep.


The researchers’ best understanding was summarized this way: “We make body image issues worse for one in three teen girls.”
In 2020, Instagram’s Well-Being team had run a study of massive scope, surveying 100,000 users in nine countries about negative social comparison on Instagram. The researchers then paired the answers with individualized data on how each user who took the survey had behaved on Instagram, including how and what they posted. They found that, for a sizable minority of users, especially those in Western countries, Instagram was a rough place. Ten percent reported that they “often or always” felt worse about themselves after using the platform, and a quarter believed Instagram made negative comparison worse.

Their findings were incredibly granular. They found that fashion and beauty content produced negative feelings in ways that adjacent content like fitness did not. They found that “people feel worse when they see more celebrities in feed,” and that Kylie Jenner seemed to be unusually triggering, while Dwayne “The Rock” Johnson was no trouble at all. They found that people judged themselves far more harshly against friends than celebrities. A movie star’s post needed 10,000 likes before it caused social comparison, whereas, for a peer, the number was ten.

In order to confront these findings, the Well-Being team suggested that the company cut back on recommending celebrities for people to follow, or reweight Instagram’s feed to include less celebrity and fashion content, or de-emphasize comments about people’s appearance. As a fellow employee noted in response to summaries of these proposals on Workplace, the Well-Being team was suggesting that Instagram become less like Instagram.

“Isn’t that what IG is mostly about?” the man wrote. “Getting a peek at the (very photogenic) life of the top 0.1%? Isn’t that the reason why teens are on the platform?”


“We are practically not doing anything,” the researchers had written, noting that Instagram wasn’t currently able to stop itself from promoting underweight influencers and aggressive dieting. A test account that signaled an interest in eating disorder content filled up with pictures of thigh gaps and emaciated limbs.

The problem would be relatively easy for outsiders to document. Instagram was, the research warned, “getting away with it because no one has decided to dial into it.”


He began the presentation by noting that 51 percent of Instagram users reported having a “bad or harmful” experience on the platform in the previous seven days. But only 1 percent of those users reported the objectionable content to the company, and Instagram took action in 2 percent of those cases. The math meant that the platform remediated only 0.02 percent of what upset users—just one bad experience out of every 5,000.
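(To spell out the arithmetic, using only the figures quoted above: a 1 percent report rate times a 2 percent action rate is 0.01 × 0.02 = 0.0002, i.e., roughly 0.02 percent of bad experiences remediated, or about 1 in 5,000.)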

“The numbers are probably similar on Facebook,” he noted, calling the statistics evidence of the company’s failure to understand the experiences of users such as his own daughter. Now sixteen, she had recently been told to “get back to the kitchen” after she posted about cars, Bejar said, and she continued receiving the unsolicited dick pics she had been getting since the age of fourteen. “I asked her why boys keep doing that? She said if the only thing that happens is they get blocked, why wouldn’t they?”

Two years of research had confirmed that Joanna Bejar’s logic was sound. On a weekly basis, 24 percent of all Instagram users between the ages of thirteen and fifteen received unsolicited advances, Bejar informed the executives. Most of that abuse didn’t violate the company’s policies, and Instagram rarely caught the portion that did.


nothing highlighted the costs better than a Twitter bot set up by New York Times reporter Kevin Roose. Using methodology created with the help of a CrowdTangle staffer, Roose found a clever way to put together a daily top ten of the platform’s highest-engagement content in the United States, producing a leaderboard that demonstrated how thoroughly partisan publishers and viral content aggregators dominated the engagement signals that Facebook valued most.

The degree to which that single automated Twitter account got under the skin of Facebook’s leadership would be difficult to overstate. Alex Schultz, the VP who oversaw Facebook’s Growth team, was especially incensed—partly because he considered raw engagement counts to be misleading, but more because it was Facebook’s own tool reminding the world every morning at 9:00 a.m. Pacific that the platform’s content was trash.

“The reaction was to prove the data wrong,” recalled Brian Boland. But efforts to employ other methodologies only produced top ten lists that were nearly as unflattering. Schultz began lobbying to kill off CrowdTangle altogether, replacing it with periodic top content reports of its own design. That would still be more transparency than any of Facebook’s rivals offered, Schultz noted

...

Schultz handily won the fight. In April 2021, Silverman convened his staff on a conference call and told them that CrowdTangle’s team was being disbanded. ... “Boz would just say, ‘You’re completely off base,’ ” Boland said. “Data wins arguments at Facebook, except for this one.”


When the company issued its response later in May, I read the document with a clenched jaw. Facebook had agreed to grant the board’s request for information about XCheck and “any exceptional processes that apply to influential users.”

...

“We want to make clear that we remove content from Facebook, no matter who posts it,” Facebook’s response to the Oversight Board read. “Cross check simply means that we give some content from certain Pages or Profiles additional review.”

There was no mention of whitelisting, of C-suite interventions to protect famous athletes, of queues of likely violating posts from VIPs that never got reviewed. Although our documents showed that at least 7 million of the platform’s most prominent users were shielded by some form of XCheck, Facebook assured the board that it applied to only “a small number of decisions.” The only XCheck-related request that Facebook didn’t address was for data that might show whether XChecked users had received preferential treatment.

“It is not feasible to track this information,” Facebook responded, neglecting to mention that it was exempting some users from enforcement entirely.


“I’m sure many of you have found the recent coverage hard to read because it just doesn’t reflect the company we know,” he wrote in a note to employees that was also shared on Facebook. The allegations didn’t even make sense, he wrote: “I don’t know any tech company that sets out to build products that make people angry or depressed.”

Zuckerberg said he worried the leaks would discourage the tech industry at large from honestly assessing their products’ impact on the world, in order to avoid the risk that internal research might be used against them. But he assured his employees that their company’s internal research efforts would stand strong. “Even though it might be easier for us to follow that path, we’re going to keep doing research because it’s the right thing to do,” he wrote.

By the time Zuckerberg made that pledge, research documents were already disappearing from the company’s internal systems. Had a curious employee wanted to double-check Zuckerberg’s claims about the company’s polarization work, for example, they would have found that key research and experimentation data had become inaccessible.

The crackdown had begun.


One memo required researchers to seek special approval before delving into anything on a list of topics requiring “mandatory oversight”—even as a manager acknowledged that the company did not maintain such a list.
The “Narrative Excellence” memo and its accompanying notes and charts were a guide to producing documents that reporters like me wouldn’t be excited to see. Unfortunately, as a few bold user experience researchers noted in the replies, achieving Narrative Excellence was all but incompatible with succeeding at their jobs. Writing things that were “safer to be leaked” meant writing things that would have less impact.

Appendix: non-statements

I really like the "non-goals" section of design docs. I think the analogous non-statements section of a doc like this is much less valuable because the top-level non-statements can generally be inferred by reading this doc, whereas top-level non-goals often add information, but I figured I'd try this out anyway.


  1. when Costco was smaller, I would've put Costco here instead of Best Buy, but as they've gotten bigger, I've noticed that their quality has gone down. It's really striking how (relatively) frequently I find sealed items, like cheese, that have gone bad long before their "best by" date, or items that are just totally broken. This doesn't appear to have anything to do with any particular location since I moved almost annually for close to a decade and observed this decline across many different locations (because I was moving, at first, I thought that I got unlucky with where I'd moved to, but as I tried locations in various places, I realized that this wasn't specific to any location and it seems to have impacted stores in both the U.S. and Canada). [return]
  2. when the WSJ looked at leaked internal Meta documents, they found, among other things, that Meta estimated that 100k minors per day "received photos of adult genitalia or other sexually abusive content". Of course, smart contrarians will argue that this is totally normal, e.g., two of the first few comments on HN were about how there's nothing particularly wrong with this. Sure, it's bad for children to get harassed, but "it can happen on any street corner", "what's the base rate to compare against", etc.

    Very loosely, if we're liberal, we might estimate that Meta had 2.5B DAU in early 2021 and 500M were minors, or if we're conservative, maybe we guess that 100M are minors. So, we might guess that Meta estimated something like 0.1% to 0.02% of minors on Meta platforms received photos of genitals or similar each day. Is this roughly the normal rate they would experience elsewhere? Compared to the real world, possibly, although I would be surprised if 0.1% of children are being exposed to people's genitals "on any street corner". Compared to a well moderated small forum, that seems highly implausible. The internet commenter reaction was the same reaction that Arturo Bejar, who designed Facebook's reporting system and worked in the area, had. He initially dismissed reports about this kind of thing because it didn't seem plausible that it could really be that bad, but he quickly changed his mind once he started looking into it:

    Joanna’s account became moderately successful, and that’s when things got a little dark. Most of her followers were enthused about a [14-year-old] girl getting into car restoration, but some showed up with rank misogyny, like the guy who told Joanna she was getting attention “just because you have tits.”

    “Please don’t talk about my underage tits,” Joanna Bejar shot back before reporting the comment to Instagram. A few days later, Instagram notified her that the platform had reviewed the man’s comment. It didn’t violate the platform’s community standards.

    Bejar, who had designed the predecessor to the user-reporting system that had just shrugged off the sexual harassment of his daughter, told her the decision was a fluke. But a few months later, Joanna mentioned to Bejar that a kid from a high school in a neighboring town had sent her a picture of his penis via an Instagram direct message. Most of Joanna’s friends had already received similar pics, she told her dad, and they all just tried to ignore them.

    Bejar was floored. The teens exposing themselves to girls who they had never met were creeps, but they presumably weren’t whipping out their dicks when they passed a girl in a school parking lot or in the aisle of a convenience store. Why had Instagram become a place where it was accepted that these boys occasionally would—or that young women like his daughter would have to shrug it off?

    Much of the book, Broken Code, is about Bejar and others trying to get Meta to take problems like this seriously, making little progress, and often having their progress undone (although PR issues for FB seem to force FB's hand and drive some progress towards the end of the book):

    six months prior, a team had redesigned Facebook’s reporting system with the specific goal of reducing the number of completed user reports so that Facebook wouldn’t have to bother with them, freeing up resources that could otherwise be invested in training its artificial intelligence–driven content moderation systems. In a memo about efforts to keep the costs of hate speech moderation under control, a manager acknowledged that Facebook might have overdone its effort to stanch the flow of user reports: “We may have moved the needle too far,” he wrote, suggesting that perhaps the company might not want to suppress them so thoroughly.

    The company would later say that it was trying to improve the quality of reports, not stifle them. But Bejar didn’t have to see that memo to recognize bad faith. The cheery blue button was enough. He put down his phone, stunned. This wasn’t how Facebook was supposed to work. How could the platform care about its users if it didn’t care enough to listen to what they found upsetting?

    There was an arrogance here, an assumption that Facebook’s algorithms didn’t even need to hear about what users experienced to know what they wanted. And even if regular users couldn’t see that like Bejar could, they would end up getting the message. People like his daughter and her friends would report horrible things a few times before realizing that Facebook wasn’t interested. Then they would stop.

    If you're interested in the topic, I'd recommend reading the whole book, but if you just want to get a flavor for the kinds of things the book discusses, I've put a few relevant quotes into an appendix. After reading the book, I can't say that I'm very sure the number is correct because I'd have to look at the data to be strongly convinced, but it does seem plausible. And as for why Facebook might expose children to more of this kind of thing than another platform, the book makes the case that this falls out of a combination of optimizing for engagement, "number go up", and neglecting "trust and safety" work:

    Only a few hours of poking around Instagram and a handful of phone calls were necessary to see that something had gone very wrong—the sort of people leaving vile comments on teenagers’ posts weren’t lone wolves. They were part of a large-scale pedophilic community fed by Instagram’s recommendation systems.

    Further reporting led to an initial three-thousand-word story headlined “Instagram Connects Vast Pedophile Network.” Co-written with Katherine Blunt, the story detailed how Instagram’s recommendation systems were helping to create a pedophilic community, matching users interested in underage sex content with each other and with accounts advertising “menus” of content for sale. Instagram’s search bar actively suggested terms associated with child sexual exploitation, and even glancing contact with accounts with names like Incest Toddlers was enough to trigger Instagram to begin pushing users to connect with them.

    [return]
  3. but, fortunately for Zuckerberg, his target audience seems to have little understanding of the tech industry, so it doesn't really matter that Zuckerberg's argument isn't plausible. In a future post, we might look at incorrect reasoning from regulators and government officials but, for now, see this example from Gary Bernhardt where FB makes a claim that appears to be the opposite of correct to people who work in the area. [return]
  4. Another claim, rarer than "it would cost too much to provide real support", is "support can't be done because it's a social engineering attack vector". This isn't as immediately implausible because it calls to mind all of the cases where people had their SMS-2FA'd accounts owned by someone calling up a phone company and getting a phone number transferred, but I don't find it all that plausible since bank and brokerage accounts are, in general, much higher value than FB accounts and FB accounts are still compromised at a much higher rate, even compared to online-only accounts, accounts from before KYC requirements were in play, or whatever other reasonable-sounding explanation people name for the difference. [return]
  5. Another reason, less reasonable, but the actual impetus for this post, is that when Zuckerberg made his comments that only the absolute largest companies in the world can handle issues like fraud and spam, it struck me as completely absurd and, because I enjoy absurdity, I started a doc where I recorded links I saw to large company spam, fraud, moderation, and support failures, much like the list of Google knowledge card results I kept track of for a while. I didn't have a plan for what to do with that and just kept it going for years before I decided to publish the list, at which point I felt that I had to write something, since the bare list by itself isn't that interesting, so I started writing up summaries of each link (the original list was just a list of links), and here we are. When I sit down to write something, I generally have an idea of the approach I'm going to take, but I frequently end up changing my mind when I start looking at the data.

    For example, since going from hardware to software, I've had this feeling that conventional software testing is fairly low ROI, so when I joined Twitter, I had this idea that I would look at the monetary impact of errors (e.g., serving up a 500 error to a user) and outages and use that to justify working on testing, in the same way that studies looking into the monetary impact of latency can often drive work on latency reduction. Unfortunately for my idea, a naive analysis showed a fairly low monetary impact and I immediately found a number of other projects that were high impact, so I wrote up a doc explaining that my findings were the opposite of what I needed to justify doing the work that I wanted to do, noted that I hoped to do a more in-depth follow-up that could overturn my original result, and then worked on projects that were supported by data.

    This also frequently happens when I write things up here, such as this time I wanted to write up a really compelling sounding story but, on digging into it, despite it being widely cited in tech circles, I found out that it wasn't true and there wasn't really anything interesting there. It's quite often the case that when I look into something, I find that the angle I was thinking of doesn't work. When I'm writing for work, I usually feel compelled to at least write up a short doc with evidence of the negative result but, for my personal blog, I don't really feel the same compulsion, so my drafts folder and home drive are littered with abandoned negative results.

    However, in this case, on digging into the stories in the links and talking to people at various companies about how these systems work, the problem actually seemed worse than I realized before I looked into it, so it felt worth writing up even if I'm writing up something most people in tech know to be true.

    [return]

Why it's impossible to agree on what's allowed

2024-02-07 08:00:00

On large platforms, it's impossible to have policies on things like moderation, spam, fraud, and sexual content that people agree on. David Turner made a simple game to illustrate how difficult this is even in a trivial case, No Vehicles in the Park. If you haven't played it yet, I recommend playing it now before continuing to read this document.

The idea behind the site is that it's very difficult to get people to agree on what moderation rules should apply to a platform. Even if you take a much simpler example (what vehicles should be allowed in a park, given a rule and some instructions for how to interpret the rule) and then ask a small set of questions, people won't be able to agree. On doing the survey myself, one of the first reactions I had was that the questions aren't chosen to be particularly nettlesome and there are many edge cases Dave could've asked about if he wanted to make it a challenge. And yet, despite not making the survey particularly challenging, there isn't broad agreement on the questions. Comments on the survey also indicate another problem with rules, which is that it's much harder to get agreement than people think it will be. If you read comments on rule interpretation or moderation on lobsters, HN, reddit, etc., when people suggest a solution, the vast majority of people will suggest something that anyone who's done moderation or paid attention to how moderation works knows cannot work, the moderation equivalent of "I could build that in a weekend"1. Of course we see this on Dave's game as well. The top HN comment, the most agreed-upon comment, and a very common sentiment elsewhere is2:

I'm fascinated by the fact that my takeaway is the precise opposite of what the author intended.

To me, the answer to all of the questions was crystal-clear. Yes, you can academically wonder whether an orbiting space station is a vehicle and whether it's in the park, but the obvious intent of the sign couldn't be clearer. Cars/trucks/motorcycles aren't allowed, and obviously police and ambulances (and fire trucks) doing their jobs don't have to follow the sign.

So if this is supposed to be an example of how content moderation rules are unclear to follow, it's achieving precisely the opposite.

And someone agreeingly replies with:

Exactly. There is a clear majority in the answers.

After going through the survey, you get a graph showing how many people answered yes and no to each question, which is where the "clear majority" comes from. First of all, I think it's not correct to say that there is a clear majority. But even supposing that there were, there's no reason to think that there being a majority means that most people agree with you even if you take the majority position in each vote. In fact, given how "wiggly" the per-question majority graph looks, it would be extraordinary if it were the case that being in the majority for each question meant that most people agreed with you or that there's any set of positions that the majority of people agree on. Although you could construct a contrived dataset where this is true, it would be very surprising if this were true in a natural dataset.

If you look at the data (which isn't available on the site, but Dave was happy to pass it along when I asked), as of when I pulled the data, there was no set of answers which the majority of users agreed on and it was not even close. I pulled this data shortly after I posted the link to HN, when the vast majority of responses were from HN readers, who are more homogeneous than the population at large. Despite these factors making it easier to find agreement, the most popular set of answers was only selected by 11.7% of people. This is the position the top commenter says is "obvious", but it's a minority position not only in the sense that only 11.7% of people agree and 88.3% of people disagree; almost no one even holds a position that differs only slightly from this allegedly obvious position. The 2nd and 3rd most common positions, representing 8.5% and 6.5% of the vote, respectively, are similar and only disagree on whether or not a non-functioning WW-II era tank that's part of a memorial violates the rule. Beyond that, approximately 1% of people hold each of the 4th, 5th, 6th, and 7th most popular positions, every less popular position has less than 1% agreement, and there's a fairly rapid drop from there as well. So, only about 27% of people find themselves in agreement with significantly more than 1% of other users (the median user agrees with 0.16% of other users). See below for a plot of what this looks like. The opinions are sorted from most popular to least popular, with the most popular on the left. A log scale is used because there's so little agreement on opinions that a linear scale plot looks like a few points above zero followed by a bunch of zeros.

a plot illustrating the previous paragraph

Another way to look at this data is that 36902 people expressed an opinion on what constitutes a vehicle in the park and they came up with 9432 distinct opinions, for an average of ~3.9 people per distinct expressed opinion. i.e., the average user agreement is ~0.01%. Although averages are, on average, overused, an average works as a summary for expressing the level of agreement because while we do have a small handful of opinions with much higher than the average 0.01% agreement, to "maintain" the average, this must be balanced out by a ginormous number of people who have even less agreement with other users. There's no way to have a low average agreement with high actual agreement unless that's balanced out by even higher disagreement, and vice versa.
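To make the arithmetic above concrete, here's a minimal sketch (Python, not Dave's actual code, and with made-up responses rather than the real survey data) of how these agreement numbers can be computed from raw answer sets, where two people "agree" if they gave identical answers to every question:

```python
# Sketch: computing agreement statistics from survey responses.
# Each response is the tuple of yes/no answers one respondent gave.
from collections import Counter

# Hypothetical data; the real survey had 36902 responses.
responses = [
    (True, False, True, False),
    (True, False, True, False),
    (True, True, True, False),
    (False, False, True, False),
    (True, False, False, False),
]

counts = Counter(responses)
n = len(responses)

distinct = len(counts)                      # number of distinct opinions (9432 in the real data)
most_popular_share = counts.most_common(1)[0][1] / n   # 11.7% in the real data
avg_people_per_opinion = n / distinct       # ~3.9 in the real data
avg_agreement = avg_people_per_opinion / n  # ~0.01% in the real data

# Per-user agreement: fraction of *other* respondents with an identical
# answer set (the median of this was 0.16% in the real data).
per_user = sorted((counts[r] - 1) / (n - 1) for r in responses)
median_user_agreement = per_user[n // 2]

print(distinct, most_popular_share, avg_agreement, median_user_agreement)
```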

On HN, in response to the same comment, Michael Chermside had the reasonable but not highly upvoted comment,

> To me, the answer to all of the questions was crystal-clear.

That's not particularly surprising. But you may be asking the wrong question.

If you want to know whether the rules are clear then I think that the right question to ask is not "Are the answers crystal-clear to you?" but "Will different people produce the same answers?".

If we had a sharp drop in the graph at one point then it would suggest that most everyone has the same cutoff; instead we see a very smooth curve as if different people read this VERY SIMPLE AND CLEAR rule and still didn't agree on when it applied.

Many (and probably actually most) people are overconfident when predicting what other people think is obvious and often incorrectly assume that other people will think the same thoughts and find the same things obvious. This is more true of the highly-charged issues that result in bitter fights about moderation than the simple "no vehicles in the park" example, but even this simple example demonstrates not only the difficulty in reaching agreement, but the difficulty in understanding how difficult it is to reach agreement.

To use an example from another context that's more charged, consider any sport and whether or not a player is considered to be playing fair or is making dirty plays and should be censured. We could look at many different players from many different sports, so let's arbitrarily pick Draymond Green. If you ask any serious basketball fan who's not a Warriors fan who the dirtiest player in the NBA is today, you'll find general agreement that it's Draymond Green (although some people will argue for Dillon Brooks, so if you want near-uniform agreement, you'll have to ask for the top two dirtiest players). And yet, if you ask a Warriors fan about Draymond, most have no problem explaining away every dirty play of his. So if you want to get uniform agreement on a question that's much more straightforward than the "no vehicles in the park" question, such as "is it ok to stomp on another player's chest and then use them as a springboard to leap into the air (on top of a hundred other dirty plays)?", you'll find that for many such seemingly obvious questions, a sizable group of people will have extremely strong disagreements with the "obvious" answer. When you move away from a contrived, abstract example like "no vehicles in the park" to a real-world issue that people have emotional attachments to, it generally becomes impossible to get agreement even in cases where disinterested third parties would all agree, and, as we observed, agreement is already impossible even without emotional attachment. And when you move away from sports into issues people care even more strongly about, like politics, the disagreements get stronger.

While people might be able to "agree to disagree" on whether or not a non-functioning WW-II era tank that's part of a memorial violates the "no vehicles in the park" rule (resulting in a pair of positions that accounts for 15% of the vote), in reality, people often have a hard time agreeing to disagree over what outsiders would consider very small differences of opinion. Charged issues are often fractally contentious, causing disagreement among people who hold all but identical opinions, making them significantly more difficult to agree on than our "no vehicles in the park" example.

To pick a real-world example, consider Jo Freeman, a feminist who, in 1976, wrote about her experience of being canceled for minute differences in opinion and how this was unfortunately common in the Movement (using the term "trashed" and not "canceled" because cancellation hadn't come into common usage yet and, in my opinion, "trashed" is the better term anyway). In the nearly fifty years since Jo Freeman wrote "Trashing", the propensity of humans to pick on minute differences and attempt to destroy anyone who doesn't completely agree with them hasn't changed; for a recent, parallel example, see Natalie Wynn's similar experience.

For people with opinions far away in the space of commonly held opinions, the differences in opinion between Natalie and the people calling for her to be deplatformed are fairly small. But, not only did these "small" differences in opinion result in people calling for Natalie to be deplatformed, they called for her to be physically assaulted, doxed, etc., and they suggested the same treatment for her friends and associates as well as for people who didn't really associate with her, but publicly talked about similar topics and didn't cancel her. Even now, years later, she still gets calls to be deplatformed and I expect this will continue past the end of my life (when I wrote this, years after the event Natalie discussed, I did a Twitter search and found a long thread from someone ranting about what a horrible human being Natalie is for the alleged transgression discussed in the video, dated 10 days ago, and it's easy to find more of these rants). I'm not going to attempt to describe the difference in positions because the positions are close enough that describing them would take something like 5k to 10k words (as opposed to, say, a left-wing vs. a right-wing politician, where the difference is blatant enough that you can describe it in a sentence or two); you can watch the hour in the 1h40m video that's dedicated to the topic if you want to know the full details.

The point here is just that, if you look at almost any person who has public opinions on charged issues, the opinion space is fractally contentious. No large platform can satisfy user preferences because users will disagree over what content should be moderated off the platform and what content should be allowed. And, of course, this problem scales up as the platform gets larger3.

If you're looking for work, Freshpaint is hiring (US remote) in engineering, sales, and recruiting. Disclaimer: I may be biased since I'm an investor, but they seem to have found product-market fit and are rapidly growing.

Thanks to Peter Bhat Harkins, Dan Gackle, Laurence Tratt, Gary Bernhardt, David Turner, Kevin Burke, Sophia Wisdom, Justin Blank, and Bert Muthalaly for comments/corrections/discussion.


  1. Something I've repeatedly seen on every forum I've been on is the suggestion that we just don't need moderation after all and all our problems will be solved if we just stop this nasty censorship. If you want a small forum that's basically 4chan, then no moderation can work fine, but even if you want a big platform that's like 4chan, no moderation doesn't actually work. If we go back to those Twitter numbers, 300M users and 1M bots removed a day, if you stop doing this kind of "censorship", the platform will quickly fill up with bots to the point that everything you see will be spam/scam/phishing content, or content from an account copying content from somewhere else or using LLM-generated content to post spam/scam/phishing content. Not only will most accounts be bots, bots will be a part of large engagement/voting rings that will drown out all human content.

    The next most naive suggestion is to stop downranking memes, dumb jokes, etc., often thrown in with a comment like "doesn't anyone here have a sense of humor?". If you look at why forums with upvoting/ranking ban memes, it generally happens after the forum becomes totally dominated by memes/comics because people upvote those at a much higher rate than any kind of content with a bit of nuance, and not everyone wants a forum that's full of the lowest common denominator meme/comic content. And as for "having a sense of humor" in comments, if you look at forums that don't ban cheap humor, top comments will generally end up dominated by these, e.g., for maybe 3-6 months, one of the top comments on any kind of story about a man doing anything vaguely heroic on reddit forums that don't ban this kind of cheap humor was some variant of "I'm surprised he can walk with balls that weigh 900 lbs.", often repeated multiple times by multiple users, amidst a sea of the other cheap humor that was trendy during that period. Of course, some people actually want that kind of humor to dominate the comments, they actually want to see the same comment 150 times a day for months on end, but I suspect most people who grumpily claim "no one has a sense of humor here" when their cheap humor gets flagged don't actually want to read a forum that's full of other people's cheap humor.

    [return]
  2. This particular commenter indicates that they understand that moderation is, in general, a hard problem; they just don't agree with the "no vehicles in the park" example, but many other people think that both the park example and moderation are easy. [return]
  3. Nowadays, it's trendy to use "federation" as a cure-all in the same way people used "blockchain" as a cure-all five years ago, but federation doesn't solve this problem for the typical user. I actually had a conversation with someone who notes in their social media bio that they're one of the creators of the ActivityPub spec, who claimed that federation does solve this problem and that Threads adding ActivityPub would create some kind of federating panacea. I noted that fragmentation is already a problem for many users on Mastodon and whether or not Threads will be blocked is contentious and will only increase fragmentation, and the ActivityPub guy replied with something like "don't worry about that, most people won't block Threads, and it's their problem if they do."

    I noted that a problem many of my non-technical friends had when they tried Mastodon was that they'd pick a server and find that they couldn't follow someone they wanted to follow due to some kind of server blocking or ban. So then they'd try another server to follow this one person and then find that another person they wanted to follow is blocked. The fundamental problem is that users on different servers want different things to be allowed, which then results in no server giving you access to everything you want to see. The ActivityPub guy didn't have a response to this and deleted his comment.

    By the way, a problem that's much easier than moderation/spam/fraud/obscene content/etc. policy, but that the fediverse still can't solve, is how to present content. Whenever I use Mastodon to interact with someone using "honk", messages get mangled. For example, a " in the subject (and content warning) field of a Mastodon message gets converted to &quot; when the Mastodon user sees the reply from the honk user, so every reply from a honk user forks the discussion into a different subject. Here's something that can be fully specified without ambiguity, where people are much less emotionally attached to the subject than they are for moderation/spam/fraud/obscene content/etc., and the fediverse can't even solve this problem across two platforms.

    [return]

Notes on Cruise's pedestrian accident

2024-01-29 08:00:00

This is a set of notes on the Quinn Emanuel report on Cruise's handling of the 2023-10-02 accident where a Cruise autonomous vehicle (AV) hit a pedestrian, stopped, and then started moving again with the pedestrian stuck under the bottom of the AV, dragging the pedestrian 20 feet. After seeing some comments about this report, I read five stories on this report and then skimmed the report and my feeling is that the authors of four of the stories probably didn't read the report, and that people who were commenting had generally read stories by journalists who did not appear to read the source material, so the comments were generally way off base. As we previously discussed, it's common for summaries to be wildly wrong, even when they're summarizing a short paper that's easily read by laypeople, so of course summaries of a 200-page report are likely to be misleading at best.

On reading the entire report, I'd say that Cruise both looks better and worse than in the articles I saw, which is the same pattern we saw when we looked at the actual source for Exhibits H and J from Twitter v. Musk, the United States v. Microsoft Corp. docs, etc.; just as some journalists seem to be pro/anti-Elon Musk and pro/anti-Microsoft, willing to push an inaccurate narrative to dunk on them to the maximum extent possible or exonerate them to the maximum extent possible, we see the same thing here with Cruise. And as we saw in those cases, despite some articles seemingly trying to paint Cruise in the best or worst light possible, the report itself has material that is more positive and more negative than we see in the most positive or negative stories.

Aside from correcting misleading opinions on the report, I find the report interesting because it's rare to see any kind of investigation into what went wrong in tech in this level of detail, let alone a public one. We often see this kind of investigation for safety-critical systems and sometimes see it in sports as well as for historical events, but tech events are usually not covered like this. Of course companies do post-mortems of incidents, but you generally won't see a 200-page report on a single incident, nor will the focus of post-mortems be what the focus was here. In the past, we've noted that a lot can be learned by looking at the literature and incident reports on safety-critical systems, so of course this is true here as well, where we see a safety-critical system that's more tech adjacent than the ones we've looked at previously.

The length and depth of the report here reflect a difference in culture between safety-critical systems and "tech". The behavior that's described as unconscionable in the report is not only normal in tech, but probably more transparent and above board than you'd see at most major tech companies; I find the culture clash between tech and safety-critical systems interesting as well. In summarizing the report, I attempted to inject as little of my opinion as possible, even in cases where knowledge of tech companies or engineering meant that I would've personally written something different. For more opinions, see the section at the end.

REPORT TO THE BOARDS OF DIRECTORS OF CRUISE LLC, GM CRUISE HOLDINGS LLC, AND GENERAL MOTORS HOLDINGS LLC REGARDING THE OCTOBER 2, 2023 ACCIDENT IN SAN FRANCISCO

I. Introduction

A. Overview

B. Scope of Review

C. Review Plan Methodology and Limitations

D. Summary of Principal Findings and Conclusions

II. THE FACTS REGARDING THE OCTOBER 2 ACCIDENT

A. Background Regarding Cruise’s Business Operations

B. Key Facts Regarding the Accident

C. Timeline of Key Events

D. Video Footage of the Accident

E. The Facts Regarding What Cruise Knew and When About the October 2 Accident

1. Facts Cruise Learned the Evening of October 2

a. Accident Scene
b. Virtual "Sev-0 War Room"
c. Initial Media Narrative About the October 2 Accident

2. Facts Cruise Learned on October 3

a. The 12:15 a.m. "Sev-0 Collision SFO" Meeting
b. Engineer’s 3:45 a.m. Slack Message
c. The 6:00 a.m. Crisis Management Team (CMT) Meeting
d. The 6:45 a.m. Senior Leadership Team (SLT) Meeting
e. The 7:45 a.m. and 10:35 a.m. Engineering and Safety Team Meetings

f. The 12:05 p.m. CMT Meeting
g. The 12:40 p.m. SLT Meeting
h. The 6:05 p.m. CMT Meeting

3. Cruise’s Response to the Forbes Article

III. CRUISE’S COMMUNICATIONS WITH REGULATORS, CITY OFFICIALS, AND OTHER STAKEHOLDERS

A. Overview of Cruise’s Initial Outreach and Meetings with Regulators

B. The Mayor’s Office Meeting on October 3

C. Cruise’s Disclosures to the National Highway Traffic Safety Administration (NHTSA)

1. Cruise’s Initial Outreach on October 3

2. Cruise’s NHTSA Pre-Meeting

3. Cruise’s Meeting with NHTSA on October 3

4. Cruise’s NHTSA Post-Meeting on October 3

5. Cruise’s Interactions with NHTSA on October 12, 13, and 16

a. October 12 Call
b. October 13 Meeting
c. October 16 PE

6. Cruise’s NHTSA Reports Regarding the October 2 Accident

a. NHTSA 1-Day Report
b. NHTSA 10-Day Report
c. NHTSA 30-Day Report

7. Conclusions Regarding Cruise’s Interactions with NHTSA

D. Cruise’s Disclosures to the Department of Motor Vehicles (DMV)

1. Cruise’s Initial Outreach to the DMV and Internal Discussion of Which Video to Show

2. DMV’s Response to Cruise’s Outreach

3. Cruise’s DMV Pre-Meeting

4. Cruise’s October 3 Meeting with the DMV

a. DMV Meeting Discussions
b. Cruise’s Post-DMV Meeting Reflections

5. Cruise’s October 10 Communications with DMV

6. Cruise’s October 11 Meeting with the DMV

7. Cruise’s October 13 Meeting with the DMV

8. Cruise’s October 16 Meeting with the DMV

9. Cruise’s October 23 Communications with the DMV

10. DMV’s October 24 Suspension Order

11. Post-October 24 DMV Communications

12. Conclusions Regarding Cruise’s Communications with the DMV

E. Cruise’s Disclosures to the SF MTA, SF Fire Department, and SF Police Department

F. Cruise’s Disclosures to the California Public Utilities Commission (CPUC)

1. Cruise’s October 3 Communications with the CPUC

2. CPUC’s October 5 Data Request

3. Cruise’s October 19 Response to CPUC’s Data Request

4. Conclusions Regarding Cruise’s Disclosures to the CPUC

G. Cruise’s Disclosures to Other Federal Officials

IV. THE AFTERMATH OF THE OCTOBER 2 ACCIDENT

A. The Cruise License Suspension by the DMV in California

B. The NHTSA PE Investigation and Safety Recall

C. The CPUC’s “Show Cause Ruling”

D. New Senior Management of Cruise and the Downsizing of Cruise

V. SUMMARY OF FINDINGS AND CONCLUSIONS

VI. RECOMMENDATIONS

Appendix


I don't have much to add to this. I certainly have opinions, but I don't work in automotive and haven't dug into it enough to feel informed enough to add my own thoughts. In one discussion I had with a retired exec who used to work on autonomous vehicles, about incident management at Cruise vs. tech companies like Twitter or Slack, the former exec said:

You get good at incidents given a steady stream of incidents of varying severity if you have to handle the many small ones. You get terrible at incidents if you can cover up the small ones until a big one happens. So it's not only funny but natural for internet companies to do it better than AV companies I think

On the "minimal risk condition" pullover maneuver, this exec said:

These pullover maneuvers are magic pixie dust making AVs safe: if something happens, we'll do a safety pullover maneuver

And on the now-deleted blog post, "A detailed review of the recent SF hit-and-run incident", the exec said:

Their mentioning of regulatory ADAS test cases does not inspire confidence; these tests are shit. But it's a bit unfair on my part since of course they would mention these tests, it doesn't mean they don't have better ones

On how regulations and processes making safety-critical industries safer and what you'd do if you cared about safety vs. the recommendations in the report, this exec said

[Dan,] you care about things being done right. People in these industries care about compliance. Anything "above the state of the art" buys you zero brownie points. eg for [X], any [Y] ATM are not required at all. [We] are better at [X] than most and it does nothing for compliance ... OTOH if a terrible tool or process exists that does nothing good but is considered "the state of the art" / is mandated by a standard, you sure as hell are going to use it

If you're looking for work, Freshpaint is hiring a recruiter, Software Engineers, and a Support Engineer. I'm an investor, so you should consider my potential bias, but they seem to have found product-market fit and are growing extremely quickly (revenue-wise).

Thanks to an anonymous former AV exec, Justin Blank, and 5d22b for comments/corrections/discussion.

Appendix: a physical hardware curiosity

One question I had for the exec mentioned above, which wasn't relevant to this case, but is something I've wondered about for a while, is why the AVs that I see driving don't have upgraded tires and brakes. You can get much shorter stopping distances from cars that aren't super heavy by upgrading their tires and brakes, but the AVs I've seen have not had this done.

In this case, we can't do the exact comparison from an upgraded vehicle to the base vehicle because the vehicle dynamics data was redacted from section 3.3.3, table 9, and figure 40 of the appendix, but it's common knowledge that the simplest safety upgrade you can make on a car is upgrading the tires (and, if relevant, the brakes). One could argue that this isn't worth the extra running cost, or the effort (for the low-performance cars that I tend to see converted into AVs, getting stopping distances equivalent to a sporty vehicle would generally require modifying the wheel well so that wider tires don't rub) but, as an outsider, I'd be curious to know what the cost benefit trade-off on shorter stopping distances is.

They hadn't considered it before, but thought that better tires and brakes would make a difference in a lot of other cases and prevent accidents, and explained the lack of this upgrade by:

I think if you have a combination of "we want to base AV on commodity cars" and "I am an algorithms guy" mindset you will not go look at what the car should be.

And, to be clear, upgraded tires and brakes would not have changed the outcome in this case. The timeline from the Exponent report has

Looking at actual accelerometer data from a car with upgraded tires and brakes, stopping time from 19.1mph for that car was around 0.8s, so this wouldn't have made much difference in this case. If brakes aren't pre-charged before attempting to brake, there's significant latency when initially braking, such that 0.25s isn't enough for almost any braking to have occurred, which we can see from the speed only being 0.5mph slower in this case.
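As a rough sanity check on those numbers (my own back-of-envelope arithmetic, not anything from the Exponent report), here are the average decelerations implied by the figures quoted above:

```python
# Back-of-envelope deceleration arithmetic using only the numbers quoted above.
MPH_TO_MPS = 0.44704
G = 9.81  # m/s^2

def avg_decel_g(speed_drop_mph: float, seconds: float) -> float:
    """Average deceleration, in g, needed to shed speed_drop_mph in the given time."""
    return speed_drop_mph * MPH_TO_MPS / seconds / G

# Full stop from 19.1 mph in ~0.8 s (car with upgraded tires/brakes): ~1.1 g.
print(f"{avg_decel_g(19.1, 0.8):.2f} g")

# Only 0.5 mph shed over the 0.25 s window: ~0.09 g, i.e., essentially no
# braking yet, consistent with latency from brakes that aren't pre-charged.
print(f"{avg_decel_g(0.5, 0.25):.2f} g")
```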

Another comment from the exec is that, while a human might react to the collision at -2.9s and slow down or stop, "scene understanding" as a human might do it is non-existent in most or perhaps all AVs, so it's unsurprising that the AV doesn't react until the pedestrian is in the AV's path, whereas a human, if they noticed the accident in the adjacent lane, would likely drastically slow down or stop (the exec guessed that most humans would come to a complete stop, whereas I guessed that most humans would slow down). The exec was also not surprised by the 530ms latency between the pedestrian landing in the AV's path and the AV starting to attempt to apply the brakes although, as a lay person, I found 530ms surprising.

On the advantage of AVs and ADAS, as implemented today, compared to a human who's looking in the right place, paying attention, etc., the exec said

They mainly never get tired or drink and hopefully also run in that terrible driver's car in the next lane. For [current systems], it's reliability and not peak performance that makes it useful. Peak performance is definitely not superhuman but subhuman

Why do people post on [bad platform] instead of [good platform]?

2024-01-25 08:00:00

There's a class of comment you often see when someone makes a popular thread on Mastodon/Twitter/Threads/etc., and that you also see on videos, that's basically "Why make a Twitter thread? This would be better as a blog post" or "Why make a video? This would be better as a blog post". But these comments often come in a stronger form, such as:

I can't read those tweets that span pages because the users puts 5 words in each reply. I find common internet completely stupid: Twitter, tiktok, Instagram, etc. What a huge waste of energy.

or

When someone chooses to blog on twitter you know it's facile at best, and more likely simply stupid (as in this case)

These kinds of comments are fairly common, e.g., I pulled up Foone's last 10 Twitter threads that scored 200 points or more on HN and 9 out of 10 had comments like this, complaining about the use of Twitter.

People often express bafflement that anyone could have a reason for using [bad platform], such as in "how many tweets are there just to make his point? 200? nobody thinks 'maybe this will be more coherent on a single page'? I don't get social media" or "Come on, typing a short description and uploading a picture 100 times is easier than typing everything in one block and adding a few connectors here and there? ... objectively speaking it is more work".

Personally, I don't really like video as a format and, for 95% of youtube videos that I see, I'd rather get the information as a blog post than a video (and this will be even more true if Google really cracks down on ad blocking) and I think that, for a reader who's interested in the information, long-form blog posts are basically strictly better than long threads on [bad platform]. But I also recognize that much of the content that I want to read wouldn't exist at all if it wasn't for things like [bad platform].

Stepping back and looking at the big picture, there are four main reasons I've seen that people use [bad platform], which are that it gets more engagement, it's where their friends are, it's lower friction, and it monetizes better.

Engagement

The engagement reason is the simplest, so let's look at that first. Just looking at where people spend their time, short-form platforms like Twitter, Instagram, etc., completely dominate longer form platforms like Medium, Blogspot, etc.; you can see this in the valuations of these companies, in survey data, etc. Substack is the hottest platform for long-form content and its last valuation was ~$600M, basically a rounding error compared to the value of short-form platforms (I'm not including things like Wordpress or Squarespace, which derive a lot of their valuation from things other than articles and posts). The money is following the people and people have mostly moved on from long-form content. And if you talk to folks using Substack about where their readers and growth come from, the answer is platforms like Twitter, so people doing long-form content who optimize for engagement or revenue will still produce a lot of short-form content1.

Friends

The friends reason is probably the next simplest. A lot of people are going to use whatever people around them are using. Realistically, if I were ten years younger and started doing something online in 2023 instead of 2013, more likely than not, I would've tried streaming before I tried blogging. But, as an old, out of touch, person, I tried starting a blog in 2013 even knowing that blogging was a dying medium relative to video. It seems to have worked well enough for me, so I've stuck with it, but this seems generational. While there are people older than me who do video and people younger than me who write blogs, looking at the distribution of ages, I'm not all that far from the age where people overwhelmingly moved to video and if I were really planning to do something long-term instead of just doing the lowest friction thing when I started, I would've started with video. Today, doing video is natural for folks who are starting to put their thoughts online.

Friction

When [bad platform] is a microblogging platform like Twitter, Mastodon, Threads, etc., the friends reason still often applies — people on these platforms are frequently part of a community they interact with, and it makes more sense for them to keep their content on the platform full of community members than to put content elsewhere. But the bigger reason for people whose content is widely read is that a lot of people find these platforms are much lower friction than writing blog posts. When people point this out, [bad platform] haters are often baffled, responding with things like

Come on, typing a short description and uploading a picture 100 times is easier than typing everything in one block and adding a few connectors here and there? ... objectively speaking it is more work

For one thing, most widely read programmer/tech bloggers that I'm in touch with use platforms that are actually higher friction (e.g., Jekyll friction and Hugo friction). But, in principle, they could use substack, hosted wordpress, or another platform that this commenter considers "objectively" lower friction, but this fundamentally misunderstands where the friction comes from. When people talk about [bad platform] being lower friction, it's usually about the emotional barriers to writing and publishing something, not the literal number of clicks it takes to publish something. We can argue about whether or not this is rational, whether this "objectively" makes sense, etc., but at the end of the day, it is simply true that many people find it mentally easier to write on a platform where you write short chunks of text instead of a single large chunk of text.

I sometimes write things on Mastodon because it feels like the right platform for some kinds of content for me. Of course, since the issue is not the number of clicks it takes and there's some underlying emotional motivation, other people have different reasons. For example, Foone says:

Not to humblebrag or anything, but my favorite part of getting posted on hackernews or reddit is that EVERY SINGLE TIME there's one highly-ranked reply that's "jesus man, this could have been a blog post! why make 20 tweets when you can make one blog post?"

CAUSE I CAN'T MAKE A BLOG POST, GOD DAMN IT. I have ADHD. I have bad ADHD that is being treated, and the treatment is NOT WORKING TERRIBLY WELL. I cannot focus on writing blog posts. it will not happen

if I try to make a blog post, it'll end up being abandoned and unfinished, as I am unable to edit it into something readable and postable. so if I went 100% to blogs: You would get: no content I would get: lots of unfinished drafts and a feeling of being a useless waste

but I can do rambly tweet threads. they don't require a lot of attention for a long time, they don't have the endless editing I get into with blog posts, I can do them. I do them a bunch! They're just rambly and twitter, which some people don't like

The issue Foone is referring to isn't even uncommon — three of my favorite bloggers have mentioned that they can really only write things in one sitting, so either they have enough momentum to write an entire blog post or they don't. There's a difference in scale between only being able to get yourself to write a tweet at a time and only being able to write what you can fit into a single writing session, but these are differences in degree, not differences in kind.

Revenue

And whatever the reason someone has for finding [bad platform] lower friction than [good platform], allowing people to use a platform that works for them means we get more content. When it comes to video, the same thing also applies because video monetizes so much better than text and there's a lot of content that monetizes well on video that probably wouldn't monetize well in text.

To pick an arbitrary example, automotive content is one of these areas. For example, if you're buying a car and you want detailed, practical reviews of a car as well as comparisons to other cars one might consider if they're looking at that car, before YouTube, AFAIK, no one was doing anything close to the depth of what Alex Dykes does on Alex on Autos. If you open up a car magazine from the heyday of car magazines, something like Car and Driver or Road and Track from 1997, there's nothing that goes into even 1/10th of the depth that Alex does, and this is still true today of modern car magazines. The same goes for quite a few sub-categories of automotive content as well, such as Jonathan Benson's work on Tyre Reviews. Before Jonathan, no one was testing tires with the same breadth and depth and writing it up (engineers at tire companies did this kind of testing and much more, but you had to talk to them directly to get the info)2. You can find similar patterns in a lot of areas outside of automotive content as well. While this depends on the area, in many cases, the content wouldn't exist if it weren't for video. Not only do people, in general, have more willingness to watch videos than to read text, but video also monetizes much better than text does, which allows people to make providing in-depth information their job in a way that wouldn't be possible in text. In some areas, you can make good money with a paywalled newsletter, but this is essentially what car magazines are and they were never able to support anything resembling what Alex Dykes does, nor does it seem plausible that you could support something like what Jonathan Benson does on YouTube.

Or, to pick an example from the tech world, shortly after Lucy Wang created her YouTube channel, Tech With Lucy, when she had 50k subscribers and her typical videos had thousands to tens of thousands of views with the occasional video with a hundred thousand views, she noted that she was making more than she did working for AWS (with most of the money presumably coming in from sponsorships). By comparison, my blog posts all get well over a million hits and I definitely don't make anywhere near what Lucy made at AWS; instead, my blog barely covers my rent. It's possible to monetize some text decently well if you put most of it behind a paywall, e.g., Gergely Orosz does this with his newsletter, but if you want mostly or exclusively freely available content, video generally dominates text.

Non-conclusion

While I would prefer that most content that I see on YouTube/Twitter/Threads/Mastodon/etc. were hosted on a text blog, the reality is that most of that content wouldn't exist at all if it had to be written up as long-form text instead of as chunked up short-form text or video. Maybe in a few years, summary tools will get good enough that I can consume the translations but, today, all the tools I've tried often get key details badly wrong, so we just have to live with the content in the form it's created in.

If you're looking for work, Freshpaint is hiring a recruiter, Software Engineers, and a Support Engineer. I'm an investor in the company, so you should take this with the usual grain of salt, but if you're looking to join a fast growing early-stage startup, they seem to have found product-market fit and have been growing extremely quickly (revenue-wise).

Thanks to Heath Borders, Peter Bhat Harkins, James Young, Sophia Wisdom, and David Kok for comments/corrections/discussion.

Appendix: Elsewhere

Here's a comment from David Kok, from a discussion about a rant by an 80-year old bridge player about why bridge is declining, where the 80-year old claimed that the main reason is that IQ has declined and young people (as in, people who are 60 and below) are too stupid to play intellectual games like bridge; many other bridge players concurred:

Rather than some wrong but meaningful statement about age groups I always just interpret statements like "IQ has gone down" as "I am unhappy and have difficulty expressing that" and everybody else going "Yes so am I" when they concur.

If you adapt David Kok's comment to complaints about why something isn't a blog post, that's a meta reason that the reasons I gave in this post are irrelevant (to some people): these reasons only matter to people who care about the reasons. If someone is just venting their feelings and the reasons they're giving are an expression of those feelings rather than meant as legitimate reasons, then the reasons someone might not write a blog post are irrelevant.

Anyway, the topic of why post there instead of here is a common enough topic that I'm sure other people have written things about it that I'd be interested in reading. Please feel free to forward other articles you see on the topic to me

Appendix: HN comments on Foone's last 10 Twitter threads.

I looked up Foone's last 10 Twitter threads that made it to HN with 200+ points, and 9 out of 10 have complaints about why Foone used Twitter and how it would be better as a blog post. [This is not including comments of the form "For those who hate Twitter threads as much as I do: https://threadreaderapp.com/thread/1014267515696922624.html", of which there are more than comments like the ones below, which have a complaint but also have some potentially useful content, like a link to another version of the thread.]

Never trust a system that seems to be working

One of the first comments was a complaint that it was on Twitter, which was followed not too long after by

how many tweets are there just to make his point? 200? nobody thinks "maybe this will be more coherent on a single page"? I don't get social media

Someday aliens will land and all will be fine until we explain our calendar

This would be better written in a short story format but I digress.

shit like this is too good and entertaining to be on twitter [one of the few positive comments complaining about this]

This person hates it so much whenever there is a link to their content on this site, they go on huge massive rants about it with threads spamming as much as the OP, it's hilarious.

You want to know something about how bullshit insane our brains are?

They'll tolerate reading it on twitter?

Serious question : why do publishers break down their blog posts into umpteen tweeted microblogs? Do the engagement web algorithms give preference to the number of tweets in a thread? I see this is becoming more of a trend

This is a very interesting submission. But, boy, is Twitter's character limit poisonous.

IMO Foone's web presence is toxic. Rather than write a cogent article posted on their blog and then summarize a pointer to that post in a single tweet, they did the opposite writing dozens of tweets as a thread and then summarizing those tweets in a blog post. This is not a web trend I would like to encourage but alas it is catching on.

Oh, I don't care how the author writes it, or whether there's a graph relationship below (or anything else). It's just that Twitter makes the experience of reading content like that a real chore.

Reverse engineering Skifree

This should have been a blog or a livestream.

Even in this format?

I genuinely don't get it. It's a pain in the ass for them to publish it like that and it's a pain in the ass for us to read it like that. I hope Musk takes over Twitter and runs it the ground so we can get actual blog posts back.

Someone points out that Foone has noted that they find writing long-form stuff impossible and can write in short-form media, to which the response is the following:

Come on, typing a short description and uploading a picture 100 times is easier than typing everything in one block and adding a few connectors here and there?

Obviously that's their prerogative and they can do whatever they want but objectively speaking it is more work and I sincerely hope the trend will die.

Everything with a battery should have an off switch

You forgot, foone isn't going to change from streams of Twitter posts to long form blogging. [actually a meta comment on how people always complain about this and not a complaint, I think]

I can't read those tweets that span pages because the users puts 5 words in each reply. I find common internet completely stupid: Twitter, tiktok, Instagram, etc. What a huge waste of energy.

He clearly knows [posting long threads on Twitter] is a problem, he should fix it.

Someone points out that Foone has said that they're unable to write long-form blog posts, to which the person replies:

You can append to a blog post as you go the same way you can append to a Twitter feed. It's functionally the same, the medium just isn't a threaded hierarchy. There's no reason it has to be posted fully formed as he declares.

My own blog posts often have 10+ revisions after I've posted them.

It doesn't work well for thousands of people, which is why there are always complaints ... When something is suboptimal, you're well within your rights to complain about it. Posting long rants as Twitter threads is suboptimal for the consumers of said threads

I kind of appreciate the signal: When someone chooses to blog on twitter you know it's facile at best, and more likely simply stupid (as in this case)

There's an ARM Cortex-M4 with Bluetooth inside a Covid test kit

Amazingly, no complaint that I could see, although one comment was edited to be "."

Taking apart the 2010 Fisher Price re-released Music Box Record Player

why is this a twitter thread? why not a blog?

Followed by

I love that absolutely no one got the joke ... Foone is a sociopath who doesn't feel certain words should be used to refer to Foone because they don't like them. In fact no one should talk about Foone ever.

While posting to Tumblr, E and W keys just stopped working

Just hotkey detection gone wrong. Not that big of a surprise because implementing hotkeys on a website is a complete minefield. I don't think you can conclude that Tumblr is badly written from this. Badly tested maybe.

Because that comment reads like nonsense to anyone who read the link, someone asks "did you read the whole thread?", to which the commenter responds:

No because Twitter makes it completely unreadable.

My mouse driver is asking for a firewall exemption

Can we have twitter banned from being posted here? On all UI clicks, a nagging window comes up. You can click it away, but it reverts your click, so any kind of navigation becomes really cumbersome.

or twitter urls being replaced with some twitter2readable converter

Duke Nukem 3D Mirror Universe

This is remarkable, but Twitter is such an awful medium for this kind of text. I wish this was posted on a normal platform so I could easily share it.

If this were a blog post instead of a pile of tweets, we wouldn't have to expand multiple replies to see all of the content

Uh why isn't this a blog, or a youtube video? specifically to annoy foone

Yes, long form Twitter is THE WORST. However foone is awesome, so maybe they cancel each other out?

I hate twitter. It's slowly ruining the internet.

Non-foone posts

Of course this kind of thing isn't unique to Foone. For example, on the last Twitter thread I saw on HN, two of the first five comments were:

Has this guy got a blog?

and

That's kind of why the answer to "posting something to X" should be "just say no". It's impossible to say anything there that is subtle in the slightest or that requires background to understand but unfortunately people who are under the spell of X just can't begin to see something they do the way somebody else might see it.

I just pulled up Foone's threads because I know that they tend to post to short-form platforms and looking at 10 Foone threads is more interesting than looking at 10 random threads.


  1. Of course, almost no one optimizes for revenue because most people don't make money off of the content they put out on the internet. And I suspect only a tiny fraction of people are consciously optimizing for engagement, but just like we saw with prestige, there seems to be a lot of nonconscious optimization for engagement. A place where you can see this within a platform is (and I've looked at hundreds of examples of this) when people start using a platform like Mastodon or Threads. They'll post a lot of different kinds of things. Most things won't get a lot of traction and a few will. They could continue posting the same things, but they'll often, instead, post less low-engagement content over time and more high-engagement content over time. Platforms have a variety of ways of trying to make it rewarding when other people engage with your content and, on average, this seems to work on people. This is an intra-platform and not an inter-platform example, but if this works on people, it seems like the inter-platform reasoning should hold as well.

    Personally, I'm not optimizing for engagement or revenue, but I've been paying my rent from Patreon earnings, so it would probably make sense to do so. But, at least at the moment, looking into what interests me feels like a higher priority even if that's sort of a revenue and engagement minimizing move. For example, wc has the source of my last post at 20k words, which means that doing two passes of writing over the post might've been something like 7h40m. For comparison with short-form content: a while back, I did an experiment where I tried tweeting daily for a few months, which increased my Twitter followers by ~50% (from ~20k to ~30k). The Twitter experiment probably took about as much time as typing up my last post (which doesn't include the time spent doing the work for the last post which involved, among other things, reading five books and 15 or so papers about tire and vehicle dynamics), so from an engagement or revenue standpoint, posting to short-form platforms totally dominates the kind of writing I'm doing and anyone who cares almost at all about engagement or revenue would do the short-form posting instead of long-form writing that takes time to create. As for me, right now, I have two drafts I'm in the middle of which are more like my last post. For one draft, the two major things I need to finish up are writing up a summary of ~500 articles/comments for an appendix and reading a 400 page book I want to quote a few things from, and for the other, I need to finish writing up notes for ~400 pages of FOIA'd government docs. In terms of the revenue this drives to my Patreon, I'd be lucky if I make minimum wage from doing this, not even including the time spent on things I research but don't publish because the result is uninteresting. But I'm also a total weirdo. On average, people are going to produce content that gets eyeballs, so of course a lot more people are going to create more hastily written long [bad platform] threads than blog posts.

    [return]
  2. For German-language content, there was one magazine doing work that's not as thorough in some ways but semi-decently close; however, no one was translating that into English. Jonathan Benson not only does unprecedented-for-English reviews of tires, he also translates the German reviews into English!

    On the broader topic, unfortunately, despite video making more benchmarking financially viable, there's still plenty of stuff where there's no good way to figure out what's better other than by talking to people who work in the industry, such as for ADAS systems, where the public testing is cursory at best.

    [return]

How bad are search results? Let's compare Google, Bing, Marginalia, Kagi, Mwmbl, and ChatGPT

2023-12-30 08:00:00

In The birth & death of search engine optimization, Xe suggests

Here's a fun experiment to try. Take an open source project such as yt-dlp and try to find it from a very generic term like "youtube downloader". You won't be able to find it because of all of the content farms that try to rank at the top for that term. Even though yt-dlp is probably actually what you want for a tool to download video from YouTube.

More generally, most tech folks I'm connected to seem to think that Google search results are significantly worse than they were ten years ago (Mastodon poll, Twitter poll, Threads poll). However, there's a sizable group of vocal folks who claim that search results are still great. E.g., a bluesky thought leader who gets high engagement says:

i think the rending of garments about how even google search is terrible now is pretty overblown1

I suspect what's going on here is that some people have gotten so used to working around bad software that they don't even know they're doing it, reflexively doing the modern equivalent of hitting ctrl+s all the time in editors, or ctrl+a; ctrl+c when composing anything in a text box. Every adept user of the modern web has a bag of tricks they use to get decent results from queries. From having watched quite a few users interact with computers, that doesn't appear to be normal, even among people who are quite competent in various technical fields, e.g., mechanical engineering2. However, it could be that people who are complaining about bad search result quality are just hopping on the "everything sucks" bandwagon and making totally unsubstantiated comments about search quality.

Since it's fairly easy to try out straightforward, naive, queries, let's try some queries. We'll look at three kinds of queries with five search engines plus ChatGPT and we'll turn off our ad blocker to get the non-expert browsing experience. I once had a computer get owned from browsing to a website with a shady ad, so I hope that doesn't happen here (in that case, I was lucky that I could tell that it happened because the malware was doing so much stuff to my computer that it was impossible to not notice).

One kind of query is a selected set of representative queries a friend of mine used to set up her new computer. My friend is a highly competent engineer outside of tech and wanted help learning "how to use computers", so I watched her try to set up a computer and pointed out holes in her mental model of how to interact with websites and software3.

The second kind of query is queries for the kinds of things I wanted to know in high school where I couldn't find the answer because everyone I asked (teachers, etc.) gave me obviously incorrect answers and I didn't know how to find the right answer. I was able to get the right answer from various textbooks once I got to college and had access to university libraries, but the questions are simple enough that there's no particular reason a high school student shouldn't be able to understand the answers; it's just an issue of finding the answer, so we'll take a look at how easy these answers are to find. The third kind of query is a local query for information I happened to want to get as I was writing this post.

In grading the queries, there's going to be some subjectivity here because, for example, it's not objectively clear if it's better to have moderately relevant results with no scams or very relevant results interspersed with scams that try to install badware or trick you into giving up your credit card info to pay for something you shouldn't pay for. For the purposes of this post, I'm considering scams to be fairly bad, so in that specific example, I'd rate the moderately relevant results above the very relevant results that have scams mixed in. As with my other posts that have some kind of subjective ranking, there's both a short summary as well as a detailed description of results, so you can rank services yourself, if you like.

In the table below, each column is a query and each row is a search engine or ChatGPT. Results are rated (from worst to best) Terrible, Very Bad, Bad, Ok, Good, and Great, with worse results being more red and better results being more blue.

The queries are abbreviated in the column labels below:

            YouTube   Adblock   Firefox  Tire      CPU     Snow
Marginalia  Ok        Good      Ok       Bad       Bad     Bad
ChatGPT     V. Bad    Great     Good     V. Bad    V. Bad  Bad
Mwmbl       Bad       Bad       Bad      Bad       Bad     Bad
Kagi        Bad       V. Bad    Great    Terrible  Bad     Terrible
Google      Terrible  V. Bad    Bad      Bad       Bad     Terrible
Bing        Terrible  Terrible  Great    Terrible  Ok      Terrible

Marginalia does relatively well by sometimes providing decent but not great answers and then providing no answers or very obviously irrelevant answers to the questions it can't answer, with a relatively low rate of scams, lower than any other search engine (although, for these queries, ChatGPT returns zero scams and Marginalia returns some).

Interestingly, Mwmbl lets users directly edit search result rankings. I did this for one query, which would score "Great" if it was scored after my edit, but it's easy to do well on a benchmark when you optimize specifically for the benchmark, so Mwmbl's scores are without my edits to the ranking criteria.

One thing I found interesting about the Google results was that, in addition to Google's noted propensity to return recent results, there was a strong propensity to return recent youtube videos. This caused us to get videos that seem quite useless for anybody, except perhaps the maker of the video, who appears to be attempting to get ad revenue from the video. For example, when searching for "ad blocker", one of the youtube results was a video where the person rambles for 93 seconds about how you should use an ad blocker and then googles "ad blocker extension". They then click on the first result and incorrectly say that "it's officially from Google", i.e., the ad blocker is either made by Google or has some kind of official Google seal of approval, because it's the first result. They then ramble for another 40 seconds as they install the ad blocker. After it's installed, they incorrectly state "this is basically one of the most effective ad blocker [sic] on Google Chrome". The video has 14k views. For reference, Steve Yegge spent a year making high-effort videos and his most viewed video has 8k views, with a typical view count below 2k. This person who's gaming the algorithm by making low quality videos on topics they know nothing about, who's part of the cottage industry of people making videos taking advantage of Google's algorithm prioritizing recent content regardless of quality, is dominating Steve Yegge's videos because they've found search terms that you can rank for if you put anything up. We'll discuss other Google quirks in more detail below.

ChatGPT does its usual thing and impressively outperforms its more traditional competitors in one case, does an ok job in another case, refuses to really answer the question in another case, and "hallucinates" nonsense for a number of queries (as usual for ChatGPT, random perturbations can significantly change the results4). It's common to criticize ChatGPT for its hallucinations and, while I don't think that's unfair, as we noted in this 2015, pre-LLM post on AI, I find this general class of criticism to be overrated in that humans and traditional computer systems make the exact same mistakes.

In this case, search engines return various kinds of hallucinated results. In the snow forecast example, we got deliberately fabricated results, one intended to drive ad revenue through shady ads on a fake forecast site, and another intended to trick the user into thinking that the forecast indicates a cold, snowy, winter (the opposite of the actual forecast), seemingly in order to get the user to sign up for unnecessary snow removal services. Other deliberately fabricated results include a site that's intended to look like an objective review site that's actually a fake site designed to funnel you into installing a specific ad blocker, where the ad blocker they funnel you to appears to be a scammy one that tries to get you to pay for ad blocking and doesn't let you unsubscribe, a fake "organic" blog post trying to get you to install a chrome extension that exposes all of your shopping to some service (in many cases, it's not possible to tell if a blog post is a fake or shill post, but in this case, they hosted the fake blog post on the domain for the product and, although it's designed to look like there's an entire blog on the topic, there isn't — it's just this one fake blog post), etc.

There were also many results which don't appear to be deliberately fraudulent and are just run-of-the-mill SEO garbage designed to farm ad clicks. These seem to mostly be pre-LLM sites, so they don't read quite like ChatGPT hallucinations, but they're not fundamentally different. Sometimes the goal of these sites is to get users to click on ads that actually scam the user, and sometimes the goal appears to be to generate clicks to non-scam ads. Search engines also returned many seemingly non-deliberate human hallucinations, where people confidently stated incorrect answers in places where user content is highlighted, like quora, reddit, and stack exchange.

On these queries, even ignoring anything that looks like LLM-generated text, I'd rate the major search engines (Google and Bing) as somewhat worse than ChatGPT in terms of returning various kinds of hallucinated or hallucination-adjacent results. While I don't think concerns about LLM hallucinations are illegitimate, the traditional ecosystem has the problem that the system highly incentivizes putting whatever is most profitable for the software supply chain in front of the user which is, in general, quite different from the best result.

For example, if your app store allows "you might also like" recommendations, the highest bidders for ad slots in apps about gambling addiction management will be gambling apps. Allowing gambling ads on an addiction management app is too blatantly user-hostile for any company to deliberately allow today, but of course companies that make gambling apps will try to game the system to break through the filtering and they sometimes succeed. And for web search, I just tried this again on the web and one of the two major search engines returned, as a top result, ad-laden SEO blogspam for addiction management. At the top of the page is a multi-part ad, with the top two links being "GAMES THAT PAY REAL MONEY" and "GAMES THAT PAY REAL CASH". In general, I was getting localized results (lots of .ca domains since I'm in Canada), so you may get somewhat different results if you try this yourself.

Similarly, if the best result is a good, free, ad blocker like ublock origin, the top ad slot is worth a lot more to a company that makes an ad blocker designed to trick you into paying for a lower quality ad blocker with a nearly-uncancellable subscription, so the scam ad blocker is going to outbid the free ad blocker for the top ad slots. These kinds of companies also have a lot more resources to spend on direct SEO, as well as indirect SEO activities like marketing so, unless search engines mount a more effective effort to combat the profit motive, the top results will go to paid ad blockers even though the paid ad blockers are generally significantly worse for users than free ad blockers. If you talk to people who work on ranking, a lot of the biggest ranking signals are derived from clicks and engagement, but this will only drive users to the best results when users are sophisticated enough to know what the best results are, which they generally aren't. Human raters also rate page quality, but this has the exact same problem.

Many Google employees have told me that ads are actually good because they inform the user about options the user wouldn't have otherwise known about, but anyone who tries browsing without an ad blocker will see ads that are misleading in various ways and ads that try to trick or entrap the user, e.g., by pretending to be a window or by advertising "GAMES THAT PAY REAL CASH" at the top of a page on battling gambling addiction, which has managed to SEO itself to a high ranking on gambling addiction searches. In principle, these problems could be mitigated with enough resources, but we can observe that trillion dollar companies have chosen not to invest enough resources in combating SEO, spam, etc., to make these kinds of scam ads rare. Instead, a number of top results are actually ads that direct you to scams.

In their original PageRank paper, Sergey Brin and Larry Page noted that ad-based search is inherently not incentive aligned with providing good results:

Currently, the predominant business model for commercial search engines is advertising. The goals of the advertising business model do not always correspond to providing quality search to users. For example, in our prototype search engine one of the top results for cellular phone is "The Effect of Cellular Phone Use Upon Driver Attention", a study which explains in great detail the distractions and risk associated with conversing on a cell phone while driving. This search result came up first because of its high importance as judged by the PageRank algorithm, an approximation of citation importance on the web [Page, 98]. It is clear that a search engine which was taking money for showing cellular phone ads would have difficulty justifying the page that our system returned to its paying advertisers. For this type of reason and historical experience with other media [Bagdikian 83], we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the Consumers.

Since it is very difficult even for experts to evaluate search engines, search engine bias is particularly insidious. A good example was OpenText, which was reported to be selling companies the right to be listed at the top of the search results for particular queries [Marchiori 97]. This type of bias is much more insidious than advertising, because it is not clear who "deserves" to be there, and who is willing to pay money to be listed. This business model resulted in an uproar, and OpenText has ceased to be a viable search engine. But less blatant bias are likely to be tolerated by the market. ... This type of bias is very difficult to detect but could still have a significant effect on the market. Furthermore, advertising income often provides an incentive to provide poor quality search results. For example, we noticed a major search engine would not return a large airline’s homepage when the airline’s name was given as a query. It so happened that the airline had placed an expensive ad, linked to the query that was its name. A better search engine would not have required this ad, and possibly resulted in the loss of the revenue from the airline to the search engine. In general, it could be argued from the consumer point of view that the better the search engine is, the fewer advertisements will be needed for the consumer to find what they want. This of course erodes the advertising supported business model of the existing search engines ... we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm.
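
As an aside, for readers who haven't seen it, here's a minimal sketch of the kind of computation the quote refers to when it says "importance as judged by the PageRank algorithm": rank flows along links, with a damping factor modeling a surfer who sometimes jumps to a random page. The toy graph, names, and parameters below are mine, for illustration only; this isn't Google's production ranking code.

    # Toy PageRank power iteration (Python). The graph, damping factor
    # d=0.85, and iteration count are made-up illustrative values.
    def pagerank(links, d=0.85, iters=50):
        """links: dict mapping each page to the list of pages it links to."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iters):
            new = {p: (1 - d) / n for p in pages}
            for src, outs in links.items():
                if not outs:  # dangling page: spread its rank evenly
                    for p in pages:
                        new[p] += d * rank[src] / n
                else:
                    for dst in outs:
                        new[dst] += d * rank[src] / len(outs)
            rank = new
        return rank

    # A page that several others link to ends up with the highest rank, which
    # is the mechanism by which a widely cited page like the driver-attention
    # study could outrank commercial pages in the pre-ad prototype.
    toy = {"study": [], "a": ["study"], "b": ["study", "a"], "c": ["study"]}
    print(pagerank(toy))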

Of course, Google is now dominated by ads and, despite specifically calling out the insidiousness of users conflating real results with paid results, both Google and Bing have made ads look more and more like real search results, to the point that most users usually won't know that they're clicking on ads and not real search results. By the way, this propensity for users to think that everything is an "organic" search result is the reason that, in this post, results are ordered by the order they appear on the page, so if four ads appear above the first organic result, the four ads will be ranked 1-4 and the organic result will be ranked 5. I've heard Google employees say that AMP didn't impact search ranking because it "only" controlled what results went into the "carousel" that appeared above search results, as if inserting a carousel and then a bunch of ads above results, pushing results down below the fold, has no impact on how the user interacts with results. It's also common to see search engines ransoming the top slot for companies, so that companies that don't buy the ad for their own name end up with searches for that company putting their competitors at the top, which is also said to not impact search result ranking, a technically correct claim that's basically meaningless to the median user.

When I tried running the query from the paper, "cellular phone" (no quotes), the top result was a Google Store link to buy Google's own Pixel 7, with the rest of the top results being various Android phones sold on Amazon. That's followed by the Wikipedia page for Mobile Phone, and then a series of commercial results all trying to sell you phones or SEO-spam trying to get you to click on ads or buy phones via their links (the next 7 results were commercial, with the next result after that being an ad-laden SEO blogspam page for the definition of a cell phone with ads for cell phones on it, followed by 3 more commercial results, followed by another ad-laden definition of a phone). The commercial links seem very low quality, e.g., the top link below the carousel after Wikipedia is Best Buy's Canadian mobile phone page. The first two products there are ad slots for eufy's version of the AirTag. The next result is for a monthly financed iPhone that's tied to Rogers, the next for a monthly financed Samsung phone that's tied to TELUS, then we have Samsung's AirTag equivalent, a monthly financed iPhone tied to Freedom Mobile, a monthly financed iPhone tied to Freedom Mobile in a different color, a monthly financed iPhone tied to Rogers, a screen protector for the iPhone 13, another Samsung AirTag-equivalent product, an unlocked iPhone 12, a Samsung wall charger, etc.; it's an extremely low quality result with products that people shouldn't be buying (and, based on the number of reviews, aren't buying — the modal number of reviews of the top products is 0 and the median is 1 or 2, even though there are plenty of things people do actually buy from Best Buy Canada and plenty of products that have lots of reviews). The other commercial results that show up are also generally extremely low quality. The result that Sergey and Larry suggested was a great top result, "The Effect of Cellular Phone Use Upon Driver Attention", is nowhere to be seen, buried beneath an avalanche of commercial results. On the other side of things, Google has also gotten in on the action by buying ads that trick users, such as paying for an installer to try to trick users into installing Chrome over Firefox.

Anyway, after looking at the results of our test queries, some questions that come to mind are:

The first question could easily be its own post and this post is already 17000 words, so maybe we'll examine it another time. We've previously noted that some individuals can be very productive, but of course the details vary in each case.

On the second question, we looked at a similar question in 2016, both the general version, "I could reproduce this billion dollar company in a weekend", as well as specific comments about how open source software would make it trivial to surpass Google any day now, such as

Nowadays, most any technology you need is indeed available in OSS and in state of the art. Allow me to plug meta64.com (my own company) as an example. I am using Lucene to index large numbers of news articles, and provide search into them, by searching a Lucene index generated by simple scraping of RSS-crawled content. I would claim that the Lucene technology is near optimal, and this search approach I'm using is nearly identical to what a Google would need to employ. The only true technology advantage Google has is in the sheer number of servers they can put online, which is prohibitively expensive for us small guys. But from a software standpoint, Google will be overtaken by technologies like mine over the next 10 years I predict.

and

Scaling things is always a challenge but as long as Lucene keeps getting better and better there is going to be a point where Google's advantage becomes irrelevant and we can cluster Lucene nodes and distribute search related computations on top and then use something like Hadoop to implement our own open source ranking algorithms. We're not there yet but technology only gets better over time and the choices we as developers make also matter. Even though Amazon and Google look like unbeatable giants now don't discount what incremental improvements can accomplish over a long stretch of time and in technology it's not even that long a stretch. It wasn't very long ago when Windows was the reigning champion. Where is Windows now?

In that 2016 post, we saw that people who thought that open source solutions were set to surpass Google any day now appeared to have no idea how many hard problems must be solved to make a mainstream competitor to Google, including real-time indexing of rapidly-updated sites, like Twitter, newspapers, etc., as well as table-stakes level NLP, which is extremely non-trivial. Since 2016, these problems have gotten significantly harder as there's more real-time content to index and users expect much better NLP. The number of things people expect out of their search engine has increased as well, making the problem harder still, so it still appears to be quite difficult to displace Google as a mainstream search engine for, say, a billion users.

On the other hand, if you want to make a useful search engine for a small number of users, that seems easier than ever because Google returns worse results than it used to for many queries. In our test queries, we saw a number of queries where many or most top results were filled with SEO garbage, a problem that, even before the rise of LLMs, was significantly worse than it was a decade ago, and that continues to get worse. I typically use search engines in a way that doesn't run into this, but when I look at what "normal" users query or if I try naive queries myself, as I did in this post, most results are quite poor, which didn't used to be true.

Another place Google now falls over for me is when finding non-popular pages. I often find that, when I want to find a web page and I correctly remember the contents of the page, even if I do an exact string search, Google won't return the page. Either the page isn't indexed, or the page is effectively not indexed because it lives in some slow corner of the index that doesn't return in time. In order to find the page, I have to remember some text in a page that links to the page (often many clicks removed from the actual page, not just one, so I'm really remembering a page that links to a page that links to a page that links to a page that links to a page and then using archive.org to traverse the links that are now dead), search for that, and then manually navigate the link graph to get to the page. This basically never happened when I searched for something in 2005 and rarely happened in 2015, but this now happens a large fraction of the time I'm looking for something. Even in 2015, Google wasn't actually comprehensive. Just for example, Google search didn't index every tweet. But, at the time, I found Google search better at searching for tweets than Twitter search and I basically never ran across a tweet I wanted to find that wasn't indexed by Google. But now, most of the tweets I want to find aren't returned by Google search5, even when I search for "[exact string from tweet] site:twitter.com". In the original PageRank paper, Sergey and Larry said "Because humans can only type or speak a finite amount, and as computers continue improving, text indexing will scale even better than it does now." (and that, while machines can generate an effectively infinite amount of content, just indexing human-generated content seems very useful). Pre-LLM, Google certainly had the resources to index every tweet as well as every human generated utterance on every public website, but they seem to have chosen to devote their resources elsewhere and, relative to its size, the public web appears less indexed than ever, or at least less indexed than it's been since the very early days of web search.

Back when Google returned decent results for simple queries and indexed almost any public page I'd want to find, it would've been very difficult for an independent search engine to return results that I find better than Google's. Marginalia in 2016 would've been nothing more than a curiosity for me since Google would give good-enough results for basically anything where Marginalia returns decent results, and Google would give me the correct result in queries for every obscure page I searched for, something that would be extremely difficult for a small engine. But now that Google effectively doesn't index many pages I want to search for, the relatively small indices that independent search engines have don't make them non-starters for me and some of them return less SEO garbage than Google, making them better for my use since I generally don't care about real-time results, don't need fancy NLP (and find that much of it actually makes search results worse for me), don't need shopping integrated into my search results, rarely need image search with understanding of images, etc.

On the question of whether or not a collection of small search engines can provide better results than Google for a lot of users, I don't think this is much of a question because the answer has been a resounding "yes" for years. However, many people don't believe this is so. For example, a Google TLM replied to the bluesky thought leader at the top of this post with

Somebody tried argue that if the search space were more competitive, with lots of little providers instead of like three big ones, then somehow it would be *more* resistant to ML-based SEO abuse.

And... look, if *google* can't currently keep up with it, how will Little Mr. 5% Market Share do it?

presumably referring to arguments like Hillel Wayne's "Algorithm Monocultures", to which our bluesky thought leader replied

like 95% of the time, when someone claims that some small, independent company can do something hard better than the market leader can, it’s just cope. economies of scale work pretty well!

In the past, we looked at some examples where the market leader provides a poor product and various other players, often tiny, provide better products; in a future post, we'll look at how economies of scale and diseconomies of scale interact in various areas of tech. For this post, suffice it to say that, despite the common "econ 101" cocktail party idea that economies of scale should be the dominant factor for search quality, that doesn't appear to be the case when we look at actual results.

On the question of whether or not Mwmbl's user-curated results can work, I would guess no, or at least not without a lot more moderation. Just browsing to Mwmbl shows the last edit to ranking was by user "betest", who added some kind of blogspam as the top entry for "RSS". It appears to be possible to revert the change, but there's no easily findable way to report the change or the user as spammy.

On the question of whether or not something like Metacrawler, which aggregated results from multiple search engines, would produce superior results today, that's arguably irrelevant since it would either be impossible to legally run as a commercial service or require prohibitive licensing fees, but it seems plausible that, from a technical standpoint, a modern metacrawler would be fairly good today. Metacrawler quickly became irrelevant because Google returned significantly better results than you would get by aggregating results from other search engines, but it doesn't seem like that's the case today.

Going back to the debate between folks like Xe, who believe that straightforward search queries are inundated with crap, and our thought leader, who believes that "the rending of garments about how even google search is terrible now is pretty overblown", it appears that Xe is correct. Although Google doesn't publicly provide the ability to see what was historically returned for queries, many people remember when straightforward queries generally returned good results. One of the reasons Google took off so quickly in the 90s, even among expert users of AltaVista, who'd become very adept at adding all sorts of qualifiers to queries to get good results, was that you didn't have to do that with Google. But we've now come full circle and we need to add qualifiers, restrict our search to specific sites, etc., to get good results from Google on what used to be simple queries. If anything, we've gone well past full circle since the contortions we need to get good results are a lot more involved than they were in the AltaVista days.

If you're looking for work, Freshpaint is hiring a recruiter, Software Engineers, and a Support Engineer. I'm an investor in the company, so you should take this with the usual grain of salt, but if you're looking to join a fast-growing early-stage startup, they seem to have found product-market fit and have been growing extremely quickly (revenue-wise).

Thanks to Laurence Tratt, Heath Borders, Justin Blank, Brian Swetland, Viktor Lofgren (who, BTW, I didn't know before writing this post — I only reached out to him to discuss the Marginalia search results after running the queries), Misha Yagudin, @[email protected], Jeremy Kun, and Yossi Kreinin for comments/corrections/discussion.

Appendix: Other search engines

Appendix: queries that return good results

I think that most programmers are likely to be able to get good results for every query, except perhaps the tire width vs. grip query, so here's how I found an ok answer to the tire query:

I tried a youtube search, since a lot of the best car-related content is now on youtube. A youtube video whose title claims to answer the question (the video doesn't actually answer the question) has a comment recommending Carroll Smith's book "Tune To Win". The comment claims that chapter 1 explains why wider tires have more grip, but I couldn't find an explanation anywhere in the book. Chapter 1 does note that race cars typically run wider tires than passenger cars and that passenger cars are moving towards having wider tires, and it makes some comments about slip angle that give a sketch of an intuitive reason for why you'd end up with better cornering with a wider contact patch, but I couldn't find a comment that explains differences in braking. Also, the book notes that the primary reason for the wider contact patch is that it (indirectly) allows for less heat buildup, which then lets you design tires that operate over a narrower temperature range, which allows for softer rubber. That may be true, but it doesn't explain much of the observed behavior one might wonder about.

Tune to Win recommends Kummer's The Unified Theory of Tire and Rubber Friction and Hays and Brooke's (actually Browne, but Smith incorrectly says Brooke) The Physics of Tire Traction. Neither of these really explained what's happening either, but looking for similar books turned up Milliken and Milliken's Race Car Vehicle Dynamics, which also didn't really explain why but seemed closer to having an explanation. Looking for books similar to Race Car Vehicle Dynamics turned up Guiggiani's The Science of Vehicle Dynamics, which did get at how to think about and model a number of related factors. The last chapter of Guiggiani's book refers to something called the "brush model" (of tires) and searching for "brush model tire width" turned up a reference to Pacejka's Tire and Vehicle Dynamics, which does start to explain why wider tires have better grip and what kind of modeling of tire and vehicle dynamics you need to do to explain easily observed tire behavior.

As we've noted, people have different tricks for getting good results so, if you have a better way of getting a good result here, I'd be interested in hearing about it. But note that, basically every time I have a post that notes that something doesn't work, the most common suggestion will be to do something that's commonly suggested that doesn't work, even though the post explicitly notes that the commonly suggested thing doesn't work. For example, the most common comment I receive about this post on filesystem correctness is that you can get around all of this stuff by doing the rename trick, even though the post explicitly notes that this doesn't work, explains why it doesn't work, and references a paper which discusses why it doesn't work. A few years later, I gave an expanded talk on the subject, where I noted that people kept suggesting this thing that doesn't work and the most common comment I get on the talk is that you don't need to bother with all of this stuff because you can just do the rename trick (and no, ext4 having auto_da_alloc doesn't mean that this works since you can only do it if you check that you're on a compatible filesystem which automatically replaces the incorrect code with correct code, at which point it's simpler to just write the correct code). If you have a suggestion for the reason wider tires have better grip or for a search which turns up an explanation, please consider making sure that the explanation is not one of the standard incorrect explanations noted in this post and that the explanation can account for all of the behavior that one must be able to account for if one is explaining this phenomenon.
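
For readers who haven't run into it, here's roughly what "the rename trick" refers to: write the new contents to a temporary file, then atomically rename it over the target. The sketch below is mine (the function name is made up), and the fsync caveats in the comments are exactly the kind of detail the linked post and talk argue people get wrong, so don't read this as a claim that the pattern, by itself, is sufficient.

    import os

    def write_file_with_rename(path, data):
        """The commonly suggested "rename trick": write to a temp file, then
        atomically rename it over the target. The rename is atomic, but, as
        discussed in the linked post, that alone doesn't give you durability."""
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # without this, the target can be empty/partial after a crash
        os.replace(tmp, path)  # atomic replace on POSIX and Windows
        # Persisting the rename itself also requires an fsync of the containing
        # directory (POSIX-only sketch; error handling omitted):
        dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_DIRECTORY)
        try:
            os.fsync(dirfd)
        finally:
            os.close(dirfd)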

On how to get good results for other queries, since this post is already 17000 words, I'll leave that for a future post on how expert vs. non-expert computer users interact with computers.

Appendix: summary of query results

For each question, answers are ordered from best to worst, with the metric being my subjective impression of how good the result is. These queries were mostly run in November 2023, although a couple were run in mid-December. When I'm running queries, I very rarely write natural language queries myself. However, normal users often write natural language queries, so I arbitrarily did the "Tire" and "Snow" queries as natural queries. Continuing with the theme of running simple, naive, queries, we used the free version of ChatGPT for this post, which means the queries were run through ChatGPT 3.5. Ideally, we'd run the full matrix of queries using keyword and natural language queries for each query, run a lot more queries, etc., but this post is already 17000 words (converting to pages of a standard length book, that would be something like 70 pages), so running the full matrix of queries with a few more queries would pretty quickly turn this into a book-length post. For work and for certain kinds of data analysis, I'll sometimes do projects that are that comprehensive or more comprehensive, but here, we can't cover anything resembling a comprehensive set of queries and the best we can do is to just try a handful of queries that seem representative and use our judgment to decide if this matches the kind of behavior we and other people generally see, so I don't think it's worth doing something like 4x the work to cover marginally more ground.

For the search engines, all queries were run in a fresh incognito window with cleared cookies, with the exception of Kagi, which doesn't allow logged-out searches. For Kagi, the queries were done with a fresh account with no custom personalization or filters, although they were done in sequence with the same account, so it's possible some kind of personalized ranking was applied to the later queries based on the clicks in the earlier queries. These queries were done in Vancouver, BC, which seems to have applied some kind of localized ranking on some search engines.
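
(The queries in this post were run by hand, but if you wanted to script a similar "fresh profile per query" setup, something like Playwright's isolated browser contexts gives you a clean cookie jar for each search. This is just a sketch under that assumption; the query list and URL pattern below are placeholders, and real result pages may throw up consent dialogs and the like.)

    # Hypothetical reproduction sketch using Playwright (pip install playwright;
    # playwright install chromium). Each new_context() is an isolated profile
    # with no cookies or history, roughly analogous to a fresh incognito window.
    from urllib.parse import quote_plus
    from playwright.sync_api import sync_playwright

    QUERIES = ["download youtube videos", "ad blocker"]  # placeholder queries

    with sync_playwright() as p:
        browser = p.chromium.launch()
        for q in QUERIES:
            context = browser.new_context()  # fresh context: no cookies, no history
            page = context.new_page()
            page.goto("https://www.google.com/search?q=" + quote_plus(q))
            print(q, "->", page.title())
            context.close()
        browser.close()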

Appendix: detailed query results

Download youtube videos

For our first query, we'll search "download youtube videos" (Xe's suggested search term, "youtube downloader", returns very similar results). The ideal result is yt-dlp or a thin, free wrapper around yt-dlp. yt-dlp is a fork of youtube-dlc, which is a now defunct fork of youtube-dl, which seems to have very few updates nowadays. A link to one of these older downloaders also seems ok if they still work.
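
For reference, yt-dlp works both as a command-line tool (yt-dlp <url>) and as a Python library. A minimal sketch of the library usage is below; the option names follow yt-dlp's documented YoutubeDL options, but treat the specifics (output template, single-video flag, the placeholder URL) as illustrative rather than a recommendation of particular settings.

    # Minimal sketch of downloading a video with yt-dlp's Python API.
    # Assumes `pip install yt-dlp`; the URL in the comment is a placeholder.
    from yt_dlp import YoutubeDL

    def download(url, out_dir="."):
        opts = {
            "outtmpl": f"{out_dir}/%(title)s.%(ext)s",  # where to put the file
            "noplaylist": True,  # download just the single video, not a playlist
        }
        with YoutubeDL(opts) as ydl:
            ydl.download([url])

    # download("https://www.youtube.com/watch?v=<video id>")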

Google

  1. Some youtube downloader site. Has lots of assurances that the website and the tool are safe because they've been checked by "Norton SafeWeb". Interacting with the site at all prompts you to install a browser extension and enable notifications. Trying to download any video gives you a full page pop-over for extension installation for something called CyberShield. There appears to be no way to dismiss the popover without clicking on something to try to install it. After going through the links but then choosing not to install CyberShield, no video downloads. Googling "cybershield chrome extension" returns a knowledge card with "Cyber Shield is a browser extension that claims to be a popup blocker but instead displays advertisements in the browser. When installed, this extension will open new tabs in the browser that display advertisements trying to sell software, push fake software updates, and tech support scams.", so CyberShield appears to be badware.
  2. Some youtube downloader site. Interacting with the site causes a pop-up prompting you to download their browser extension. Putting a video URL in causes a pop-up to some scam site but does also cause the video to download, so it seems to be possible to download youtube videos here if you're careful not to engage with the scams the site tries to trick you into interacting with
  3. PC Magazine listicle on ways to download videos from youtube. Top recommendations are paying for youtube downloads, VLC (which they note didn't work when they tried it), some $15/yr software, some $26/yr software, "FlixGrab", then a warning about how the downloader websites are often scammy and they don't recommend any downloader website. The article has more than one ad per suggestion.
  4. Some youtube downloader site with shady pop-overs that try to trick you into clicking on ads before you even interact with the page
  5. Some youtube downloader site with pop-ups that try to trick you into clicking on scam ads
  6. Some youtube downloader site with pop-ups that try to trick you into clicking on scam ads, e.g., "Samantha 24, vancouver | I want sex, write to WhatsApp | Close / Continue". Clicking anything (any button, or anywhere else on the site) tries to get you to install something called "Adblock Ultimate"
  7. ZDNet listicle. First suggestion is ClipGrab, which apparently bundles a bunch of malware/adware/junkware with the installer: https://www.reddit.com/r/software/comments/w9o1by/warning_about_clipgrab/. The listicle is full of ads and has an autoplay video
  8. [YouTube video] Over 2 minutes of ads followed by a video on how to buy youtube premium (2M views on video)
  9. [YouTube video] Video that starts off by asking users to watch the whole video (some monetization thing?). The video tries to funnel you to some kind of software to download videos that costs money
  10. [YouTube video] PC Magazine video saying that you probably don't "have to" download videos since you can use the share button, and then suggests reading their story (the one in result #3) on how to download videos
  11. Some youtube downloader site with scam ads. Interacting with the site at all tries to get you to install "Adblock Ultimate"
  12. Some youtube downloader site with pop-ups that try to trick you into clicking on scam ads
  13. Some youtube downloader site with scam ads

Out of 10 "normal" results, we have 9 that, in one way or another, try to get you to install badware or are linked to some other kind of ad scam. One page doesn't do this, but it also doesn't suggest the good, free, option for downloading youtube videos and instead suggests a number of paid solutions. We also had three youtube videos, all of which seem to be the video equivalent of SEO blogspam. Interestingly, we didn't get a lot of ads from Google itself despite that happening the last time I tried turning off my ad blocker to do some Google test queries.

Bing

  1. Some youtube downloader site. This is google (2), which has ads for scam sites
  2. [EXPLORE FURTHER ... "Recommended to you based on what's popular"] Some youtube download site, not one we saw from google. Site has multiple pulsing ads and bills itself as "50% off" for Christmas (this search was done in mid-November). Trying to download any video pulls up a fake progress bar with a "too slow? Try [our program] link". After a while, a link to download the video appears, but it's a trick, and when you click it, it tries to install "oWebster Search extension". Googling "oWebster Search extension" indicates that it's badware that hijacks your browser to show ads. Two of the top three hits are how to install the extension and the rest of the top hits are how to remove this badware. Many of the removal links are themselves scams that install other badware. After not installing this badware, clicking the download link again results in a pop-over that tries to get you to install the site's software. If you dismiss the pop-over and click the download link again, you just get the pop-over link again, so this site appears to be a pure scam that doesn't let you download videos
  3. [EXPLORE FURTHER]. Interacting with the site pops up fake ads with photos of attractive women who allegedly want to chat with you. Clicking the video download button tries to get you to install a copycat ad blocker that displays extra pop-over ads. The site does seem to actually give you a video download, though
  4. [EXPLORE FURTHER] Same as (3)
  5. [EXPLORE FURTHER] Same as Google (1) (that NortonSafeWeb youtube downloader site that tries to scam you)
  6. [EXPLORE FURTHER] A site that converts videos to MP4. I didn't check to see if the site works or is just a scam as the site doesn't even claim to let you download youtube videos
  7. Google (1), again. That NortonSafeWeb youtube downloader site that tries to scam you.
  8. [EXPLORE FURTHER] A link to youtube.com (the main page)
  9. [EXPLORE FURTHER] Some youtube downloader site with a popover that tries to trick you into clicking on an ad. Closing that reveals 12 more ads. There's a scam ad that's made to look like a youtube downloader button. If you scroll past that, there's a text box and a button for trying to download a youtube video. Entering a valid URL results in an error saying there's no video at that URL.
  10. Gigantic card that actually has a download button. The download button is fake and just takes you to the site. The site loudly proclaims that the software is not adware, spyware, etc. Quite a few internet commenters note that their antivirus software tags this software as malware. A lot of comments also indicate that the software doesn't work very well but sometimes works. The site for the software has an embedded youtube video, which displays "This video has been removed for violating YouTube's Terms of Service". Oddly, the download links for mac and Linux are not for this software and in fact don't download anything at all and are installation instructions for youtube-dl; perhaps this makes sense if the windows version is actually malware. The windows download button takes you to a page that lets you download a windows executable. There's also a link to some kind of ad-laden page that tries to trick you into clicking on ads that look like normal buttons
  11. PC magazine listicle
  12. An ad for some youtube downloader program that claims "345,764,132 downloads today"; searching the name of this product on reddit seems to indicate that it's malware
  13. Ad for some kind of paid downloader software

That's the end of the first page.

Like Google, no good results and a lot of scams and software that may not be a scam but is some kind of lightweight skin around an open source project that charges you instead of letting you use the software for free.

Marginalia

  1. 12-year old answer suggesting youtube-dl, which links to a URL which has been taken down and replaced with "Due to a ruling of the Hamburg Regional Court, access to this website is blocked."
  2. Some SEO'd article, like you see on normal search engines
  3. Leawo YouTube Downloader (I don't know what this is, but a quick search at least doesn't make it immediately obvious that this is some kind of badware, unlike the Google and Bing results)
  4. Some SEO'd listicle, like you see on normal search engines
  5. Bug report for some random software
  6. Some random blogger's recommendation for "4K Video Downloader". A quick search seems to indicate that this isn't a scam or badware, but it does lock some features behind a paywall, and is therefore worse than yt-dlp or some free wrapper around yt-dlp
  7. A blog post on how to install and use yt-dlp. The blogpost notes that it used to be about youtube-dl, but has been updated to yt-dlp.
  8. More software that charges you for something you can get for free, although searching for this software on reddit turns up cracks for it
  9. A listicle with bizarrely outdated recommendations, like RealPlayer. The entire blog seems to be full of garbage-quality listicles.
  10. A script to download youtube videos for something called "keyboard maestro", which seems useful if you already use that software, but seems like a poor solution to this problem if you don't already use this software.

The best results by a large margin. The first link doesn't work, but you can easily get to youtube-dl from it. I certainly wouldn't try Leawo YouTube Downloader, but at least it's not so scammy that searching for the name of the project mostly returns results about how the project is some kind of badware or a scam, which is better than we got from Google or Bing. And we do get a recommendation for yt-dlp, with instructions, in a result that's just a blog post from someone who wants to help people who are trying to download youtube videos.

Kagi

Mwmbl

  1. Some youtube video downloader site, but one that no other search engine returned. There's a huge ad panel that displays "503 NA - Service Deprecating". The download link does nothing except for pop up some other ad panes that then disappear, leaving just the 503 "ad".
  2. $20 software for downloading youtube videos
  3. 2016 blog post on how to install and use youtube-dl. Sidebar has two low quality ads which don't appear to be scams and the main body has two ads interspersed, making this extremely low on ads compared to analogous results we've seen from large search engines
  4. Some youtube video download site. Has a giant banner claiming that it's "the only YouTube Downloader that is 100% ad-free and contains no popups.", which is probably not true, but the site does seem to be ad free and not have pop-ups. Download link seems to actually work.
  5. Youtube video on how to install and use youtube-dlg (a GUI wrapper for youtube-dl) on Linux (this query was run from a Mac).
  6. Link to what was a 2007 blogpost on how to download youtube videos, which automatically forwards to a 2020 ad-laden SEO blogspam listicle with bad suggestions. Article has two autoplay videos. Archive.org shows that the 2007 blog post had some reasonable options in it for the time, so this wasn't always a bad result.
  7. A blog post on a major site that's actually a sponsored post trying to get you to use a particular video downloader. Searching for comments on this on reddit indicates that users view the app as a waste of money that doesn't work. The site is also full of scammy and misleading ads for other products. E.g., I tried clicking on an ad that purports to save you money on "products". It loaded a fake "checking your computer" animation that supposedly checked my computer for compatibility with the extension and then another fake checking animation, after which I got a message saying that my computer is compatible and I'm eligible to save money. All I have to do is install this extension. Closing that window opens a new tab that reads "Hold up! Do you actually not want automated savings at checkout" with the options "Yes, Get Coupons" and "No, Don't Save". Clicking "No, Don't Save" is actually an ad that takes you back to a link that tries to get you to install a chrome extension.
  8. That "Norton Safe Web" youtube downloader site, except that the link is wrong and is to the version of the site that purports to download instagram videos instead of the one that purports to download youtube videos.
  9. Link to Google help explaining how you can download youtube videos that you personally uploaded
  10. SEO blogspam. It immediately has a pop-over to get you to subscribe to their newsletter. Closing that gives you another pop-over with the options "Subscribe" and "later". Clicking "later" does actually dismiss the 2nd pop-over. After closing the pop-overs, the article has instructions on how to install some software for windows. Searching for reviews of the software returns comments like "This is a PUP/PUA that can download unwanted applications to your pc or even malicious applications."

Basically the same as Google or Bing.

ChatGPT

Since ChatGPT expects more conversational queries, we'll use the prompt "How can I download youtube videos?"

The first attempt, on a Monday at 10:38am PT returned "Our systems are a bit busy at the moment, please take a break and try again soon.". The second attempt returned an answer saying that one should not download videos without paying for YouTube Premium, but if you want to, you can use third-party apps and websites. Following up with the question "What are the best third-party apps and websites?" returned another warning that you shouldn't use third-party apps and websites, followed by the ironic-for-GPT warning,

I don't endorse or provide information on specific third-party apps or websites for downloading YouTube videos. It's essential to use caution and adhere to legal and ethical guidelines when it comes to online content.

ad blocker

For our next query, we'll try "ad blocker". We'd like to get ublock origin. Failing that, an ad blocker that, by default, blocks ads. Failing that, something that isn't a scam and also doesn't inject extra ads or its own ads. Although what's best may change at any given moment, comparisons I've seen that don't stack the deck have often seemed to show that ublock origin has the best or among the best performance, and ublock origin is free and blocks ads.

Google

  1. "AdBlock — best ad blocker". Below the fold, notes "AdBlock participates in the Acceptable Ads program, so unobtrusive ads are not blocked", so this doesn't block all ads.
  2. Adblock Plus | The world's #1 free ad blocker. Page notes "Acceptable Ads are allowed by default to support websites", so this also does not block all ads by default
  3. AdBlock. Page notes that " Since 2015, we have participated in the Acceptable Ads program, where publishers agree to ensure their ads meet certain criteria. Ads that are deemed non-intrusive are shown by default to AdBlock users", so this doesn't block all ads
  4. "Adblock Plus - free ad blocker", same as (2), doesn't block all ads
  5. "AdGuard — World's most advanced adblocker!" Page tries to sell you on some kind of paid software, "AdGuard for Mac". Searching for AdGuard turns up a post from this person looking for an ad blocker that blocks ads injected by AdGuard. It seems that you can download it for free, but then, if you don't subscribe, they give you more ads?
  6. "AdBlock Pro" on safari store; has in-app purchases. It looks like you have to pay to unlock features like blocking videos
  7. [YouTube] "How youtube is handling the adblock backlash". 30 second video with 15 second ad before the video. Video has no actual content
  8. [YouTube] "My thoughts on the youtube adblocker drama"
  9. [YouTube] "How to Block Ads online in Google Chrome for FREE [2023]"; first comment on video is "your video doesnt [sic] tell how to stop Youtube adds [sic]". In the video, a person rambles for a bit and then googles ad blocker extension and then clicks the first link (same as our first link), saying, "If I can go ahead and go to my first website right here, so it's basically officially from Google .... [after installing, as a payment screen pops up asking you to pay $30 or a monthly or annual fee]"
  10. "AdBlock for Mobile" on the App Store. It's rated 3.2* on the iOS store. Lots of reviews indicate that it doesn't really work
  11. MalwareBytes ad blocker. A quick search indicates that it doesn't block all ads (unclear if that's deliberate or due to bugs)
  12. "Block ads in Chrome | AdGuard ad blocker", same as (5)
  13. [ad] NordVPN
  14. [ad] "#1 Best Free Ad Blocker (2024) - 100% Free Ad Blocker." Immediately seems scammy in that it has a fake year (this query was run in mid-November 2023). This is for something called TOTAL Ad Block. Searching for TOTAL Ad Block turns up results indicating that it's a scammy app that doesn't let you unsubscribe and basically tries to steal your money
  15. [ad] 100% Free & Easy Download - Automatic Ad Blocker. Actually for Avast browser and not an ad blocker. A quick search shows that this browser has a history of being less secure than just running chromium and that it collects an unusually large amount of information from users.

No links to ublock origin. Some links to scams, though not nearly as many as when trying to get a youtube downloader. Lots of links to ad blockers that deliberately only block some ads by default.

Bing

We're now three screens down from the result, so the equivalent of the above google results is just a bunch of ads and then links to one website. The note that something is an ad is much more subtle than I've seen on any other site. Given what we know about when users confuse ads with organic search results, it's likely that most users don't realize that the top results are ads and think that the links to scam ad blockers or the fake review site that tries to funnel you into installing a scam ad blocker are organic search results.

Marginalia

  1. "Is ad-blocker software permissible?" from judaism.stackexchange.com
  2. Blogspam for Ghostery. Ghostery's pricing page notes that you have to pay for "No Private Sponsored Links", so it seems like some features are behind a pay wall. Wikipedia says "Since July 2018, with version 8.2, Ghostery shows advertisements of its own to users", but it seems like this might be opt-in?
  3. https://shouldiblockads.com/. Explains why you might want to block ads. First recommendation is ublock origin
  4. "What’s the best ad blocker for you? - Firefox Add-ons Blog". First recommendation is ublock origin. Also provides what appears to be accurate information about other ad blockers.
  5. Blog post that's a personal account of why someone installed an ad blocker.
  6. Opera (browser).
  7. Blog post, anti-anti-adblocker polemic.
  8. ublock origin.
  9. Fairphone forum discussion on whether or not one should install an ad blocker.
  10. SEO site blogspam (as in, the site is an SEO optimization site and this is blogspam designed to generate backlinks and funnel traffic to the site).

Probably the best result we've seen so far, in that the third and fourth results suggest ublock origin and the first result is very clearly not an ad blocker. It's unfortunate that the second result is blogspam for Ghostery, but this is still better than we see from Google and Bing.

Mwmbl

  1. A bitly link to a "thinkpiece" on ad blocking from a VC thought leader.
  2. A link to cryptojackingtest, which forwards to Opera (the browser).
  3. A link to ghostery.
  4. Another link to ghostery.
  5. A link to something called 1blocker, which appears to be a paid ad blocker. Searching for reviews turns up comments like "I did 1blocker free trial and forgot to cancel so it signed me up for annual for $20 [sic]" (but comments indicate that the ad blocker does work).
  6. Blogspam for Ad Guard. There's a banner ad offering 40% off this ad blocker.
  7. An extremely ad-laden site that appears to be in the search results because it contains the text "ad blocker detected" if you use an ad blocker (I don't see this text on loading the page, but it's in the page preview on Mwmbl). The first page is literally just ads with a "read more" button. Clicking "read more" takes you to a different page, also full of ads, which contains the cartoon that is the "content".
  8. Another site that appears to be in the search results because it contains the text "ad blocker detected".
  9. Malwarebytes ad blocker, which doesn't appear to work.
  10. HN comments for article on youtube ad blocker crackdown. Scrolling to the 41st comment returns a recommendation for ublock origin.

Mwmbl lets users suggest results, so I tried signing up to add ublock origin. Gmail put the sign-up email into my spam folder. After adding ublock origin to the search results, it's now the #1 result for "ad blocker" when I search logged out, from an incognito window and all other results are pushed down by one. As mentioned above, the score for Mwmbl is from before I edited the search results and not after.

Kagi

Similar quality to Google and Bing. Maybe halfway in between in terms of the number of links to scams.

ChatGPT

Here, we tried the prompt "How do I install the best ad blocker?"

First suggestion is ublock origin. Second suggestion is adblock plus. This seems like the best result by a significant margin.

download firefox

Google

Mostly good links, but 2 out of the top 10 links are scams. And we didn't have a repeat of this situation I saw in 2017, where Google paid to get ranked above Firefox in a search for Firefox. For search queries where almost every search engine returns a lot of scams, I might rate having 2 out of the top 10 links be scams as "Ok" or perhaps even better but, here, where most search engines return no fake or scam links, I'm rating this as "Bad". You could make a case for "Ok" or "Good" here by saying that the vast majority of users will click one of the top links and never get as far as the 7th link, but I think that if Google is confident enough that's the case that they view it as unproblematic that the 7th and 10th links are scams, they should just only serve up the top links.

Bing

That's the entire first page. Seems pretty good. Nothing that looks like a scam.

Marginalia

Definitely worse than Bing, since none of the links are to download Firefox. Depending on how highly you rate users not getting scammed vs. having the exact right link, this might be better or worse than Google. In this post, scams are weighted relatively heavily, so Marginalia ranks above Google here.

Mwmbl

kagi.com

Maybe halfway in between Bing and Marginalia. No scams, but a lot of irrelevant links. Unlike some of the larger search engines, these links are almost all to download the wrong version of firefox, e.g., I'm on a Mac and almost all of the links are for windows downloads.

ChatGPT

The prompt "How do I download firefox?" returned technically incorrect instructions on how to download firefox. The instructions did start with going to the correct site, at which point I think users are likely to be able to download firefox by looking at the site and ignoring the instructions. Seems vaguely similar to marginalia, in that you can get to a download by clicking some links, but it's not exactly the right result. However, I think users are almost certain to find the correct steps and only likely with Marginalia, so ChatGPT is rated more highly than Marginalia for this query.

Why do wider tires have better grip?

Any explanation that's correct must, at a minimum, be consistent with the following:

This is one that has a lot of standard incorrect or incomplete answers, including:

Google

Bing

From skimming further, many of the other links are the same links as above. No link appears to answer the question.

Marginalia

Original query returns zero results. Removing the question mark returns a single result, which is the same as (3) and (4) from bing.

Mwmbl

  1. NYT article titled "Why Women Pay Higher Interest". This is the only returned result.

Removing the question mark returns an article about bike tires titled "Fat Tires During the Winter: What You Need to Know"

Kagi

  1. A knowledge card that incorrectly reads "wider tire has a greater contact patch with the ground, so can provide traction."
  2. (50) from google
  3. Reddit question with many incorrect answers
  4. Reddit question with many incorrect answers. Top answer is "The same reason that pressing your hand on the desk and sliding it takes more effort than doing the same with a finger. More rubber on the road = more friction".
  5. (3) and (4) from bing
  6. Youtube video titled "Do wider tyres give you more grip?". Clicking the video gives you 1:30 in ads before the video plays. The video is good, but it answers the question in the title of the video and not the question being asked of why this is the case. The first ad appears to be an ad revenue scam. The first link actually takes you to a second link, where any click takes you through some ad's referral link to a product.
  7. "This is why wider tires equals more grip". SEO blogspam for (6)
  8. SEO blogspam for another youtube video
  9. SEO blogspam for (6)
  10. Quora answer where the top answer doesn't answer the question and I can't read all of the answers because I'm not logged in or am not a premium member or something.
  11. Google (56), stolen text from other sites and a site that has popovers that try to trick you into clicking ads
  12. Pre-chat GPT nonsense text and a page that's full of ads. Unusually, the few ads that I clicked on seemed to be normal ads and not scams.
  13. Blogspam for ad farm that has pop-overs that try to get you to install badware.
  14. Page with ChatGPT-sounding nonsense. Has a "Last updated" timestamp that's server-side generated to match the exact moment you navigated to the page. Page tries to trick you into clicking on ads with a full-page popover. Ads don't seem to be scams, as far as I can tell.
  15. Page which incorrectly states "In summary, a wider tire does not give better traction, it is the same traction similar to a more narrow tire.". Has some ads that try to get you to install badware.

ChatGPT

Provides a list of "hallucinated" reasons. The list of reasons has better grammar than most web search results, but is still incorrect. It's not surprising that ChatGPT can't answer this question, since it often falls over on questions that are both easier to reason about and where the training data will contain many copies of the correct answer, e.g., Joss Fong noted that, when her niece asked ChatGPT about gravity, the response was nonsense: "... That's why a feather floats down slowly but a rock drops quickly — the Earth is pulling them both, but the rock gets pulled harder because it's heavier."

Overall, no search engine gives correct answers. Marginalia seems to be the best here in that it gives only a couple of links to wrong answers and no links to scams.

Why do they keep making cpu transistors smaller?

I had this question when I was in high school and my AP physics teacher explained to me that it was because making the transistors smaller allowed the CPU to be smaller, which let you make the whole computer smaller. Even at age 14, I could see that this was an absurd answer, not really different than today's ChatGPT hallucinations — at the time, computers tended to be much larger than they are now, and full of huge amounts of empty space, with the CPU taking up basically no space relative to the amount of space in the box and, on top of that, CPUs were actually getting bigger and not smaller as computers were getting smaller. I asked some other people and didn't really get an answer. This was also relatively early in the life of the public web and I wasn't able to find an answer other than something like "smaller transistors are faster" or "smaller = less capacitance". But why are they faster? And what makes them have less capacitance? Specifically, what about the geometry causes transistors to get faster as they shrink? It's not, in general, obvious that things should get faster if you shrink them, e.g., if you naively linearly shrink a wire, it doesn't appear that it should get faster at all: the cross sectional area is reduced quadratically, increasing resistance per unit length quadratically, but length is also reduced linearly, so total resistance only increases linearly. And then capacitance also decreases linearly, so it all cancels out. Anyway, for transistors, it turns out the same kind of straightforward scaling logic shows that they speed up (and back then, transistors were large enough and wire delay was relatively small enough that you got extremely large increases in performance from shrinking transistors). You could explain this to a high school student who's taken physics in a few minutes if you had the right explanation, but I couldn't find an answer to this question until I read a VLSI textbook.
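
To make the naive wire-scaling arithmetic above concrete, here's a minimal sketch. The R ~ L/A wire model and the assumption that capacitance shrinks linearly with size are simplifications for illustration, not a real device model:

    # Minimal sketch of the naive wire-scaling argument from the paragraph above.
    def naive_wire_scaling(k):
        # Shrink every linear dimension of a wire by a factor of k (k > 1).
        length = 1 / k                 # length shrinks linearly
        cross_section_area = 1 / k**2  # cross sectional area shrinks quadratically
        resistance = length / cross_section_area  # R ~ L / A, so R grows by a factor of k
        capacitance = 1 / k            # assume C shrinks roughly linearly with size
        return resistance * capacitance  # RC delay

    print(naive_wire_scaling(2))  # prints 1.0: the factors cancel, so the naive argument gives no speedup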

There's now enough content on the web that there must be multiple good explanations out there. Just to check, I used non-naive search terms to find some good results. Let's look at what happens when you use the naive search from above, though.

Google

Bing

Kagi

Marginalia

No results

Mwmbl

  1. A link to a Vox article titled "Why do artists keep making holiday albums?". This is the only result.

ChatGPT

Has non-answers like "increase performance". Asking ChatGPT to expand on this, with "Please explain the increased performance." results in more non-answers as well as fairly misleading answers, such as

Shorter Interconnects: Smaller transistors result in shorter distances between them. Shorter interconnects lead to lower resistance and capacitance, reducing the time it takes for signals to travel between transistors. Faster signal propagation enhances the overall speed and efficiency of the integrated circuit ... The reduced time it takes for signals to travel between transistors, combined with lower power consumption, allows for higher clock frequencies

I could see this seeming plausible to someone with no knowledge of electrical engineering, but this isn't too different from ChatGPT's explanation of gravity, "... That's why a feather floats down slowly but a rock drops quickly — the Earth is pulling them both, but the rock gets pulled harder because it's heavier."

vancouver snow forecast winter 2023

Good result: Environment Canada's snow forecast, predicting significantly below normal snow (and above normal temperatures)

Google

  1. Knowledge card from a local snow removal company, incorrectly stating "The forecast for the 2023/2024 season suggests that we can expect another winter marked by ample snowfall and temperatures hovering both slightly above and below the freezing mark. Be prepared ahead of time.". On opening the page, we see that the next sentence is "Have Alblaster [the name of the company] ready to handle your snow removal and salting. We have a proactive approach to winter weather so that you, your staff and your customers need not concern yourself with the approaching storms." and the goal of the link is to get you to buy snow removal services regardless of their necessity by writing a fake forecast.
  2. [question dropdown] "What is the winter prediction for Vancouver 2023?", incorrectly saying that it will be "quite snowy".
  3. [question dropdown] "What kind of winter is predicted for 2023 Canada?" Links to a forecast of Ontario's winter, so not only wrong province, but the wrong coast, and also not actually an answer to the question in the dropdown.
  4. [question dropdown] "What is the winter prediction for B.C. in 2023 2024?" Predicts that B.C. will have a wet and mild winter, which isn't wrong, but doesn't really answer the question.
  5. [question dropdown] "What is the prediction for 2023 2024 winter?" Has a prediction for U.S. weather
  6. Blogspam article that has a lot of pointless text with ads all over. Text is contradictory in various ways and doesn't answer the question. Has a huge pop-over ad that covers the top half of the page
  7. Another blogspam article from the same source. Lots of ads; doesn't answer the question
  8. Ad-laden article that answers some related questions, but not this question
  9. Extremely ad-laden article that's almost unreadable due to the number of ads. Talks a lot about El Nino. Eventually notes that we should see below-normal snow in B.C. due to El Nino, but B.C. is almost 1M km² and the forecast is not the same for all of B.C., so you could maybe hope that the comment about B.C. here applies to Vancouver, but this link only lets you guess at the answer
  10. Very ad-laden article, but it does have a map labeled "winter precipitation" which appears to be about snow and not rain. The map seems quite different from Environment Canada's map, but it does show reduced "winter precipitation" over Vancouver, so you might conclude the right thing from it.

Bing

Kagi

Marginalia

No results.

Mwmbl

ChatGPT

"What is the snow forecast for Vancouver in winter of 2023?"

Doesn't answer the question; recommends using a website, app, or weather service.

Asking "Could you please direct me to a weather website, app, or weather service that has the forecast?" causes ChatGPT to return random weather websites that don't have a seasonal snow forecast.

I retried a few times. One time, I accidentally pasted in the entire ChatGPT question, which meant that my question was prepended with "User\n". That time, ChatGPT suggested "the Canadian Meteorological Centre, Environment Canada, or other reputable weather websites". The top response when asking for the correct website was "Environment Canada Weather", which at least has a reasonable seeming seasonal snow forecast somewhere on the website. The other links were still to sites that aren't relevant.

Appendix: Google "knowledge card" results

In general, I've found Google knowledge card results to be quite poor, both for specific questions with easily findable answers as well as for silly questions like "when was running invented" which, for years, infamously returned "1748. Running was invented by Thomas Running when he tried to walk twice at the same time" (which was pulled from a Quora answer).

I had a doc where I was collecting every single knowledge card I saw to tabulate the fraction that were correct. I don't know that I'll ever turn that into a post, so here are some "random" queries with their knowledge card result (and, if anyone is curious, most knowledge card results I saw when I was tracking this were incorrect).

Appendix: FAQ

As already noted, the most common responses I get are generally things that are explicitly covered in the post, so I won't cover those again here. However, any time I write a post that looks at anything, I also get a slew of comments like the following and, indeed, that was one of the first comments I got on this post.

This isn't a peer-reviewed study, it's crap

As I noted in this other post,

There's nothing magic about academic papers. I have my name on a few publications, including one that won best paper award at the top conference in its field. My median blog post is more rigorous than my median paper or, for that matter, the median paper that I read.

When I write a paper, I have to deal with co-authors who push for putting in false or misleading material that makes the paper look good and my ability to push back against this has been fairly limited. On my blog, I don't have to deal with that and I can write up results that are accurate (to the best of my ability) even if it makes the result look less interesting or less likely to win an award.

The same thing applies here and, in fact, I have a best paper award in this field (information retrieval, or IR, colloquially called search). I don't find IR papers particularly rigorous. I did push very hard to make my top-conference best-paper-award-winning paper more rigorous and, while I won some of those fights, I lost others, and that paper has a number of issues that I wouldn't let pass in a blog post. I suspect that people who make comments like this mostly don't read papers and, to the extent they do, don't understand them.

Another common response is

Your table is wrong. I tried these queries on Kagi and got Good results for the queries [but phrased much more strongly]

I'm not sure why people feel so strongly about Kagi, but all of these kinds of responses so far have come from Kagi users. No one has gotten good results for the tire, transistor, or snow queries (note, again, that this is not a query looking for a daily forecast, as clearly implied by the "winter 2023" in the query), nor are the results for the other queries very good if you don't have an ad blocker. I suppose it's possible that the next person who tells me this actually has good results, but that seems fairly unlikely given the zero percent correctness rate so far.

For example, one user claimed that the results were all good, but they pinned GitHub results and only ran the queries for which you'd get a good result on GitHub. This is actually worse than what you get if you use Google or Bing and write good queries since you'll get noise in your results when GitHub is the wrong place to search. Of course, you could make a similar claim that Bing is amazing if you write non-naive queries, so it's curious that so many Kagi users are angrily writing me about this and no Google or Bing users are. Kagi appears to have tapped into the same vein that Tesla and Apple have managed to tap into, where users become incensed that someone is criticizing something they love and then write nonsensical defenses of their favorite product, which bodes well for Kagi. I've gotten comments like this from not just one Kagi user, but many.


  1. this person does go on to say ", but it is true that a lot of, like, tech industry/trade stuff has been overwhelmed by LLM-generated garbage". However, the results we see in this post generally seem to be non-LLM generated text, often pages pre-dating LLMs, and low quality results don't seem to be confined to or even particularly bad in tech-related areas. Or, to pick another example, our bluesky thought leader is in a local Portland band. If I search "[band name] members", I get a knowledge card which reads "[different band name] is a UK indie rock band formed in Glastonbury, Somerset. The band is composed of [names and instruments]." [return]
  2. For example, for a youtube downloader, my go-to would be to search HN, which returns reasonable results. Although that works, if it didn't, my next step would be to search reddit (but not using reddit search, of course), which returns a mix of good and bad results; searching for info about each result shows that the 2nd returned result (yt-dlp) is good and most of the other results are quite bad. Other people have different ways of getting good results, e.g., Laurence Tratt's reflex is to search for "youtube downloader cli" and Heath Borders's is to search for "YouTube Downloader GitHub"; both of those searches work decently as well. If you're someone whose bag of tricks includes the right contortions to get good results for almost any search, it's easy to not realize that most users don't actually know how to do this. From having watched non-expert users try to use computers with advice from expert users, it's clear that many sophisticated users severely underestimate how much knowledge they have. For example, I've heard many programmers say that they're good at using computers because "I just click on random things to see what happens". Maybe so, but when they give this advice to naive users, this generally doesn't go well and the naive users will click on the wrong random things. The expert user is not, in fact, just clicking on things at random; they're using their mental model to try clicks that could make sense. Similarly with search, where people will give semi-plausible sounding advice like "just add site:reddit.com to queries". But adding "site:reddit.com" makes many queries worse instead of better — you have to have a mental model of which queries this works on and which queries this fails on.

    When people have some kind of algorithm that they consistently use, it's often one that has poor results that is also very surprising to technical folks. For example, Misha Yagudin noted, "I recently talked to some Russian emigrates in Capetown (two couples have travel agencies, and another couple does RUB<>USDT<>USD). They were surprised I am not on social media, and I discovered that people use Instagram (!!) instead of Google to find products and services these days. The recipe is to search for something you want 'triathlon equipment,' click around a bit, then over the next few days you will get a bunch of recommendations, and by clicking a bit more you will get even better recommendations. This was wild to me."

    [return]
  3. she did better than naive computer users, but still had a lot of holes in her mental model that would lead to installing malware on her machine. For a sense of what it's like for normal computer users, the internet is full of stories from programmers like "The number of times I had to yell at family members to NOT CLICK THAT ITS AN AD is maddening. It required getting a pretty nasty virus and a complete wipe to actually convince my dad to install adblock.". The internet is full of scam ads that outrank search results and install malware, a decent fraction of users are on devices that have been owned by clicking on an ad or malicious SEO'd search result, and you have to constantly watch most users if you want to stop their device from being owned. [return]
  4. accidentally prepending "User\n" to one query got it to return a good result instead of bad results, reminiscent of how ChatGPT "thought" Colin Percival was dead if you asked it to "write about" him, but alive if you asked it to "Write about" him. It's already commonplace for search ranking to be done with multiple levels of ranking, so perhaps you could get good results by running randomly perturbed queries and using a 2nd level ranker, or ChatGPT could even have something like this built in. [return]
  5. some time after Google stopped returning every tweet I wanted to find, Twitter search worked well enough that I could find tweets with Twitter search. However, post-acquisition, Twitter search often doesn't work in various ways. For maybe 3-5 months, search didn't return any of my tweets at all. And both before and after that period, searches often fail to return a tweet even when I search for an exact substring of a tweet, so now I often have to resort to various weird searches for things that I expect to link to the tweet I'm looking for so I can manually follow the link to get to the tweet. [return]

Transcript of Elon Musk on stage with Dave Chapelle

2022-12-11 08:00:00

This is a transcription of videos of Elon Musk's appearance on stage with Dave Chapelle using OpenAI's Whisper model, with some manual error corrections and annotations for crowd noise.

As with the Exhibit H Twitter text message release, there are a lot of articles that quote bits of this, but the articles generally miss a lot of what happened and often paint a misleading picture of what happened, and the entire thing is short enough that you might as well watch or read it instead of reading someone's misleading summary. In general, the media seems to want to paint a highly unflattering picture of Elon, resulting in articles and viral tweets that are factually incorrect. For example, it's been widely incorrectly reported that, during the "I'm rich, bitch" part, horns were played to drown out the crowd's booing of Elon, but the horn sounds were played when the previous person said the same thing, which was the most cheered statement that was recorded. The sounds are much weaker when Elon says "I'm rich, bitch" and can't be heard clearly, but it sounds like a mix of booing and cheering. It was probably the most positive crowd response that Elon got from anything and it seems inaccurate in at least two ways to say that horns were played to drown out the booing Elon was receiving. On the other hand, even though the media has tried to paint as negative a picture of Elon as possible, it's done quite a poor job, and a boring, accurate accounting of what happened in many of the other sections is much less flattering than the misleading summaries that are being passed around.

Chat log exhibits from Twitter v. Musk case

2022-10-01 08:00:00

This is a scan/OCR of Exhibits H and J from the Twitter v. Musk case, with some of the conversations de-interleaved and of course converted from a fuzzy scan to text to make for easier reading.

I did this so that I could easily read this and, after reading it, I've found that most accountings of what was said are, in one way or another, fairly misleading. Since the texts aren't all that long, if you're interested in what they said, I would recommend that you just read the texts in their entirety (to the extent they're available — the texts make it clear that some parts of conversations are simply not included) instead of reading what various journalists excerpted, which seems to sometimes be deliberately misleading because selectively quoting allows them to write a story that matches their agenda and sometimes accidentally misleading because they don't know what's interesting about the texts.

If you want to compare these conversations to other executive / leadership conversations, you can compare them to Microsoft emails and memos that came out of the DoJ case against Microsoft and the Enron email dataset.

Since this was done using OCR, it's likely there are OCR errors. Please feel free to contact me if you see an error.

Exhibit H

Exhibit J

Equity / financing commitments from 2022-05-05 SEC filing

If you're curious about the outcomes of the funding discussions above, the winners are listed in the Schedule 13D.

Thanks to @tech31842, @agentwaj, and mr. zip for OCR corrections

Futurist prediction methods and accuracy

2022-09-12 08:00:00

I've been reading a lot of predictions from people who are looking to understand what problems humanity will face 10-50 years out (and sometimes longer) in order to work in areas that will be instrumental for the future and wondering how accurate these predictions of the future are. The timeframe of predictions that are so far out means that only a tiny fraction of people making those kinds of predictions today have a track record so, if we want to evaluate which predictions are plausible, we need to look at something other than track record.

The idea behind the approach of this post was to look at predictions from an independently chosen set of predictors (Wikipedia's list of well-known futurists1) whose predictions are old enough to evaluate in order to understand which prediction techniques worked and which ones didn't work, allowing us to then (mostly in a future post) evaluate the plausibility of predictions that use similar methodologies.

Unfortunately, every single predictor from the independently chosen set had a poor record and, on spot checking some predictions from other futurists, it appears that futurists often have a fairly poor track record of predictions so, in order to contrast techniques that worked with techniques that didn't, I sourced predictors that have a decent track record from my memory, a non-independent source which introduces quite a few potential biases.

Something that gives me more confidence than I'd otherwise have is that I avoided reading independent evaluations of prediction methodologies until after I did the evaluations for this post and wrote 98% of the post and, on reading other people's evaluations, I found that I generally agreed with Tetlock's Superforecasting on what worked and what didn't work despite using a wildly different data set.

In particular, people who were into "big ideas", who used a few big hammers on every prediction combined with a cocktail party idea level of understanding of the particular subject to explain why a prediction about the subject would fall to the big hammer, generally fared poorly, whether or not their favored big ideas were correct. Some examples of "big ideas" would be "environmental doomsday is coming and hyperconservation will pervade everything", "economic growth will create near-infinite wealth (soon)", "Moore's law is supremely important", "quantum mechanics is supremely important", etc. Another common trait of poor predictors is a lack of anything resembling serious evaluation of past predictive errors, making improving their intuition or methods impossible (unless they do so in secret). Instead, poor predictors often pick a few predictions that were accurate or at least vaguely sounded similar to an accurate prediction and use those to sell their next generation of predictions to others.

By contrast, people who had (relatively) accurate predictions had a deep understanding of the problem and also tended to have a record of learning lessons from past predictive errors. Due to the differences in the data sets between this post and Tetlock's work, the details are quite different here. The predictors that I found to be relatively accurate had deep domain knowledge and, implicitly, had access to a huge amount of information that they filtered effectively in order to make good predictions. Tetlock was studying people who made predictions about a wide variety of areas that were, in general, outside of their areas of expertise, so what Tetlock found was that the accurate predictors really dug into the data and deeply understood the limitations of the data, which allowed them to make relatively accurate predictions. But, although the details of how people operated are different, at a high level, the approach of really digging into specific knowledge was the same.

Because this post is so long, this post will contain a very short summary about each predictor followed by a moderately long summary on each predictor. Then we'll have a summary of what techniques and styles worked and what didn't work, with the full details of the prediction grading and comparisons to other evaluations of predictors in the appendix.

Ray Kurzweil

Ray Kurzweil has claimed to have an 86% accuracy rate on his predictions, a claim which is often repeated, such as by Peter Diamandis where he says:

Of the 147 predictions that Kurzweil has made since the 1990's, fully 115 of them have turned out to be correct, and another 12 have turned out to be "essentially correct" (off by a year or two), giving his predictions a stunning 86% accuracy rate.

The article, titled "A Google Exec Just Claimed The Singularity Will Happen by 2029", opens with "Ray Kurzweil, Google's Director of Engineering, is a well-known futurist with a high-hitting track record for accurate predictions." and it cites this list of predictions on wikipedia. 86% is an astoundingly good track record for non-obvious, major predictions about the future. This claim seems to be the source of other people claiming that Kurzweil has a high accuracy rate, such as here and here. I checked the accuracy rate of the wikipedia list Diamandis cited myself (using archive.org to get the list from when his article was published) and found a somewhat lower accuracy of 7%.

Fundamentally, the thing that derailed so many of Kurzweil's predictions is that he relied on the idea of exponential and accelerating growth in basically every area he could imagine, and even in a number of areas that have had major growth, the growth didn't keep pace with his expectations. His basic thesis is that not only do we have exponential growth due to progress (improving technology, etc.), but improvement in technology also feeds back into itself, causing an increase in the rate of exponential growth, so we have double exponential growth (as in e^x^x, not 2*e^x) in many important areas, such as computer performance. He repeatedly talks about this unstoppable exponential or super exponential growth, e.g., in his 1990 book, The Age of Intelligent Machines, he says "One reliable prediction we can make about the future is that the pace of change will continue to accelerate" and he discusses this again in his 1999 book, The Age of Spiritual Machines, his 2001 essay on accelerating technological growth, titled "The Law of Accelerating Returns", his 2005 book, The Singularity is Near, etc.
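
As a minimal illustration of the distinction in that parenthetical (reading e^x^x as e raised to the x^x power; the exact functional form here is just an assumption for illustration), doubling an exponential only changes the constant in front, while nesting the exponent changes the character of the growth entirely:

    import math

    # Illustrative comparison only; not Kurzweil's actual model.
    for x in range(1, 5):
        single = math.exp(x)        # e^x
        doubled = 2 * math.exp(x)   # 2*e^x: same growth rate, bigger constant
        nested = math.exp(x ** x)   # e^(x^x): the exponent itself grows
        print(x, round(single, 1), round(doubled, 1), f"{nested:.3g}")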

One thing that's notable is that, despite the vast majority of his falsifiable predictions from earlier work being false, Kurzweil continues to use the same methodology to generate new predictions each time, which is reminiscent of Andrew Gelman's discussion of forecasters who repeatedly forecast the same thing over and over again in the face of evidence that their old forecasts were wrong. For example, in his 2005 The Singularity is Near, Kurzweil notes the existence of "S-curves", where growth from any particular "thing" isn't necessarily exponential, but, as he did in 1990, concludes that exponential growth will continue because some new technology will inevitably be invented which will cause exponential growth to continue and that "The law of accelerating returns applies to all of technology, indeed to any evolutionary process. It can be charted with remarkable precision in information-based technologies because we have well-defined indexes (for example, calculations per second per dollar, or calculations per second per gram) to measure them".

In 2001, he uses this method to plot a graph and then predicts unbounded life expectancy by 2011 (the quote below isn't unambiguous on life expectancy being unbounded, but it's unambiguous if you read the entire essay or his clarification on his life expectancy predictions, where he says "I don’t mean life expectancy based on your birthdate, but rather your remaining life expectancy"):

Most of you (again I’m using the plural form of the word) are likely to be around to see the Singularity. The expanding human life span is another one of those exponential trends. In the eighteenth century, we added a few days every year to human longevity; during the nineteenth century we added a couple of weeks each year; and now we’re adding almost a half a year every year. With the revolutions in genomics, proteomics, rational drug design, therapeutic cloning of our own organs and tissues, and related developments in bio-information sciences, we will be adding more than a year every year within ten years.

Kurzweil pushes the date this is expected to happen back by more than one year per year (the last citation I saw on this was a 2016 prediction that we would have unbounded life expectancy by 2029), which is characteristic of many of Kurzweil's predictions.
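
To see why the "adding more than a year every year" threshold is doing all the work in this prediction, here's a toy sketch with made-up numbers (not a demographic model): if medicine hands back less than one year of remaining life expectancy per calendar year, the clock still runs out; at one year per year or more, it never does.

    # Toy model only: each calendar year uses up one year of remaining life
    # expectancy and medical progress hands back gain_per_year years.
    def years_until_remaining_life_runs_out(remaining_years, gain_per_year):
        elapsed = 0
        while remaining_years > 0:
            remaining_years += gain_per_year - 1.0
            elapsed += 1
            if elapsed > 10_000:
                return None  # effectively unbounded
        return elapsed

    print(years_until_remaining_life_runs_out(40, 0.2))  # roughly 50 years
    print(years_until_remaining_life_runs_out(40, 0.5))  # roughly 80 years
    print(years_until_remaining_life_runs_out(40, 1.0))  # None: never runs out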

Quite a few people have said that Kurzweil's methodology is absurd because exponential growth can't continue indefinitely in the real world, but Kurzweil explains why he believes this is untrue in his 1990 book, The Age of Intelligent Machines:

A remarkable aspect of this new technology is that it uses almost no natural resources. Silicon chips use infinitesimal amounts of sand and other readily available materials. They use insignificant amounts of electricity. As computers grow smaller and smaller, the material resources utilized are becoming an inconsequential portion of their value. Indeed, software uses virtually no resources at all.

That we're entering a world of natural resource abundance because resources and power are irrelevant to computers hasn't been correct so far, but luckily for Kurzweil, many of the exponential and double exponential processes he predicted would continue indefinitely stopped long before natural resource limits would come into play, so this wasn't a major reason Kurzweil's predictions have been wrong, although it would be if his predictions were less inaccurate.

At a meta level, one issue with Kurzweil's methodology is that he has a propensity to "round up" to make growth look faster than it is in order to fit the world to his model. For example, in "The Law of Accelerating Returns", we noted that Kurzweil predicted unbounded lifespan by 2011 based on accelerating lifespan, citing his 2001 claim that "now we’re adding almost a half a year every year". However, life expectancy growth in the U.S. (which, based on his comments, seems to be most of what Kurzweil writes about) was only 0.2 years per year overall and 0.1 years per year in longer lived demographics, and worldwide life expectancy growth was 0.3 years per year. While it's technically true that you can round 0.3 to 0.5 if you're rounding to the nearest 0.5, that's a very unreasonable thing to do when trying to guess when unbounded lifespan will happen because the high rate of increase in worldwide life expectancy was mostly coming from "catch up growth", where there was a large reduction in things that caused "unnaturally" shortened lifespans.

If you want to predict what's going to happen at the high end, it makes more sense to look at high-end lifespans, which were increasing much more slowly. Another way in which Kurzweil rounded up to get his optimistic prediction was to select a framing that made it look like we were seeing extremely rapid growth in life expectancies. But if we simply plot life expectancy over time since, say, 1950, we can see that growth is mostly linear-ish trending to sub-linear (and this is true even if we cut the graph off when Kurzweil was writing in 2001), with some super-linear periods that trend down to sub-linear. Kurzweil says he's a fan of using indexes, etc., to look at growth curves, but in this case where he can easily do so, he instead chooses to pick some numbers out of the air because his "standard" methodology of looking at the growth curves results in a fairly boring prediction of lifespan growth slowing down, so there are three kinds of rounding up in play here (picking an unreasonably optimistic number, rounding up that number, and then selectively not plotting a bunch of points on the time series to paint the picture Kurzweil wants to present).

Kurzweil's "rounding up" is also how he came up with the predictions that, among other things, computer performance/size/cost and economic growth would follow double exponential trajectories. For computer cost / transistor size, Kurzweil plotted, on a log scale, a number of points on the silicon scaling curve, plus one very old point from the pre-silicon days, when transistor size was on a different scaling curve. He then fits what appears to be a cubic to this, and since a cubic "wants to" either have high growth or high anti-growth in the future, and the pre-silicon point puts pulls the cubic fit very far down in the past, the cubic fit must "want to" go up in the future and Kurzweil rounds up this cubic growth to exponential. This was also very weakly supported by the transistor scaling curve at the time Kurzweil was writing. As someone who was following ITRS roadmaps at the time, my recollection is that ITRS set a predicted Moore's law scaling curve and semiconductor companies raced to beat curve, briefly allowing what appeared to be super-exponential scaling since they would consistently beat the roadmap, which was indexed against Moore's law. However, anyone who actually looked at the details of what was going on or talked to semiconductor engineers instead of just looking at the scaling curve would've known that people generally expected both that super-exponential scaling was temporary and not sustainable and that the end of Dennard scaling as well as transistor-delay dominated (as opposed to interconnect delay-dominated) high-performance processors were imminent, meaning that exponential scaling of transistor sizes would not lead to the historical computer performance gains that had previously accompanied transistor scaling; this expectation was so widespread that it was discussed in undergraduate classes at the time. Anyone who spent even the briefest amount of time looking into semiconductor scaling would've known these things at the time Kurzweil was talking about how we were entering an era of double exponential scaling and would've thought that we would be lucky to even having general single exponential scaling of computer performance, but since Kurzweil looks at the general shape of the curve and not the mechanism, none of this knowledge informed his predictions, and since Kurzweil rounds up the available evidence to support his ideas about accelerating acceleration of growth, he was able to find a selected set of data points that supported the curve fit he was looking for.

We'll see this kind of rounding up done by other futurists discussed here, as well as longtermists discussed in the appendix, and we'll also see some of the same themes over and over again, particularly exponential growth and the idea that exponential growth will lead to even faster exponential growth due to improvements in technology causing an acceleration of the rate at which technology improves.

Jacque Fresco

In 1969, Jacque Fresco wrote Looking Forward. Fresco claims it's possible to predict the future by knowing what values people will have in the future and then using that to derive what the future will look like. Fresco doesn't describe how one can know the values people will have in the future and assumes people will have the values he has, which one might describe as 60s/70s hippy values. Another major mechanism he uses to predict the future is the idea that people of the future will be more scientific and apply the scientific method.

He writes about how "the scientific method" is only applied in a limited fashion, which led to thousands of years of slow progress. But, unlike in the 20th century, in the 21st century, people will be free from bias and apply "the scientific method" in all areas of their life, not just when doing science. People will be fully open to experimentation in all aspects of life and all people will have "a habitual open-mindedness coupled with a rigid insistence that all problems be formulated in a way that permits factual checking".

This will, among other things, lead to complete self-knowledge of one's own limitations for all people as well as an end to unhappiness due to suboptimal political and social structures.

The third major mechanism Fresco uses to derive his predictions is the idea that computers will be able to solve basically any problem one can imagine and that manufacturing technology will also progress similarly.

Each of the major mechanisms in play in Fresco's predictions is indistinguishable from magic. If you can imagine a problem in the domain, the mechanism is able to solve it. There are other magical mechanisms in play as well, generally what was in the air at the time. For example, behaviorism and operant conditioning were very trendy at the time, so Fresco assumes that society at large will be able to operant condition itself out of any social problems that might exist.

Although most of Fresco's predictions are technically not yet judgable because they're about the far future, for the predictions he makes whose time has come, I didn't see one accurate prediction.

Buckminster Fuller

Fuller is best known for inventing the geodesic dome, although geodesic domes were actually made by Walther Bauersfeld decades before Fuller "invented" them. Fuller is also known for a variety of other creations, like the Dymaxion car, as well as his futurist predictions.

I couldn't find a great source of a very long list of predictions from Fuller, but I did find this interview, where he makes a number of predictions. Fuller basically free associates with words, making predictions by riffing off of the English meaning of the word (e.g., see the teleportation prediction) or sometimes an even vaguer link.

Predictions from the video:

For those who've heard that Fuller predicted the creation of Bitcoin, that last prediction about an accounting system for wealth is the one people are referring to. Typically, people who say this haven't actually listened to the interview where he states the whole prediction and are themselves using Fuller's free association method. Bitcoin comes from spending energy to mine Bitcoin and Fuller predicted that the future would have a system of wealth based on energy, therefore Fuller predicted the creation of Bitcoin. If you actually listen to the interview, Bitcoin doesn't even come close to satisfying the properties of the system Fuller describes, but that doesn't matter if you're doing Fuller-style free association.

In this post, Fuller has fewer predictions graded than almost anyone else, so it's unclear what his accuracy would be if we had a list of, say, 100 predictions, but the predictions I could find have a 0% accuracy rate.

Michio Kaku

Among people on Wikipedia's futurist list, Michio Kaku is probably relatively well known because, as part of his work on science popularization, he's had a nationally (U.S.) syndicated radio show since 2006 and he frequently appears on talk shows and is interviewed by news organizations.

In his 1997 book, Visions: How Science Will Revolutionize the 21st Century, Kaku explains why predictions from other futurists haven't been very accurate and why his predictions are different:

... most predictions of the future have floundered because they have reflected the eccentric, often narrow viewpoints of a single individual.

The same is not true of Visions. In the course of writing numerous books, articles, and science commentaries, I have had the rare privilege of interviewing over 150 scientists from various disciplines during a ten-year period.

On the basis of these interviews, I have tried to be careful to delineate the time frame over which certain predictions will or will not be realized. Scientists expect some predictions to come about by the year 2020; others will not materialize until much later—from 2050 to the year 2100.

Kaku also claims that his predictions are more accurate than many other futurists because he's a physicist and thinking about things in the ways that physicists do allows for accurate predictions of the future:

It is, I think, an important distinction between Visions, which concerns an emerging consensus among the scientists themselves, and the predictions in the popular press made almost exclusively by writers, journalists, sociologists, science fiction writers, and others who are consumers of technology, rather than by those who have helped to shape and create it. ... As a research physicist, I believe that physicists have been particularly successful at predicting the broad outlines of the future. Professionally, I work in one of the most fundamental areas of physics, the quest to complete Einstein’s dream of a “theory of everything.” As a result, I am constantly reminded of the ways in which quantum physics touches many of the key discoveries that shaped the twentieth century.

In the past, the track record of physicists has been formidable: we have been intimately involved with introducing a host of pivotal inventions (TV, radio, radar, X-rays, the transistor, the computer, the laser, the atomic bomb), decoding the DNA molecule, opening new dimensions in probing the body with PET, MRI, and CAT scans, and even designing the Internet and the World Wide Web.

He also specifically calls out Kurzweil's predictions as absurd, saying Kurzweil has "preposterous predictions about the decades ahead, from vacationing on Mars to banishing all diseases."

Although Kaku finds Kurzweil's predictions ridiculous, his predictions rely on some of the same mechanics Kurzweil relies on. For example, Kaku assumes that materials / commodity prices will tank in the then-near future because the advance of technology will make raw materials less important and Kaku also assumes the performance and cost scaling of computer chips would continue on the historical path it was on during the 70s and 80s. Like most of the other futurists from Wikipedia's list, Kaku also assumed that the pace of scientific progress would rapidly increase, although his reasons are different (he cites increased synergy between the important fields of quantum mechanics, computer science, and biology, which he says are so important that "it will be difficult to be a research scientist in the future without having some working knowledge of" all of those fields).

Kaku assumed that UV lithography would run out of steam and that we'd have to switch to X-ray or electron lithography, which would then run out of steam, requiring us to switch to a fundamentally different substrate for computers (optical, molecular, or DNA) to keep performance and scaling on track, but advances in other fundamental computing substrates have not materialized quickly enough for Kaku's predictions to come to pass. Kaku assigned very high weight to things that have what he considers "quantum" effects, which is why, for example, he cites the microprocessor as something that will be obsolete by 2020 (they're not "quantum") whereas fiber optics will not be obsolete (they rely on "quantum" mechanisms). Although Kaku pans other futurists for making predictions without having a real understanding of the topics they're discussing, it's not clear that Kaku has a better understanding of many of the topics being discussed even though, as a physicist, Kaku has more relevant background knowledge.

The combination of assumptions above that didn't pan out leads to a fairly low accuracy rate for Kaku's predictions in Visions.

I didn't finish Visions, but the prediction accuracy rate of the part of the book I read (from the beginning until somewhere in the middle, to avoid cherry picking) was 3% (arguably 6% if you give full credit to the prediction I gave half credit to). He made quite a few predictions I didn't score in which he said something "may" happen. Such a prediction is, of course, unfalsifiable because the statement is true whether or not the event happens.

John Naisbitt

Anyone who's a regular used book store bargain bin shopper will have seen this name on the cover of Megatrends, which must be up there with Lee Iacocca's autobiography as one of the most common bargain bin fillers.

Naisbitt claims that he's able to accurately predict the future using "content analysis" of newspapers, which he says was used to provide great insights during WWII and has been widely used by the intelligence community since then, but hadn't been commercially applied until he did it. Naisbitt explains that this works because there's a fixed amount of space in newspapers (apparently newspapers can't be created or destroyed nor can newspapers decide to print significantly more or less news or have editorial shifts in what they decide to print that are not reflected by identical changes in society at large):

Why are we so confident that content analysis is an effective way to monitor social change? Simply stated, because the news hole in a newspaper is a closed system. For economic reasons, the amount of space devoted to news in a newspaper does not change significantly over time. So, when something new is introduced, something else or a combination of things must be omitted. You cannot add unless you subtract. It is the principle of forced choice in a closed system.

Unfortunately, it's not really possible to judge Naisbitt's predictions because he almost exclusively deals in vague, horoscope-like predictions which can't really be judged as correct or incorrect. If you just read Megatrends for the flavor of each chapter and don't try to pick out individual predictions, some chapters seem quite good, e.g., "Industrial Society -> Information Society", but some are decidedly mixed even if you very generously grade his vague predictions, e.g., "From Forced Technology to High Tech / High Touch". This can't really be compared to the other futurists in this post because it's much easier to make vague predictions sound roughly correct than to make precise predictions correct but, even so, if reading for a general feel of what direction the future might go, Naisbitt's predictions are much more on the mark than those of any other futurist discussed.

That being said, as far as I read in his book, the one concrete prediction I could find was incorrect, so if you want to score Naisbitt comparably to the other futurists discussed here, you might say his accuracy rate is 0% but with very wide error bars.

Gerard K. O'Neill

O'Neill has two relatively well-known non-fiction futurist books, 2081 and The Technology Edge. 2081 was written in 1980 and predicts the future 100 years from then. The Technology Edge discusses what O'Neill thought the U.S. needed to do in 1983 to avoid being obsoleted by Japan.

O'Neill spends a lot more space on discussing why previous futurists were wrong than any other futurist under discussion. O'Neill notes that "most [futurists] overestimated how much the world would be transformed by social and political change and underestimated the forces of technological change" and cites Kipling, Verne, Wells, Haldane, and Bellamy as people who did this. O'Neill also says that "scientists tend to overestimate the chances for major scientific breakthroughs and underestimate the effects of straightforward developments well within the boundaries of existing knowledge" and cites Haldane again on this one. O'Neill also cites spaceflight as a major miss of futurists past, saying that they tended to underestimate how quickly spaceflight was going to develop.

O'Neill also says that it's possible to predict the future without knowing the exact mechanism by which the change will occur. For example, he claims that the automobile could've been safely predicted even if the internal combustion engine hadn't been invented because steam would've also worked. But he also goes on to say that there are things it would've been unreasonable to predict, like the radio, TV, and electronic communications, because even though the foundations for those were discovered in 1865, the time interval between a foundational discovery and its application is "usually quite long"; he cites 30-50 years from quantum mechanics to integrated circuits, 100+ years from relativity to faster-than-light travel, and 50+ years from the invention of nuclear power without "a profound impact".

I don't think O'Neill ever really explains why his predictions are of the "automobile" kind in a convincing way. Instead, he relies on doing the opposite of what he sees as mistakes others made. The result is that he predicts huge advancements in space flight, saying we should expect large scale space travel and colonization by 2081, presaged by wireless transmission of energy by 2000 (referring to energy beamed down from satellites) and interstellar probes by 2025 (presumably something of a different class than the Voyager probes, which were sent out in 1977).

In 1981, he said "a fleet of reusable vehicles of 1990s vintage, numbering much less than today's world fleet of commercial jet transports, would be quite enough to provide transport into space and back again for several hundred million people per year", predicting that something much more advanced than the NASA Space Shuttle would be produced shortly afterwards. Continuing that progress, "by the year 2010 or thereabouts there will be many space colonies in existence and many new ones being constructed each year".

Most of O'Neill's predictions are for 2081, but he does make the occasional prediction for a time before 1981. All of the falsifiable ones I could find were incorrect, giving him an accuracy rate of approximately 0% but with fairly wide error bars.

Patrick Dixon

Dixon is best known for writing Futurewise, but he has quite a few books with predictions about the future. In this post, we're just going to look at Futurewise, because it's the most prediction-oriented book Dixon has that's old enough that we ought to be able to make a call on a decent number of his predictions (Futurewise is from 1998; his other obvious candidate, The Future of Almost Everything is from 2015 and looks forward a century).

Unlike most other futurists featured in this post, Dixon doesn't explicitly lay out in Futurewise itself why you should trust his predictions, although he sort of implicitly does so in the acknowledgements, where he mentions having interacted with many very important people.

I am indebted to the hundreds of senior executives who have shaped this book by their participation in presentations on the Six Faces of the Future. The content has been forged in the realities of their own experience.

And although he doesn't explicitly refer to himself, he also says that business success will come from listening to folks who have great vision:

Those who are often right will make a fortune. Trend hunting in the future will be a far cry from the seventies or eighties, when everything was more certain. In a globalized market there are too many variables for back-projection and forward-projection to work reliably .. That's why economists don't make good futurologists when it comes to new technologies, and why so many boards of large corporations are in such a mess when it comes to quantum leaps in thinking beyond 2000.

Second millennial thinking will never get us there ... A senior board member of a Fortune 1000 company told me recently: 'I'm glad I'm retiring so I don't have to face these decisions' ... 'What can we do?' another senior executive declares ...

Later, in The Future of Almost Everything, Dixon lays out the techniques that he says worked when he wrote Futurewise, which "has stood the test of time for more than 17 years". Dixon says:

All reliable, long-range forecasting is based on powerful megatrends that have been driving profound, consistent and therefore relatively predictable change over the last 30 years. Such trends are the basis of every well-constructed corporate strategy and government policy ... These wider trends have been obvious to most trend analysts like myself for a while, and have been well described over the last 20–30 years. They have evolved much more slowly than booms and busts, or social fads.

And lays out trends such as:

Dixon declines to mention trends he predicted that didn't come to pass (such as his prediction that increased tribalism will mean that most new wealth is created in small firms of 20 or fewer employees which will mostly be family owned, or his prediction that the death of "old economics" means that we'll be able to have high economic growth with low unemployment and no inflationary pressure indefinitely), or cases where the trend progression caused Dixon's prediction to be wildly incorrect, a common problem when making predictions off of exponential trends because a relatively small inaccuracy in the rate of change can result in a very large change in the final state.
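To make that last point concrete, here's a minimal sketch of how sensitive exponential extrapolation is to the assumed rate; the growth rates and time horizon below are illustrative assumptions, not numbers from Dixon or any other futurist discussed here:

```python
# Illustrative only: a small error in an assumed exponential growth rate
# compounds into a large error in the predicted end state.

def extrapolate(start, annual_growth, years):
    """Project a quantity forward assuming constant exponential growth."""
    return start * (1 + annual_growth) ** years

start_value = 1.0
years = 25

predicted = extrapolate(start_value, 0.41, years)  # assumes ~doubling every 2 years
actual = extrapolate(start_value, 0.26, years)     # turns out to be ~doubling every 3 years

print(f"predicted growth: {predicted:,.0f}x")
print(f"actual growth:    {actual:,.0f}x")
print(f"prediction is off by ~{predicted / actual:.0f}x")
```

With these made-up numbers, a 15 percentage point difference in the annual growth rate leaves the 25-year prediction off by more than an order of magnitude.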

Dixon's website is full of endorsements for him, with implicit and explicit claims that he's a great predictor of the future, as well as more general statements such as "Patrick Dixon has been ranked as one of the 20 most influential business thinkers alive today".

Back in Futurewise, Dixon relies heavily on the idea of a stark divide between "second millennial thinking" and "third millennial thinking", which repeatedly comes up in his text. Like nearly everyone else under discussion, Dixon also extrapolates out from many existing trends to make predictions that didn't pan out, e.g., he looked at the falling cost of phone lines and predicted that people would end up with a huge number of phone lines in their home by 2005, and that screens getting thinner would mean that we'd have "paper-thin display sheets" in significant use by 2005. This kind of extrapolation sometimes works, and Dixon's overall accuracy rate of 10% is quite good compared to the other "futurists" under discussion here.

However, when Dixon explains his reasoning in areas I have some understanding of, he seems to be operating at the buzzword level, so that when he makes a correct call, it's generally for the wrong reasons. For example, Dixon says that software will always be buggy, which seems true, at least to date. However, his reasoning for this is that new computers come out so frequently (he says "less than 20 months" — a reference to the 18 month timeline in Moore's law) and it takes so long to write good software ("at least 20 years") that programmers will always be too busy rewriting software to run on the new generation of machines (due to the age of the book, he uses the example of "brand new code ... written for Pentium chips").

It's simply not the case that most bugs or even, as a fraction of bugs, almost any bugs are due to programmers rewriting existing code to run on new CPUs. If you really squint, you can see things like Android devices having lots of security bugs due to the difficulty of updating Android and backporting changes to older hardware, but those kinds of bugs are both a small fraction of all bugs and not really what Dixon was talking about.

Similarly, on how computer backups will be done in the future, Dixon basically correctly says that home workers will be vulnerable to data loss and that people who are serious about saving data will back up data online ("back up data on-line to computers in other cities as the ultimate security").

But Dixon's stated reason for this is that workstations already have large disk capacity (>= 2GB) and floppy disks haven't kept up (< 2MB), so it would take thousands of floppy disks to do backups, which is clearly absurd. However, even at the time, Zip drives (100MB per portable disk) were common and, although it didn't take off, the same company that made Zip drives also made 1GB "Jaz" drives. And, of course, tape backup was also used at the time and is still used today. This trend has continued to this day; large, portable, disks are available, and quite a few people I know transfer or back up large amounts of data on portable disks. The reason most people don't do disk/tape backups isn't that it would require thousands of disks to backup a local computer (if you look at the computers people typically use at home, most people could back up their data onto a single portable disk per failure domain and even keep multiple versions on one disk), but that online/cloud backups are more convenient.
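As a rough back-of-the-envelope check on the "thousands of floppy disks" framing, here's a minimal sketch using the era's removable media; the 2 GB figure is Dixon's own workstation example from above, and the media capacities are standard nominal numbers used purely for illustration:

```python
import math

# Back-of-the-envelope: how many removable disks a full backup of a
# late-90s ~2 GB workstation drive would take. Capacities in MB are
# nominal figures, used purely for illustration.
drive_to_back_up_mb = 2 * 1024  # Dixon's ">= 2GB" example

media_capacities_mb = {
    "1.44 MB floppy": 1.44,
    "100 MB Zip disk": 100,
    "1 GB Jaz cartridge": 1024,
}

for name, capacity_mb in media_capacities_mb.items():
    disks_needed = math.ceil(drive_to_back_up_mb / capacity_mb)
    print(f"{name}: {disks_needed} needed for a full backup")
```

The floppy count really is in the thousands, but, as noted above, the conclusion doesn't follow once you consider the removable media that was actually in use for backups at the time.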

Since Dixon's reasoning was incorrect (at least in the cases where I'm close enough to the topic to understand how applicable the reasoning was), it seems that when Dixon is correct, it can't be for the stated reason and Dixon is either correct by coincidence or because he's looking at the broader trend and came up with an incorrect rationalization for the prediction. But, per the above, it's very difficult to actually correctly predict the growth rate of a trend over time, so without some understanding of the mechanics in play, one could also say that a prediction that comes true based on some rough trend is also correct by coincidence.

Alvin Toffler / Heidi Toffler

Like most others on this list, Toffler claims some big prediction wins:

The Tofflers claimed on their website to have foretold the breakup of the Soviet Union, the reunification of Germany and the rise of the Asia-Pacific region. He said in the People’s Daily interview that “Future Shock” envisioned cable television, video recording, virtual reality and smaller U.S. families.

In this post, we'll look at Future Shock, Toffler's most famous work, written in 1970.

According to a number of sources, Alvin Toffler's major works were co-authored by Heidi Toffler. In the books themselves, Heidi Toffler is acknowledged as someone who helped out a lot, but not as an author, despite the remarks elsewhere about co-authorship. In this section, I'm going to refer to Toffler in the singular, but you may want to mentally substitute the plural.

Toffler claims that we should understand the present not only by understanding the past, but also by understanding the future:

Previously, men studied the past to shed light on the present. I have turned the time-mirror around, convinced that a coherent image of the future can also shower us with valuable insights into today. We shall find it increasingly difficult to understand our personal and public problems without making use of the future as an intellectual tool. In the pages ahead, I deliberately exploit this tool to show what it can do.

Toffler generally makes vague, wishy-washy statements, so it's not really reasonable to score Toffler because so few concrete predictions are given. However, Toffler very strongly implies that past exponential trends are expected to continue or even accelerate and that the very rapid change caused by this is going to give rise to "future shock", hence the book's title:

I coined the term "future shock" to describe the shattering stress and disorientation that we induce in individuals by subjecting them to too much change in too short a time. Fascinated by this concept, I spent the next five years visiting scores of universities, research centers, laboratories, and government agencies, reading countless articles and scientific papers and interviewing literally hundreds of experts on different aspects of change, coping behavior, and the future. Nobel prizewinners, hippies, psychiatrists, physicians, businessmen, professional futurists, philosophers, and educators gave voice to their concern over change, their anxieties about adaptation, their fears about the future. I came away from this experience with two disturbing convictions. First, it became clear that future shock is no longer a distantly potential danger, but a real sickness from which increasingly large numbers already suffer. This psycho-biological condition can be described in medical and psychiatric terms. It is the disease of change .. Earnest intellectuals talk bravely about "educating for change" or "preparing people for the future." But we know virtually nothing about how to do it ... The purpose of this book, therefore, is to help us come to terms with the future— to help us cope more effectively with both personal and social change by deepening our understanding of how men respond to it

The big hammer that Toffler uses everywhere is extrapolation of exponential growth, with the implication that this is expected to continue. On the general concept of extrapolating out from curves, Toffler's position is very similar to Kurzweil's: if you can see a trend on a graph, you can use that to predict the future, and the ability of technology to accelerate the development of new technology will cause innovation to happen even more rapidly than you might naively expect:

Plotted on a graph, the line representing progress in the past generation would leap vertically off the page. Whether we examine distances traveled, altitudes reached, minerals mined, or explosive power harnessed, the same accelerative trend is obvious. The pattern, here and in a thousand other statistical series, is absolutely clear and unmistakable. Millennia or centuries go by, and then, in our own times, a sudden bursting of the limits, a fantastic spurt forward. The reason for this is that technology feeds on itself. Technology makes more technology possible, as we can see if we look for a moment at the process of innovation. Technological innovation consists of three stages, linked together into a self-reinforcing cycle. ... Today there is evidence that the time between each of the steps in this cycle has been shortened. Thus it is not merely true, as frequently noted, that 90 percent of all the scientists who ever lived are now alive, and that new scientific discoveries are being made every day. These new ideas are put to work much more quickly than ever before.

The first N major examples of this from the book are:

As we just noted above, when discussing Dixon, Kurzweil, etc., predicting the future by extrapolating out exponential growth is fraught. Toffler somehow manages to pull off the anti-predictive feat of naming a bunch of trends which were about to stop, some for which the writing was already on the wall when Toffler was writing.

Toffler then extrapolates from the above and predicts that the half-life of everything will become shorter, which will overturn how society operates in a variety of ways.

For example, companies and governments will replace bureaucracies with "adhocracies" sometime between 1995 and 2020. The concern that people will feel like cogs as companies grow larger is obsolete because, in adhocracy, the entire concept of top-down command and control will disappear, obsoleted by the increased pace of everything. While it's true that some companies have less top-down direction than would've been expected in Toffler's time, many also have more, enabled by technology that lets employers keep stricter tabs on employees than ever before, making people more of a cog than ever before.

Another example is that Toffler predicted human colonization of the Ocean, "The New Atlantis", "long before the arrival of A.D. 2000".

Fabian Giesen points out that, independent of the accuracy of Toffler's predictions, Venkatesh Rao's Welcome to the Future Nauseous explains why "future shock" didn't happen in areas of very rapid technological development.

People from the Wikipedia list who weren't included

Steve Yegge

As I mentioned at the start, none of the futurists from Wikipedia's list had very accurate predictions, so we're going to look at a couple other people from other sources who aren't generally considered futurists to see how they rank.

We previously looked at Yegge's predictions here, which were written in 2004 and were generally about the next 5-10 years, with some further out. There were nine predictions (technically ten, but one isn't really a prediction). If grading them as written, which is how futurists have been scored, I would rank these at 4.5/9, or about 50%.

You might argue that this is unfair because Yegge was predicting the relatively near future, but if we look at relatively near future predictions from futurists, their accuracy rate is generally nowhere near 50%, so I don't think it's unfair to compare the number in some way.

If you want to score these like people often score futurists, where they get credit for essentially getting things directionally correct, then I'd say that Yegge's score should be between 7/9 and 8/9, depending on how much partial credit he gets for one of the questions.

If you want to take a more holistic "what would the world look like if Yegge's vision were correct vs. the world we're in today" view, I think Yegge also does quite well there, with the big miss being that Lisp-based languages have not taken over the world, the success of Clojure notwithstanding. This is quite different from the futurists here, who generally had a vision of many giant changes that didn't come to pass. For example, in Kurzweil's vision of the world, by 2010 we would've had self-driving cars, a "cure" for paraplegia, and widespread use of AR; by 2011 we would have unbounded life expectancy; and by 2019 we would have pervasive use of nanotechnology (including computers having switched from transistors to nanotubes), effective "mitigations" for blindness and deafness, fairly widely deployed fully realistic VR that can simulate sex via realistic full-body stimulation, pervasive self-driving cars (predicted again), entirely new fields of art and music, etc., along with everything those things imply, which is a very different world than the one we actually live in.

And we see something similar if we look at other futurists, who predicted things like living underground, living under the ocean, etc.; most predicted many revolutionary changes that would really change society, a few of which came to pass. Yegge, instead, predicted quite a few moderate changes (as well as some places where change would be slower than a lot of people expected) and changes were slower than he expected in the areas he predicted, but only by a bit.

Yegge described his methodology for the post above as:

If you read a lot, you'll start to spot trends and undercurrents. You might see people talking more often about some theme or technology that you think is about to take off, or you'll just sense vaguely that some sort of tipping point is occurring in the industry. Or in your company, for that matter.

I seem to have many of my best insights as I'm writing about stuff I already know. It occurred to me that writing about trends that seem obvious and inevitable might help me surface a few not-so-obvious ones. So I decided to make some random predictions based on trends I've noticed, and see what turns up. It's basically a mental exercise in mining for insights

In this essay I'll make ten predictions based on undercurrents I've felt while reading techie stuff this year. As I write this paragraph, I have no idea yet what my ten predictions will be, except for the first one. It's an easy, obvious prediction, just to kick-start the creative thought process. Then I'll just throw out nine more, as they occur to me, and I'll try to justify them even if they sound crazy.

He's not really trying to generate the best predictions, but still did pretty well by relying on his domain knowledge plus some intuition about what he's seen.

In the post about Yegge's predictions, we also noted that he's made quite a few successful predictions outside of his predictions post:

Steve also has a number of posts that aren't explicitly about predictions that, nevertheless, make pretty solid predictions about how things are today, written way back in 2004. There's It's Not Software, which was years ahead of its time about how people write “software”, how writing server apps is really different from writing shrinkwrap software in a way that obsoletes a lot of previously solid advice, like Joel's dictum against rewrites, as well as how service oriented architectures look; the Google at Delphi (again from 2004) correctly predicts the importance of ML and AI as well as Google's very heavy investment in ML; an old interview where he predicts "web application programming is gradually going to become the most important client-side programming out there. I think it will mostly obsolete all other client-side toolkits: GTK, Java Swing/SWT, Qt, and of course all the platform-specific ones like Cocoa and Win32/MFC/"; etc. A number of Steve's internal Google blog posts also make interesting predictions, but AFAIK those are confidential.

Quite a few of Yegge's predictions would've been considered fairly non-obvious at the time and he seemed to still have a fairly good success rate on his other predictions (although I didn't try to comprehensively find them and score them, I sampled some of his old posts and found the overall success rate to be similar to the ones in his predictions post).

With Yegge and the other predictors who were picked so that we can look at some accurate predictions, there is, of course, a concern that there's survivorship bias in picking these predictors. I suspect that's not the case for Yegge because he continued to be accurate after I first noticed that he seemed to have accurate predictions, so it's not just that I picked someone who had a lucky streak after the fact. Also, especially in some of his Google-internal G+ comments, he made fairly high-dimension comments that ended up being right for the reasons he suggested, which provides a lot more information about how accurate his reasoning was than simply winning a bunch of coin flips in a row. This comment about depth of reasoning doesn't apply to Caplan, below, because I haven't evaluated Caplan's reasoning, but it does apply to MS leadership circa 1990.

Bryan Caplan

Bryan Caplan reports that his track record is 23/23 = 100%. He is much more precise in specifying his predictions than anyone else we've looked at and tries to give a precise bet that will be trivial to adjudicate as well as betting odds.

Caplan started making predictions/bets around the time the concept that "betting is a tax on bullshit" became popular (the idea being that a lot of people are willing to say anything but will quiet down if asked to make a real bet, and those that don't will pay a real cost if they make bad real bets). Caplan's strategy seems to be to act as the tax man on bullshit, in that he generally takes the safe side of bets that people probably shouldn't have made. Andrew Gelman says:

Caplan’s bets are an interesting mix. The first one is a bet where he offered 1-to-100 odds so it’s no big surprise that he won, but most of them are at even odds. A couple of them he got lucky on (for example, he bet in 2008 that no large country would leave the European Union before January 1, 2020, so he just survived by one month on that one), but, hey, it’s ok to be lucky, and in any case even if he only had won 21 out of 23 bets, that would still be impressive.

It seems to me that Caplan’s trick here is to show good judgment on what pitches to swing at. People come at him with some strong, unrealistic opinions, and he’s been good at crystallizing these into bets. In poker terms, he waits till he has the nuts, or nearly so. 23 out of 23 . . . that’s a great record.

I think there's significant value in doing this, both in the general "betting is a tax on bullshit" sense as well as, more specifically, if you have high belief that someone is trying to take the other side of bad bets and has good judgment, knowing that the Caplan-esque bettor has taken the position gives you decent signal about the bet even if you have no particular expertise in the subject. For example, if you look at my bets, even though I sometimes take bets against obviously wrong positions, I much more frequently take bets I have a very good chance of losing, so just knowing that I took a bet provides much less information than knowing that Caplan took a bet.

But, of course, taking Caplan's side of a bet isn't foolproof. As Gelman noted, Caplan got lucky at least once, and Caplan also seems likely to lose the Caplan and Tabarrok v. Bauman bet on global temperature. For that particular bet, you could also make the case that he's expected to lose since he took the bet with 3:1 odds, but a lot of people would argue that 3:1 isn't nearly long enough odds to take that bet.

The methodology that Caplan has used to date will never result in a positive prediction of some big change until the change is very likely to happen, so this methodology can't really give you a vision of what the future will look like in the way that Yegge or Gates or another relatively accurate predictor who takes wilder bets could.

Bill Gates / Nathan Myhrvold / MS leadership circa 1990 to 1997

A handful of memos that were released to the world due to the case against Microsoft laid out the vision Microsoft executives had about how the world would develop, with or without Microsoft's involvement. These memos don't lay out concrete predictions with timelines and therefore can't be scored in the same way futurist predictions were scored in this post. If rating these predictions on how accurate their vision of the future was, I'd rate them similarly to Steve Yegge (who scored 7/9 or 8/9), but the predictions were significantly more ambitious, so they seem much more impressive when controlling for the scope of the predictions.

Compared to the futurists we discussed, there are multiple ways in which the predictions are much more detailed (and therefore more impressive for a given level of accuracy on top of being more accurate). One is that MS execs had a much deeper understanding of the things under discussion and how they impact each other. "Our" futurists often discuss things at a high level and, when they discuss things in detail, they make statements that make it clear that they don't really understand the topic and often don't really know what the words they're writing mean. MS execs of the era pretty clearly had a deep understanding of the issues in play, which let them make detailed predictions that our futurists wouldn't make, e.g., that while protocols like FTP and IRC would continue to be used, the near future of the internet was HTTP over TCP, and the browser would become a "platform" in the same way that Windows is a "platform", one that's much more important and larger than any OS (unless Microsoft was successful in taking action to stop this from coming to pass, which it was not, despite MS execs foreseeing the exact mechanisms that could cause MS to fail to own the internet). MS execs used this level of understanding to make predictions about the kinds of larger things that our futurists discuss, e.g., the nature of work and how it would change.

Actually having an understanding of the issues in play and not just operating with a typical futurist buzzword level understanding of the topics allowed MS leadership to make fairly good guesses about what the future would look like.

For a fun story about how much effort Gates spent on understanding what was going on, see this story by Joel Spolsky on his first Bill Gates review:

Bill turned to me.

I noticed that there were comments in the margins of my spec. He had read the first page!

He had read the first page of my spec and written little notes in the margin!

Considering that we only got him the spec about 24 hours earlier, he must have read it the night before.

He was asking questions. I was answering them. They were pretty easy, but I can’t for the life of me remember what they were, because I couldn’t stop noticing that he was flipping through the spec…

He was flipping through the spec! [Calm down, what are you a little girl?]

… [ed: ellipses are from the original doc] and THERE WERE NOTES IN ALL THE MARGINS. ON EVERY PAGE OF THE SPEC. HE HAD READ THE WHOLE GODDAMNED THING AND WRITTEN NOTES IN THE MARGINS.

He Read The Whole Thing! [OMG SQUEEE!]

The questions got harder and more detailed.

They seemed a little bit random. By now I was used to thinking of Bill as my buddy. He’s a nice guy! He read my spec! He probably just wants to ask me a few questions about the comments in the margins! I’ll open a bug in the bug tracker for each of his comments and makes sure it gets addressed, pronto!

Finally the killer question.

“I don’t know, you guys,” Bill said, “Is anyone really looking into all the details of how to do this? Like, all those date and time functions. Excel has so many date and time functions. Is Basic going to have the same functions? Will they all work the same way?”

“Yes,” I said, “except for January and February, 1900.”

Silence. ... “OK. Well, good work,” said Bill. He took his marked up copy of the spec ... and left

Gates (and some other MS execs) were very well informed about what was going on to a fairly high level of detail considering all of the big picture concerns they also had in mind.

A topic for another post is how MS leadership had a more effective vision for the future than leadership at old-line competitors (Novell, IBM, AT&T, Yahoo, Sun, etc.) and how this resulted in MS turning into a $2T company while their competitors became, at best, irrelevant and most didn't even succeed at becoming irrelevant and ceased to exist. Reading through old MS memos, it's clear that MS really kept tabs on what competitors were doing and they were often surprised at how ineffective leadership was at their competitors, e.g., on Novell, Bill Gates says "Our traditional competitors are just getting involved with the Internet. Novell is surprisingly absent given the importance of networking to their position"; Gates noted that Frankenberg, then-CEO of Novell, seemed to understand the importance of the internet, but Frankenberg only joined Novell in 1994 and left in 1996 and spent much of his time at Novell reversing the direction the company had taken under Noorda, which didn't leave Novell with a coherent position or plan when Frankenberg "resigned" two years into the pivot he was leading.

In many ways, a discussion of what tech execs at the time thought the future would look like and what paths would lead to success is more interesting than looking at futurists who basically don't understand the topics they're talking about, but I started this post to look at how well futurists understood the topics they discussed and I didn't know, in advance, that their understanding of the topics they discuss and resultant prediction accuracy would be so poor.

Common sources of futurist errors

Not learning from mistakes

The futurists we looked at in this post tend to rate themselves quite highly and, after the fact, generally claim credit for being great predictors of the future, so much so that they'll even tell you how you can predict the future accurately. And yet, after scoring them, the most accurate futurist (among the ones who made concrete enough predictions that they could be scored) came in at 10% accuracy with generous grading that gave them credit for making predictions that accidentally turned out to be correct when they mispredicted the mechanism by which the prediction would come to pass (a strict reading of many of their predictions would reduce the accuracy because they said that the prediction would happen because of their predicted mechanism, which is false, rendering the prediction false).

There are two tricks that these futurists have used to be able to make such lofty claims. First, many of them make vague predictions and then claim credit if anything vaguely resembling the prediction comes to pass. Second, almost all of them make a lot of predictions and then only tally up the ones that came to pass. One way to look at a 4% accuracy rate is that you really shouldn't rely on that person's predictions. Another way is that, if they made 500 predictions, they're a great predictor because they made 20 accurate predictions. Since almost no one will bother to go through a list of predictions to figure out the overall accuracy when someone does the latter, making a huge number of predictions and then cherry picking the ones that were accurate is a good strategy for becoming a renowned futurist.

But if we want to figure out how to make accurate predictions, we'll have to look at other people's strategies. There are people who do make fairly good, generally directionally accurate, predictions, as we noted when we looked at Steve Yegge's prediction record. However, they tend to be harsh critics of their predictions, as Steve Yegge was when he reviewed his own prediction record, saying:

I saw the HN thread about Dan Luu's review of this post, and felt people were a little too generous with the scoring.

It's unsurprising that a relatively good predictor of the future scored himself lower than I did because taking a critical eye to your own mistakes and calling yourself out for mistakes that are too small for most people to care about is a great way to improve. We can see this in communications from Microsoft leadership as well, e.g., calling themselves out for failing to predict that a lack of backwards compatibility would doom major efforts like OS/2 and LanMan. Doing what most futurists do and focusing on the predictions that worked out without looking at what went wrong isn't such a great way to improve.

Cocktail party understanding

Another thing we see among people who make generally directionally correct predictions, as in the Steve Yegge post mentioned above, Nathan Myhrvold's 1993 "Road Kill on the Information Highway", Bill Gates's 1995 "The Internet Tidal Wave", etc., is that the person making the prediction actually understands the topic. In all of the above examples, it's clear that the author of the document has a fairly strong technical understanding of the topics being predicted and, in the general case, it seems that people who have relatively accurate predictions are really trying to understand the topic, which is in stark contrast to the futurists discussed in this post, almost all of whom display clear signs of having a buzzword level understanding2 of the topics they're discussing.

There's a sense in which it isn't too difficult to make correct predictions if you understand the topic and have access to the right data. Before joining a huge megacorp and then watching the future unfold, I thought documents like "Road Kill on the Information Highway" and "The Internet Tidal Wave" were eerily prescient, but once I joined Google in 2013, a lot of trends that weren't obvious from the outside seemed fairly obvious from the inside.

For example, it was obvious that mobile was very important for most classes of applications, so much so that most applications that were going to be successful would be "mobile first" applications where the web app was secondary, if it existed at all, and from the data available internally, this should've been obvious going back at least to 2010. Looking at what people were doing on the outside, quite a few startups in areas where mobile was critical were operating with a 2009 understanding of the future even as late as 2016 and 2017, where they focused on having a web app first and had no mobile app and a web app that was unusable on mobile. Another example of this is that, in 2012, quite a few people at Google independently wanted Google to make very large bets on deep learning. It seemed very obvious that deep learning was going to be a really big deal and that it was worth making a billion dollar investment in a portfolio of hardware that would accelerate Google's deep learning efforts.

This isn't to say that the problem is trivial — many people with access to the same data still generally make incorrect predictions. A famous example is Ballmer's prediction that "There’s no chance that the iPhone is going to get any significant market share. No chance."3 Ballmer and other MS leadership had access to information as good as MS leadership from a decade earlier, but many of their predictions were no better than the futurists we discussed here. And with the deep learning example above, a competitor with the same information at Google totally whiffed and kept whiffing for years, even with the benefit of years of extra information; they're still well behind Google now, a decade later, due to their failure to understand how to enable effective, practical, deep learning R&D.

Assuming high certainty

Another common cause of incorrect predictions was having high certainty. That's a general problem that's magnified when making predictions from looking at past exponential growth and extrapolating to the future both because mispredicting the timing of a large change in exponential growth can have a very large impact and also because relatively small sustained changes in exponential growth can also have a large impact. An example that exposed these weaknesses for a large fraction of our futurists was their interpretation of Moore's law, which many interpreted as a doubling of every good thing and/or halving of every bad thing related to computers every 18 months. That was never what Moore's law predicted in the first place, but it was a common pop-conception of Moore's law. One thing that's illustrative about that is that predictors who were writing in the late 90s and early 00s still made these fantastical Moore's law "based" predictions even though it was such common knowledge that both single-threaded computer performance and Moore's law would face significant challenges that this was taught in undergraduate classes at the time. Any futurist who spent a few minutes talking to an expert in the area or even an undergrad would've seen that there's a high degree of uncertainty about computer performance scaling, but most of the futurists we discuss either don't do that or ignore evidence that would add uncertainty to their narrative4.

One example of this pop-conception of Moore's law, from the documentation for the Flare programming language:

As computing power increases, all constant-factor inefficiencies ("uses twice as much RAM", "takes three times as many RISC operations") tend to be ground under the heel of Moore's Law, leaving polynomial and exponentially increasing costs as the sole legitimate areas of concern. Flare, then, is willing to accept any O(C) inefficiency (single, one-time cost), and is willing to accept most O(N) inefficiencies (constant-factor costs), because neither of these costs impacts scalability; Flare programs and program spaces can grow without such costs increasing in relative significance. You can throw hardware at an O(N) problem as N increases; throwing hardware at an O(N**2) problem rapidly becomes prohibitively expensive.

For computer scaling in particular, it would've been possible to make a reasonable prediction about 2022 computers in, say, 2000, but it would've had to have been a prediction about the distribution of outcomes which put a lot of weight on severely reduced performance gains in the future with some weight on a portfolio of possibilities that could've resulted in continued large gains. Someone making such a prediction would've had to, implicitly or explicitly, be familiar with the ITRS semiconductor scaling roadmaps of the era as well as the causes of recent misses (my recollection from reading roadmaps back then was that, in the short term, companies had actually exceeded recent scaling predictions, but via mechanisms that were not expected to be scalable into the future) as well as things that could unexpectedly keep semiconductor scaling on track. Furthermore, such a predictor would also have had to be able to evaluate architectural ideas that might have panned out, in order to rule them out or assign them a low probability, such as dataflow processors, the basket of techniques people were working on to increase ILP in an attempt to move from the regime Tjaden and Flynn discussed in their classic 1970 and 1973 papers on ILP to something closer to the bound discussed by Riseman and Foster in 1972 and later by Nicolau and Fisher in 1984, etc.

Such a prediction would be painstaking work for someone who isn't in the field because of the sheer number of different things that could have impacted computer scaling. Instead of doing this, futurists relied heavily on the pop-understanding they had about semiconductors. Kaku was notable among futurists under discussion for taking seriously the idea that Moore's law wouldn't be smooth sailing in the future, but he incorrectly decided on when UV/EUV would run out of steam and also incorrectly had high certainty that some kind of more "quantum" technology would save computer performance scaling. Most other futurists who discussed computers used a line of reasoning like Kurzweil's, who said that we can predict what will happen with "remarkable precision" due to the existence of "well-defined indexes":

The law of accelerating returns applies to all of technology, indeed to any evolutionary process. It can be charted with remarkable precision in information-based technologies because we have well-defined indexes (for example, calculations per second per dollar, or calculations per second per gram) to measure them

Another thing to note here is that, even if you correctly predict an exponential curve of something, understanding the implications of that precise fact also requires an understanding of the big picture, which was shown by people like Yegge, Gates, and Myhrvold but not by the futurists discussed here. An example of roughly getting a scaling curve right but mispredicting the outcome was Dixon on the number of phone lines people would have in their homes. Dixon at least roughly correctly predicted the declining cost of phone lines but incorrectly predicted that this would result in people having many phone lines in their house, despite also believing that digital technologies and cell phones would have much faster uptake than they did. With respect to phones, another missed prediction, one that came from not having an understanding of the mechanism, was his prediction that the falling cost of phone calls would mean that tracking phone calls would be so expensive relative to the cost of calls that phone companies wouldn't track individual calls.

For someone who has a bit of understanding of the underlying technology, this is an odd prediction. One reason the prediction seems odd is that the absolute cost of tracking who called whom is very small and the rate at which humans make and receive phone calls is bounded at a relatively low rate, so even if the cost of metadata tracking were very high compared to the cost of the calls themselves, the absolute cost of tracking metadata would still be very low. Another way to look at it would be to compare the number of bits of information transferred during a phone call to the number of bits of information necessary to store call metadata, plus the cost of storing that metadata long enough to bill someone on a per-call basis. Unless medium-term storage became more expensive than network transfer by a mind-bogglingly large factor, it wouldn't be possible for this prediction to be true. Dixon also implicitly predicted exponentially falling storage costs via his predictions on the size of available computer storage, with a steep enough curve that this criterion shouldn't be satisfied; and even if it were somehow satisfied, the cost of storage would still be so low as to be negligible.
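As a rough illustration of that comparison, here's a minimal sketch; the numbers are assumptions chosen for the sake of the example (a standard 64 kbps uncompressed voice codec and a generous guess at per-call metadata size), not figures from Dixon or from any phone company:

```python
# Rough comparison of the data carried by a phone call vs. the metadata
# needed to bill it per-call. All numbers are illustrative assumptions.

call_minutes = 5
codec_bits_per_second = 64_000  # standard uncompressed PCM voice (G.711)
call_bytes = call_minutes * 60 * codec_bits_per_second / 8

metadata_bytes = 200  # caller, callee, start time, duration, etc. (generous guess)

print(f"voice data transferred: {call_bytes / 1e6:.1f} MB")
print(f"billing metadata:       {metadata_bytes} bytes")
print(f"ratio:                  ~{call_bytes / metadata_bytes:,.0f}:1")
```

With these assumptions, the call itself moves roughly four orders of magnitude more data than the metadata needed to bill it, so even if storing a byte of metadata for a billing cycle cost far more than transmitting a byte of voice, per-call tracking would still be a rounding error on the cost of the call.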

Panacea thinking

Another common issue is what Waleed Khan calls panacea thinking, where the person assumes that the solution is a panacea that is basically unboundedly great and can solve all problems. We can see this for quite a few futurists who were writing up until the 70s, where many assumed that computers would be able to solve any problem that required thought, computation, or allocation of resources and that resource scarcity would become irrelevant. But it turns out that quite a few problems don't magically get solved because powerful computers exist. For example, the 2008 housing crash created a shortfall of labor for housing construction that only barely got back to historical levels just before covid hit. Having fast computers neither prevented this nor fixed this problem after it happened because the cause of the problem wasn't a shortfall of computational resources. Some other topics to get this treatment are "nanotechnology", "quantum", "accelerating growth" / "decreased development time", etc.

A closely related issue that almost every futurist here fell prey to is only seeing the upside of technological advancements, resulting in a kind of techno utopian view of the future. For example, in 2005, Kurzweil wrote:

The current disadvantages of Web-based commerce (for example, limitations in the ability to directly interact with products and the frequent frustrations of interacting with inflexible menus and forms instead of human personnel) will gradually dissolve as the trends move robustly in favor of the electronic world. By the end of this decade, computers will disappear as distinct physical objects, with displays built in our eyeglasses, and electronics woven in our clothing, providing full-immersion visual virtual reality. Thus, "going to a Web site" will mean entering a virtual-reality environment—at least for the visual and auditory senses—where we can directly interact with products and people, both real and simulated.

Putting aside the bit about how non-VR interfaces to computers would disappear before 2010, it's striking how Kurzweil assumes that technological advancement will mean that corporations make experiences better for consumers instead of providing the same level of experience at a lower cost or a worse experience at an even lower cost.5

Although that example is from Kurzweil, we can see the same techno utopianism in the other authors on Wikipedia's list with the exception of Zerzan, whose predictions I didn't tally up because prediction wasn't really his shtick. For example, a number of other futurists combined panacea thinking with techno utopianism to predict that computers would cause things to operate with basically perfect efficiency without human intervention, allowing people at large to live a life of leisure. Instead, the benefits to the median person in the U.S. are subtle enough that people debate whether or not life has improved at all for the median person. And on the topic of increased efficiency, a number of people predicted an extreme version of just-in-time delivery that humanity hasn't even come close to achieving and described its upsides, but no futurist under discussion mentioned the downsides of a world-wide distributed just-in-time manufacturing system and supply chain, which include increased fragility and decreased robustness; these downsides notably impacted quite a few industries from 2020 through at least 2022 due to covid, despite the worldwide system not being anywhere near as just-in-time or fragile as a number of futurists predicted.

Though not discussed here because they weren't on Wikipedia's list of notable futurists, there are pessimistic futurists such as Jaron Lanier and Paul Ehrlich. From a quick informal look at relatively well-known pessimistic futurists, it seems that pessimistic futurists haven't been more accurate than optimistic futurists. Many made predictions that were too vague to score and the ones who didn't tended to predict catastrophic collapse or overly dystopian futures which haven't materialized. Fundamentally, dystopian thinkers made the same mistakes as utopian thinkers. For example, Paul Ehrlich fell prey to the same issues utopian thinkers fell prey to and he still maintains that his discredited book, The Population Bomb, was fundamentally correct, just like utopian futurists who maintain that their discredited work is fundamentally correct.

Ehrlich's 1968 book opened with

The battle to feed all of humanity is over. In the 1970s the world will undergo famines — hundreds of millions of people are going to starve to death in spite of any crash programs embarked upon now. At this late date nothing can prevent a substantial increase in the world death rate, although many lives could be saved through dramatic programs to "stretch" the carrying capacity of the earth by increasing food production. But these programs will only provide a stay of execution unless they are accompanied by determined and successful efforts at population control. Population control is the conscious regulation of the numbers of human beings to meet the needs, not just of individual families, but of society as a whole.

Nothing could be more misleading to our children than our present affluent society. They will inherit a totally different world, a world in which the standards, politics, and economics of the 1960s are dead.

When this didn't come to pass, he did the same thing as many futurists we looked at and moved the dates on his prediction, changing the text in the opening of his book from "1970s" to "1970s and 1980s". Ehrlich then wrote a new book with even more dire predictions in 1990.

And then later, Ehrlich simply denied ever having made predictions, even though anyone who reads his book can plainly see that he makes plenty of statements about the future with no caveats about the statements being hypothetical:

Anne and I have always followed UN population projections as modified by the Population Reference Bureau — so we never made "predictions," even though idiots think we have.

Unfortunately for pessimists, simply swapping the sign bit on panacea thinking doesn't make predictions more accurate.

Evidence-free assumptions

Another major source of errors among these futurists was making an instrumental assumption without any supporting evidence for it. A major example of this is Fresco's theory that you can predict the future by starting from people's values and working back from there, but he doesn't seriously engage with the idea of how people's values can be predicted. Since those are pulled from his intuition without being grounded in evidence, starting from people's values creates a level of indirection, but doesn't fundamentally change the problem of predicting what will happen in the future.

Fin

A goal of this project is to look at current predictors to see who's using methods that have historically had a decent accuracy rate, but we're going to save that for a future post. I normally don't like splitting posts up into multiple parts, but since this post is 30k words (the number of words in a small book, and more words than most pop-sci books have once you remove the pop stories) and evaluating futurists is relatively self-contained, we're going to stop with that (well, with a bit of an evaluation of some longtermist analyses that overlap with this post in the appendix)6.

In terms of concrete takeaways, you could consider this post a kind of negative result that supports the very boring idea that you're not going to get very far if you make predictions on topics you don't understand, whereas you might be able to make decent predictions if you have (or gain) a deep expertise of a topic and apply well-honed intuition to predict what might happen. We've looked at, in some detail, a number of common reasoning errors that cause predictions to miss at a high rate and also taken a bit of a look into some things that have worked for creating relatively accurate predictions.

A major caveat about what's worked is that while using high-level techniques that work poorly is a good way to generate poor predictions, using high-level techniques that work well doesn't mean much because the devil is in the details and, as trite as this is to say, you really need to think about things. This is something that people who are serious about looking at data often preach, e.g., you'll see this theme come up on Andrew Gelman's blog as well as in Richard McElreath's Statistical Rethinking. McElreath, in a lecture targeted at social science grad students who don't have a quantitative background, likens statistical methods to a golem. A golem will mindlessly do what you tell it to do, just like statistical techniques. There's no substitute for using your brain to think through whether or not it's reasonable to apply a particular statistical technique in a certain way. People often seem to want to use methods as a talisman to ward off incorrectness, but that doesn't work.

We see this in the longtermist analyses we examine in the appendix, which claim to be more accurate than "classical" futurist analyses because they, among other techniques, state probabilities, which the literature on forecasting (e.g., Tetlock's Superforecasting) says that one should do. But the analyses fundamentally use the same techniques as the futurist analyses we looked at here and then add a few things on top that are also things that people who make accurate predictions do. This is backwards. Things like probabilities need to be a core part of modelling, not something added afterwards. This kind of backwards reasoning is a common error when doing data analysis and I would caution readers who think they're safe against errors because their analyses can, at a high level, be described roughly similarly to good analyses7. An obvious example of this would be the Bill Gates review we looked at. Gates asked a lot of questions and scribbled quite a few notes in the margins, but asking a lot of questions and scribbling notes in the margins of docs doesn't automatically cause you to have a good understanding of the situation. This example is so absurd that I don't think anyone even remotely reasonable would question it, but most analyses I see (of the present as well as of the future) make this fundamental error in one way or another and, as Fabian Giesen might say, are cosplaying what a rigorous analysis looks like.

Thanks to nostalgebraist, Arb Research (Misha Yagudin, Gavin Leech), Laurie Tratt, Fabian Giesen, David Turner, Yossi Kreinin, Catherine Olsson, Tim Pote, David Crawshaw, Jesse Luehrs, @TyphonBaalAmmon, Jamie Brandon, Tao L., Hillel Wayne, Qualadore Qualadore, Sophia, Justin Blank, Milosz Danczak, Waleed Khan, Mindy Preston, @ESRogs, Tim Rice, and @s__video for comments/corrections/discussion (and probably some others I forgot because this post is so long and I've gotten so many comments).

Update / correction: an earlier version of this post contained this error, pointed out by ESRogs. Although I don't believe the error impacts the conclusion, I consider it a fairly major error. If we were doing a tech-company style postmortem, the fact that it doesn't significantly impact the conclusion would be included in the "How We Got Lucky" section of the postmortem. In particular, this was a "lucky" error because it was made when picking out a few examples from a large portfolio of errors to illustrate one predictor's errors, so a single incorrectly identified error doesn't change the conclusion since another error could be substituted in and, even if no other error were substituted, the quality of the reasoning being evaluated still looks quite low. But incorrectly concluding that something is an error could lead to a different conclusion in the case of a predictor who made few or no errors, which is why this was a lucky mistake for me to make.

Appendix: brief notes on Superforecasting

See also, this Tetlock interview with Tyler Cowen if you don't want to read the whole book, although the book is a very quick read because it's written in the standard pop-sci style, with a lot of anecdotes/stories.

On the people we looked at vs. the people Tetlock looked at: the predictors we examined are operating in a very different style from the folks studied in the research that led to the Superforecasting book. Both futurists and tech leaders were trying to predict a vision for the future whereas superforecasters were asked to answer very specific questions.

Another major difference is that the accurate predictors we looked at (other than Caplan) had very deep expertise in their fields. This may be one reason for the difference in timelines here, where it appears that some of our predictors can predict things more than 3-5 years out, contra Tetlock's assertion. Another difference is in the kind of thing being predicted — a lot of the predictions we're looking at here are fundamentally whether or not a trend will continue or if a nascent trend will become a long-running trend, which seems easier than a lot of the questions Tetlock had his forecasters try to answer. For example, in the opening of Superforecasting, Tetlock gives predicting the Arab Spring as an example of something that would've been practically impossible — while the conditions for it had been there for years, the proximal cause of the Arab Spring was a series of coincidences that would've been impossible to predict. This is quite different from and arguably much more difficult than someone in 1980 guessing that computers will continue to get smaller and faster, leading to handheld computers more powerful than supercomputers from the 80s.

Appendix: other evaluations

Of the evaluations above, the only intersection with the futurists evaluated here is Kurzweil. Holden Karnofsky says:

A 2013 project assessed Ray Kurzweil's 1999 predictions about 2009, and a 2020 followup assessed his 1999 predictions about 2019. Kurzweil is known for being interesting at the time rather than being right with hindsight, and a large number of predictions were found and scored, so I consider this study to have similar advantages to the above study. ... Kurzweil is notorious for his very bold and contrarian predictions, and I'm overall inclined to call his track record something between "mediocre" and "fine" - too aggressive overall, but with some notable hits

Karnofsky's evaluation of Kurzweil being "fine" to "mediocre" relies on these two analyses done on LessWrong and then uses a very generous interpretation of the results to conclude that Kurzweil's predictions are fine. Those two posts rate predictions as true, weakly true, cannot decide, weakly false, or false. Karnofsky then compares the number of true + weakly true to false + weakly false, which is one level of rounding up to get an optimistic result; another way to look at it is that any level other than "true" is false when read as written. This issue is magnified if you actually look at the data and methodology used in the LW analyses.

In the second post, the author, Stuart Armstrong, indirectly noted that there were actually no predictions that were, by strong consensus, very true when he noted that the "most true" prediction had a mean score of 1.3 (1 = true, 2 = weakly true ... , 5 = false) and the second highest rated prediction had a mean score of 1.4. Although Armstrong doesn't note this in the post, if you look at the data, you'll see that the third "most true" prediction had a mean score of 1.45 and the fourth had a mean score of 1.6, i.e., if you round each prediction's mean score to the nearest rating, only 3 out of 105 predictions score "true" and 32 are >= 4.5 and score "false". Karnofsky reads Armstrong's post as scoring 12% of predictions true, but the post effectively makes no comment on what fraction of predictions were scored true and the 12% came from summing up the total number of each rating given.

I'm not going to say that taking the mean of each question is the only way one could aggregate the numbers (taking the median or modal values could also be argued for, as well as some more sophisticated scoring function, an extremizing function, etc.), but summing up all of the votes across all questions results in a nonsensical number that shouldn't be used for almost anything. If every rater rated every prediction or there was a systematic interleaving of who rated what questions, then the number could be used for something (though not as a score for what fraction of predictions are accurate), but since each rater could skip any question (although people were instructed to start rating at the first question and rate all questions until they stopped, people did not do that and skipped arbitrary questions), aggregating the number of each score given is not meaningful and gives very little insight into what fraction of questions are true. There's an air of rigor about all of this; there are lots of numbers, standard deviations are discussed, etc., but the way most people, including Karnofsky, interpret the numbers in the post is incorrect. I find it a bit odd that, with all of the commentary on these LW posts, few people spent the one minute (and I mean one minute literally — it took me a minute to read the post, see the comment Armstrong made which is a red flag, and then look at the raw data) it would take to look at the data and understand what the post is actually saying, but as we've noted previously, almost no one actually reads what they're citing.
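To make the aggregation issue concrete, here's a minimal sketch with made-up numbers (not Armstrong's data) showing how scoring each prediction by the mean of its own ratings can diverge from pooling every individual vote when raters skip questions:

    # Toy illustration (invented numbers): with sparse ratings, pooling all
    # votes across predictions gives a different answer than scoring each
    # prediction by the mean of its own ratings.
    import statistics

    # Scores: 1 = true ... 5 = false; None = the rater skipped the prediction.
    ratings = {
        "p1": [1, 1, None],   # clearly true, but only rated by two people
        "p2": [5, 5, 5],      # clearly false, rated by everyone
        "p3": [4, None, 5],   # false-ish, sparsely rated
    }

    # Method A: mean score per prediction, then count how many round to "true".
    per_prediction_means = {
        p: statistics.mean([v for v in votes if v is not None])
        for p, votes in ratings.items()
    }
    true_fraction_a = sum(m < 1.5 for m in per_prediction_means.values()) / len(ratings)

    # Method B: pool every individual vote and report the fraction of "1" votes,
    # which is roughly what summing up the total number of each rating does.
    all_votes = [v for votes in ratings.values() for v in votes if v is not None]
    true_fraction_b = all_votes.count(1) / len(all_votes)

    print(per_prediction_means)              # {'p1': 1.0, 'p2': 5.0, 'p3': 4.5}
    print(true_fraction_a, true_fraction_b)  # 0.33... vs 0.28...

The two methods answer different questions, and method B is additionally skewed by which predictions happened to attract more raters, which is why reading the pooled 12% as "12% of predictions were scored true" doesn't work.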

Coming back to Karnofsky's rating of Kurzweil as fine to mediocre, this relies on two levels of rounding: first, doing the wrong kind of aggregation on the raw data to round an accuracy of perhaps 3% up to 12%, and second, rounding up again by doing the comparison mentioned above instead of looking at the number of true statements. If we use a strict reading and look at the 3%, the numbers aren't so different from what we see in this post. If we look at Armstrong's other post, there are too few raters to really produce any kind of meaningful aggregation. Armstrong rated every prediction, one person rated 68% of predictions, and no one else rated even half of the 172 predictions. The 8 raters gave 506 ratings in total, which is equivalent to having roughly 3 raters rate every prediction, but the results are much noisier due to the arbitrary way people decided to pick predictions. This issue is much worse for the 2009 predictions than the 2019 predictions due to the smaller number of raters combined with the sparseness of most raters, making this data set fairly low fidelity; if you want to make a simple inference from the 2019 data, you're probably best off using Armstrong's ratings and discarding the rest (there are non-simple analyses one could do, but if you're going to do that, you might as well just rate the predictions yourself).

Another fundamental issue with the analysis is that it relies on aggregating votes from a population that's heavily drawn from LessWrong readers and the associated community. As we discussed here, it's common to see the most upvoted comments in forums like HN, lobsters, LW, etc., be statements that can clearly be seen to be wrong with no specialized knowledge and a few seconds of thought (and an example is given from LW in the link), so why should an aggregation of votes from the LW community be considered meaningful? I often see people refer to the high-level "wisdom of crowds" idea, but if we look at the specific statements endorsed by online crowds, we can see that these crowds are often not so wise. In the Arb Research evaluation (discussed below), they get around this problem by reviewing answers themselves and also offering a bounty for incorrectly graded predictions, which is one way to deal with having untrustworthy raters, but Armstrong's work has no mitigation for this issue.

On the Karnofsky / Arb Research evaluation, Karnofsky appears to use a less strict scoring than I do and once again optimistically "rounds up". The Arb Research report scores each question as "unambiguously wrong", "ambiguous or near miss", or "unambiguously right", but Karnofsky's scoring removes the ambiguous and near miss results, whereas my scoring only removes the ambiguous results, the idea being that a near miss is still a miss. Accounting for those reduces the scores substantially but still leaves Heinlein, Clarke, and Asimov with significantly higher scores than the futurists discussed in the body of this post. For the rest, many of the predictions that were scored as "unambiguously right" are ones I would've declined to rate, for reasons similar to those for the predictions I declined to rate elsewhere (e.g., a prediction that something "may well" happen was rated as "unambiguously right" and I would consider that unfalsifiable and therefore not include it). There are also quite a few "unambiguously right" predictions that I would rate as incorrect using a strict reading similar to the readings you can see below in the detailed appendix.

Another place where Karnofsky rounds up is that Arb Research notes that 'The predictions are usually very vague. Almost none take the form "By Year X technology Y will pass on metric Z"'. This makes the prediction accuracy of the futurists Arb Research looked at not comparable to precise predictions of the kind Caplan or Karnofsky himself makes, but Karnofsky directly uses those numbers to justify why his own predictions are accurate without noting that the numbers are not comparable. Since the non-comparable numbers were already rounded up, there are two levels of rounding here (more on this later).

As noted above, some of the predictions are ones that I wouldn't rate because I don't see where the prediction is, such as this one (this is the "exact text" of the prediction being scored, according to the Arb Research spreadsheet), which was scored "unambiguously right"

application of computer technology to professional sports be counterproduc- tive? Would the public become less interested in sports or in betting on the outcome if matters became more predictable? Or would there always be enough unpredictability to keep interest high? And would people derive particular excitement from beat ing the computer when low-ranking players on a particular team suddenly started

This seems like a series of questions about something that might happen, but wouldn't be false if none of these happened, so would not count as a prediction in my book.

Similarly, I would not have rated the following prediction, which Arb also scored "unambiguously right"

its potential is often realized in ways that seem miraculous, not because of idealism but because of the practical benefits to society. Thus, the computer's ability to foster human creativity may well be utilized to its fullest, not because it would be a wonderful thing but because it will serve important social functions Moreover, we are already moving in the

Another kind of prediction that was sometimes scored "unambiguously correct" that I declined to score was predictions of the form "this trend that's in progress will become somewhat {bigger / more important}", such as the following:

The consequences of human irresponsibility in terms of waste and pollution will become more apparent and unbearable with time and again, attempts to deal with this will become more strenuous. It is to be hoped that by 2019, advances in technology will place tools in our hands that will help accelerate the process whereby the deterioration of the environment will be reversed.

On Karnofsky's larger point, that we should trust longtermist predictions because futurists basically did fine and longtermists are taking prediction more seriously and trying harder and should therefore generate better predictions, that's really a topic for another post, but I'll briefly discuss it here because of the high intersection with this post. There are two main pillars of this argument. First, that futurists basically did fine which, as we've seen, relies on a considerable amount of rounding up. And second, that the methodologies that longtermists are using today are considerably more effective than what futurists did in the past.

Karnofsky says that the futurists he looked at "collect casual predictions - no probabilities given, little-to-no reasoning given, no apparent attempt to collect evidence and weigh arguments", whereas Karnofsky's summaries use (among other things):

We've seen that, when evaluating futurists with an eye towards evaluating longtermists, Karnofsky heavily rounds up in the same way Kurzweil and other futurists do, to paint the picture he wants to create. There's also the matter of his summary of a report on Kurzweil's predictions being incorrect because he didn't notice that the author of that report used a methodology that produced nonsense numbers which were favorable to the conclusion Karnofsky favors. It's true that Karnofsky and the reports he cites do the superficial things that the forecasting literature notes are associated with more accurate predictions, like stating probabilities. But for this to work, the probabilities need to come from understanding the data. If you take a pile of data, incorrectly interpret it, and then round up the interpretation further to support a particular conclusion, throwing a probability on it at the end is not likely to make it accurate. Although he doesn't use these words, a key thing Tetlock notes in his work is that people who round things up or down to conform to a particular agenda produce low accuracy predictions. Since Karnofsky's errors and rounding heavily lean in one direction, that seems to be happening here.

We can see this in other analyses as well. Although digging into material other than futurist predictions is outside of the scope of this post, nostalgebraist has done this and he said (in a private communication that he gave me permission to mention) that Karnofsky's summary of https://openphilanthropy.org/research/could-advanced-ai-drive-explosive-economic-growth/ is substantially more optimistic about AI timelines than the underlying report, in that there's at least one major concern raised in the report that's not brought up as a "con" in Karnofsky's summary. nostalgebraist later wrote this post, where he (implicitly) notes that the methodology used in a report he examined in detail is fundamentally not so different from what the futurists we discussed used. There are quite a few things that may make the report appear credible (it's hundreds of pages of research, there's a complex model, etc.), but when it comes down to it, the model boils down to a few simple variables. In particular, a huge fraction of the variance in whether or not TAI is likely comes down to how much improvement will occur in hardware cost, particularly FLOPS/$. The output of the model can range from 34% to 88% depending on how much improvement we get in FLOPS/$ after 2025. Putting arbitrarily large FLOPS/$ amounts into the model, i.e., the scenario where infinite computational power is free (since other dimensions, like storage and network, aren't in the model, let's assume that FLOPS/$ is a proxy for those as well), only pushes the probability of TAI up to 88%, which I would rate as too pessimistic, although it's hard to have a good intuition about what would actually happen if infinite computational power were on tap for free. Conversely, with no performance improvement in computers, the probability of TAI is 34%, which I would rate as overly optimistic without a strong case for it. But I'm just some random person who doesn't work in AI risk and hasn't thought about it too much, so your guess on this is as good as mine (and likely better if you're the equivalent of Yegge or Gates and work in the area).

The part that makes this fundamentally the same thing the futurists here did is that the FLOPS/$ estimate, which is instrumental to this prediction, is pulled from thin air by someone who is not a deep expert in semiconductors, computer architecture, or a related field that might inform the estimate.

As Karnofsky notes, a number of things were done in an attempt to make this estimate reliable ("the authors (and I) have generally done calibration training, and have tried to use the language of probability") but, when you come up with a model where a single variable controls most of the variance and the estimate for that variable is picked out of thin air, all of the modeling work actually reduces my confidence in the estimate. If you say that, based on your intuition, you think there's some significant probability of TAI by 2100; 10% or 50% or 80% or whatever number you want, I'd say that sounds plausible (why not? things are improving quickly and may continue to do so) but wouldn't place any particular faith in the estimate. If you build a model where the output hinges on a relatively small number of variables and then say that there's an 80% chance, with a critical variable picked out of thin air, should that estimate be more or less confidence inspiring than the estimate based solely on intuition? I don't think the answer should be that the model's output is higher confidence. The direct guess of 80% is at least honest about its uncertainty. In the model-based case, since the model doesn't propagate uncertainties and the choice of a high but uncertain number can cause the model to output a fairly certain number, like 88%, there's a disconnect between the actual uncertainty produced by the model and the probability estimate.
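To illustrate the disconnect, here's a toy sketch I made up (it is not the actual Bio Anchors model): a hypothetical model whose output is driven almost entirely by one input, the number of OOMs of FLOPS/$ improvement. Plugging in a point estimate produces one precise-looking probability, while admitting even modest uncertainty about that input spreads the output over a wide range:

    # Toy sketch, not the Bio Anchors model: a made-up mapping whose output
    # probability is driven almost entirely by one input (OOMs of FLOPS/$
    # improvement). The functional form and constants are arbitrary, chosen
    # only so that 0 OOMs gives ~0.34 and huge improvements saturate near
    # ~0.88, mimicking the range quoted above.
    import random

    def toy_probability(ooms_of_improvement):
        return 0.34 + 0.54 * (1 - 2 ** (-ooms_of_improvement / 2))

    # Point estimate: plug in 6 OOMs and get a single, confident-looking number.
    print(round(toy_probability(6.0), 2))  # ~0.81

    # Propagated estimate: if we only believe the input is "somewhere around
    # 2-10 OOMs", the output is a wide range rather than one crisp value.
    random.seed(0)
    samples = sorted(toy_probability(random.uniform(2.0, 10.0)) for _ in range(10_000))
    p5, p50, p95 = samples[500], samples[5_000], samples[9_500]
    print(round(p5, 2), round(p50, 2), round(p95, 2))  # roughly 0.64 / 0.81 / 0.86

The point of the sketch is only that the precision of the final number comes from the point estimate of the input, not from the model; once the input uncertainty is propagated, the output is much less certain than a single quoted probability suggests.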

At one point, in summarizing the report, Karnofsky says

I consider the "evolution" analysis to be very conservative, because machine learning is capable of much faster progress than the sort of trial-and-error associated with natural selection. Even if one believes in something along the lines of "Human brains reason in unique ways, unmatched and unmatchable by a modern-day AI," it seems that whatever is unique about human brains should be re-discoverable if one is able to essentially re-run the whole history of natural selection. And even this very conservative analysis estimates a ~50% chance of transformative AI by 2100

But it seems very strong to call this a "very conservative" estimate when the estimate implicitly relies on future FLOPS/$ improvement staying above some arbitrary, unsupported, threshold. In the appendix of the report itself, it's estimated that there will be a 6 order of magnitude (OOM) improvement and that a 4 OOM improvement would be considered conservative, but why should we expect that 6 OOM is the amount of headroom left for hardware improvement and that 4 OOM is some kind of conservative goal that we'll very likely reach? Given how instrumental these estimates are to the output of the model, there's a sense in which the uncertainty of the final estimate has to be at least as large as the uncertainty of these estimates multiplied by their impact on the model, but that can't be the case here given the lack of evidence or justification for these inputs to the model.

More generally, the whole methodology is backwards — if you have deep knowledge of a topic, then it can be valuable to put a number down to convey the certainty of your knowledge to other people, and if you don't have deep knowledge but are trying to understand an area, then it can be valuable to state your uncertainties so that you know when you're just guessing. But here, we have a fairly confidently stated estimate (nostalgebraist notes that Karnofsky says "Bio Anchors estimates a >10% chance of transformative AI by 2036, a ~50% chance by 2055, and an ~80% chance by 2100.") that's based on a nonsense model that relies on a variable picked out of thin air. Naming a high probability after the fact and then naming a lower number and saying that's conservative when it's based on this kind of modeling is just window dressing. Looking at Karnofsky's comments elsewhere, he lists a number of extremely weak pieces of evidence in support of his position, e.g., in the previous link, he has a laundry list of evidence of mixed strength, including Metaculus, which nostalgebraist has noted is basically worthless for this purpose here and here. It would be very odd for someone who's truth seeking on this particular issue to cite so many bad pieces of evidence; creating a laundry list of such mixed evidence is consistent with someone who has a strong prior belief and is looking for anything that will justify it, no matter how weak. That would also be consistent with the shoddy direct reasoning noted above.

Back to other evaluators, on Justin Rye's evaluations, I would grade the predictions "as written" and therefore more strictly than he did and would end up with lower scores.

For the predictors we looked at in this document who mostly or nearly exclusively give similar predictions, I declined to give them anything like a precise numerical score. To be clear, I think there's value in trying to score vague predictions and near misses, but that's a different thing than this document did, so the scores aren't directly comparable.

A number of people have said that predictions by people who make bold predictions, the way Kurzweil does, are actually pretty good. After all, if someone makes a lot of bold predictions and they're all off by 10 years, that person will have useful insights even if they lose all their bets and get taken to the cleaners in prediction markets. However, that doesn't mean that someone who makes bold predictions should always "get credit for" making bold predictions. For example, in Kurzweil's case, 7% accuracy might not be bad if he uniformly predicted really bold stuff like unbounded life span by 2011. However, that only applies if the hits and misses are both bold predictions, which was not the case in the sampled set of predictions for Kurzweil here. For the Kurzweil predictions evaluated in this document, his correct predictions tended to be very boring, e.g., there will be no giant economic collapse that stops economic growth, cochlear implants will be in widespread use in 2019 (predicted in 1999), etc.

The former is a Caplan-esque bet against people who were making wild predictions that there would be severe or total economic collapse. There's value in bets like that, but it's also not surprising when such a bet is successful. For the latter, the data I could quickly find on cochlear implant rates showed that implant rates increased slowly and linearly from the time Kurzweil made the bet until 2019. I would call that a correct prediction, but the prediction is basically just betting that nothing drastically drops cochlear implant rates, making that another Caplan-esque safe bet and not a bet that relies on Kurzweil's ideas about the law of accelerating returns that his wild bets rely on.

If someone makes 40 boring bets of which 7 are right and another person makes 40 boring bets and 22 wild bets and 7 of their boring bets and 0 of their wild bets are right (these are arbitrary numbers as I didn't attempt to classify Kurzweil's bets as wild or not other than the 7 that were scored as correct), do you give the latter person credit for having "a pretty decent accuracy given how wild their bets were"? I would say no.

On the linked HN thread, a particular futurist scored themselves 5 out of 10, but most HN commenters scored the same person at 0 out of 10 or, generously, at 1 out of 10, with the general comment that the person and other futurists tend to score themselves much too generously:

sixQuarks: I hate it when “futurists” cherry pick an outlier situation and say their prediction was accurate - like the bartender example.

karaterobot: I wanted to say the same thing. He moved the goal posts from things which "would draw hoots of derision from an audience from the year 2022" to things which there has been some marginal, unevenly distributed, incremental change to in the last 10 years, then said he got it about 50% right. More generally, this is the issue I have with futurists: they get things wrong, and then just keep making more predictions. I suppose that's okay for them to do, unless they try to get people to believe them, and make decisions based on their guesses.

chillacy: Reminded me of the ray [kurzweil] predictions: extremely generous grading.

Appendix: other reading

Appendix: detailed information on predictions

Ray Kurzweil

4/59 for rated predictions. If you feel that the ones I didn't include, but that one could arguably include, should count, then 7/62.

This list comes from Wikipedia's bulleted list of Kurzweil's predictions at the time Peter Diamandis, Kurzweil's co-founder for SingularityU, cited it to bolster the claim that Kurzweil has an 86% prediction accuracy rate. Off the top of my head, this misses quite a few predictions that Kurzweil made, such as life expectancy being "over one hundred" by 2019 and 120 by 2029 (prediction made in 1999), life expectancy being unbounded (increasing at one year per year) by 2011 (prediction made in 2001), and a computer beating the top human in chess by 2000 (prediction made in 1990).

It's likely that Kurzweil's accuracy rate would change somewhat if we surveyed all of his predictions, but it seems extremely implausible for the rate to hit 86% and, more broadly, looking at Kurzweil's vision of what the world would be like, it also seems impossible that we live in a world that's generally close to Kurzweil's imagined future.

The list above only uses the bulleted predictions from Wikipedia under the section that has per-timeframe sections. If you pull in other ones from the same page that could be evaluated, which includes predictions like '"nanotechnology-based" flying cars would be available [by 2026]', this doesn't hugely change the accuracy rate (and actually can't due to the relatively small number of other predictions).

Jacque Fresco

The foreword to Fresco's book gives a pretty good idea of what to expect from Fresco's predictions:

Looking Forward is an imaginative and fascinating book in which the authors take you on a journey into the culture and technology of the twenty-first century. After an introductory section that discusses the "Things that Shape Your Future," you will explore the whys and wherefores of the unfamiliar, alarming, but exciting world of a hundred years from now. You will see this society through the eyes of Scott and Hella, a couple of the next century. Their living quarters are equipped with a cybernator, a seemingly magical computer device, but one that is based on scientific principles now known. It regulates sleeping hours, communications throughout the world, an incredible underwater living complex, and even the daily caloric intake of the "young" couple. (They are in their forties but can expect to live 200 years.) The world that Scott and Hella live in is a world that has achieved full weather control, has developed a finger-sized computer that is implanted in the brain of every baby at birth (and the babies are scientifically incubated; the women of the twenty-first century need not go through the pains of childbirth), and that has perfected genetic manipulation that allows the human race to be improved by means of science. Economically, the world is Utopian by our standards. Jobs, wages, and money have long since been phased out. Nothing has a price tag, and personal possessions are not needed. Nationalism has been surpassed, and total disarmament has been achieved; educational technology has made schools and teachers obsolete. The children learn by doing, and are independent in this friendly world by the time they are five.

The chief source of this greater society is the Correlation Center, "Corcen," a gigantic complex of computers that serves but never enslaves mankind. Corcen regulates production, communication, transportation and all other burdensome and monotonous tasks of the past. This frees men and women to achieve creative challenging experiences rather than empty lives of meaningless leisure. Obviously this book is speculative, but it is soundly based upon scientific developments that are now known.

As mentioned above, Fresco makes the claim that it's possible to predict the future and that, to do so, one should start with the values people will have in the future. Many predictions are about "the 21st century" so they can arguably be defended as still potentially accurate, although, given the way the book talks about the stark divide between "the 20th century" and "the 21st century", we should have already seen the changes mentioned in the book since we're no longer in "the 20th century", and the book makes no reference to a long period of transition in between. Fresco does make some specific statements about things that will happen by particular dates, which are covered later. For "the 21st century", his predictions from the first section of his book are:

As mentioned above, the next part of Fresco's prediction is about how science will work. He writes about how "the scientific method" is only applied in a limited fashion, which led to thousands of years of slow progress. But, unlike in the 20th century, in the 21st century, people will be free from bias and apply "the scientific method" in all areas of their life, not just when doing science. People will be fully open to experimentation in all aspects of life and all people will have "a habitual open-mindedness coupled with a rigid insistence that all problems be formulated in a way that permits factual checking".

This will, among other things, lead to complete self-knowledge of one's own limitations for all people as well as an end to unhappiness due to suboptimal political and social structures:

The success of the method of science in solving almost every problem put to it will give individuals in the twenty-first century a deep confidence in its effectiveness. They will not be afraid to experiment with new ways of feeling, thinking, and acting, for they will have observed the self-corrective aspect of science. Science gives us the latest word, not the last word. They will know that if they try something new in personal or social life, the happiness it yields can be determined after sufficient experience has accumulated. They will adapt to changes in a relaxed way as they zigzag toward the achievement of their values. They will know that there are better ways of doing things than have been used in the past, and they will be determined to experiment until they have found them. They will know that most of the unhappiness of human beings in the mid-twentieth century was not due to the lack of shiny new gadgets; it was due, in part, to not using the scientific method to check out new political and social structures that could have yielded greater happiness for them

After discussing, at a high level, the implications on people and society, Fresco gets into specifics, saying that doing everything with computers, what Fresco calls a "cybernated" society, could be achieved by 1979, giving everyone a post-tax income of $100k/yr in 1969 dollars (about $800k/yr in 2022 dollars):

How would you like to have a guaranteed life income of $100,000 per year—with no taxes? And how would you like to earn this income by working a three-hour day, one day per week, for a five-year period of your life, providing you have a six-months vacation each year? Sound fantastic? Not at all with modern technology. This is not twenty-first-century pie-in-the-sky. It could probably be achieved in ten years in the United States if we applied everything we now know about automation and computers to produce a cybernated society. It probably won't be done this rapidly, for it would take some modern thinking applied in an intelligent crash program. Such a crash program was launched to develop the atomic bomb in a little over four years.

Other predictions about "cybernation":

Michio Kaku

That gives a correct prediction rate of 3%. I stopped reading at this point, so I may have missed a number of correct predictions. But, even if the rest of the book were full of correct predictions, the correct prediction rate would still likely be low.

There were also a variety of predictions that I didn't include because they were statements that were true in the present. For example

If the dirt road of the Internet is made up of copper wires, then the paved information highway will probably be made of laser fiber optics. Lasers are the perfect quantum device, an instrument which creates beams of coherent light (light beams which vibrate in exact synchronization with each other). This exotic form of light, which does not occur naturally in the universe, is made possible by manipulating the electrons making quantum jumps between orbits within an atom

This doesn't seem like much of a prediction since, when the book was written, the "information highway" already used a lot of fiber. Throughout the book, there's a lot of mysticism around quantum-ness which is, for example, on display above and cited as a reason that microprocessors will become obsolete by 2020 (they're not "quantum") and fiber optics won't (it's quantum):

John Naisbitt

Here are a few quotes that get at the methodology of Naisbitt's hit book, Megatrends:

For the past fifteen years, I have been working with major American corporations to try to understand what is really happening in the United States by monitoring local events and behavior, because collectively what is going on locally is what is going on in America.

Despite the conceits of New York and Washington, almost nothing starts there.

In the course of my work, I have been overwhelmingly impressed with the extent to which America is a bottom-up society, that is, where new trends and ideas begin in cities and local communities—for example, Tampa, Hartford, San Diego, Seattle, and Denver, not New York City or Washington, D.C. My colleagues and I have studied this great country by reading its local newspapers. We have discovered that trends are generated from the bottom up, fads from the top down. The findings in this book are based on an analysis of more than 2 million local articles about local events in the cities and towns of this country during a twelve-year period.

Out of such highly localized data bases, I have watched the general outlines of a new society slowly emerge.

We learn about this society through a method called content analysis, which has its roots in World War II. During that war, intelligence experts sought to find a method for obtaining the kinds of information on enemy nations that public opinion polls would have normally provided.

Under the leadership of Paul Lazarsfeld and Harold Lasswell, later to become well-known communication theorists, it was decided that we would do an analysis of the content of the German newspapers, which we could get—although some days after publication. The strain on Germany's people, industry, and economy began to show up in its newspapers, even though information about the country's supplies, production, transportation, and food situation remained secret. Over time, it was possible to piece together what was going on in Germany and to figure out whether conditions were improving or deteriorating by carefully tracking local stories about factory openings, closings, and production targets, about train arrivals, departures, and delays, and so on. ... Although this method of monitoring public behavior and events continues to be the choice of the intelligence community—the United States annually spends millions of dollars in newspaper content analysis in various parts of the world—it has rarely been applied commercially. In fact, The Naisbitt Group is the first, and presently the only, organization to utilize this approach in analyzing our society.

Why are we so confident that content analysis is an effective way to monitor social change? Simply stated, because the news hole in a newspaper is a closed system. For economic reasons, the amount of space devoted to news in a newspaper does not change significantly over time. So, when something new is introduced, something else or a combination of things must be omitted. You cannot add unless you subtract. It is the principle of forced choice in a closed system.

In this forced-choice situation, societies add new preoccupations and forget old ones. In keeping track of the ones that are added and the ones that are given up, we are in a sense measuring the changing share of the market that competing societal concerns command.

Evidently, societies are like human beings. A person can keep only so many problems and concerns in his or her head or heart at any one time. If new problems or concerns are introduced, some existing ones are given up. All of this is reflected in the collective news hole that becomes a mechanical representation of society sorting out its priorities.

Naisbitt rarely makes falsifiable predictions. For example, on the "information society", Naisbitt says

In our new information society, the time orientation is to the future. This is one of the reasons we are so interested in it. We must now learn from the present how to anticipate the future. When we can do that, we will understand that a trend is not destiny; we will be able to learn from the future the way we have been learning from the past.

This change in time orientation accounts for the growing popular and professional interest in the future during the 1970s. For example, the number of universities offering some type of futures-oriented degree has increased from 2 in 1969 to over 45 in 1978. Membership in the World Future Society grew from 200 in 1967 to well over 30,000 in 1982, and the number of popular and professional periodicals devoted to understanding or studying the future has dramatically increased from 12 in 1965 to more than 122 in 1978.

This could be summed up as "in the future, people will think more about the future". Pretty much any case one might make that Naisbitt's claims ended up being true or false could be argued against.

In the chapter on the "information society", one of the most specific predictions is

New information technologies will at first be applied to old industrial tasks, then, gradually, give birth to new activities, processes, and products.

I'd say that this is false in the general case, but it's vague enough that you could argue it's true.

A rare falsifiable comment is this prediction about the price of computers:

The home computer explosion is upon us, soon to be followed by a software implosion to fuel it. It is projected that by the year 2000, the cost of a home computer system (computer, printer, monitor, modem, and so forth) should only be about that of the present telephone-radio-recorder-television system.

From a quick search, it seems that reference devices cost something like $300 in 1982? That would be $535 in 2000, which wasn't really a reasonable price for a computer as well as the peripherals mentioned and implied by "and so forth".
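As a quick sanity check of that adjustment (using approximate annual-average CPI-U values, which are my assumption rather than exact figures):

    # Rough inflation adjustment; CPI-U values are approximate annual averages.
    CPI_1982 = 96.5
    CPI_2000 = 172.2

    price_1982 = 300.0
    price_in_2000_dollars = price_1982 * CPI_2000 / CPI_1982
    print(round(price_in_2000_dollars))  # ~535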

Gerard K. O'Neill

We discussed O'Neill's predictions on space colonization in the body of this post. This section contains a bit on his other predictions.

On computers, O'Neill says that in 2081 "any major central computer will have rapid access to at least a hundred million million words of memory (the number '1' followed by 14 zeros). A computer of that memory will be no larger than a suitcase. It will be fast enough to carry out a complete operation in no more time than it takes light to travel from this page to your eye, and perhaps a tenth of that time", which is saying that a machine will have 100 TWords of RAM or, to round things up simply, let's say 1 PB of RAM, and a clock speed of something between 300 MHz and 6 GHz, depending on how far away from your face you hold a book.
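Here's the back-of-the-envelope conversion behind those numbers, with my own assumptions of 8 bytes per "word" and a reading distance somewhere between 0.5 m and 1 m:

    # Converting O'Neill's phrasing into modern units. Assumptions (mine, not
    # O'Neill's): 8 bytes per "word", reading distance of 0.5 m to 1 m.
    SPEED_OF_LIGHT = 3.0e8  # m/s

    words = 100e6 * 1e6      # "a hundred million million" = 1e14 words
    bytes_total = words * 8  # 8e14 bytes, i.e. a bit under 1 PB
    print(bytes_total / 1e15, "PB")  # 0.8 PB

    for distance_m in (1.0, 0.5):
        light_time = distance_m / SPEED_OF_LIGHT  # seconds for one operation
        full = 1 / light_time / 1e9               # GHz if an op takes the full light time
        tenth = 10 / light_time / 1e9             # GHz if it takes a tenth of that
        print(distance_m, "m:", round(full, 2), "GHz to", round(tenth, 1), "GHz")
    # 1.0 m: 0.3 GHz to 3.0 GHz
    # 0.5 m: 0.6 GHz to 6.0 GHz

So the quoted 300 MHz to 6 GHz range corresponds to taking the full page-to-eye light time at about a meter on the low end and a tenth of it at about half a meter on the high end.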

On other topics, O'Neill predicts we'll have fully automated manufacturing, people will use 6 times as much energy per capita in 2081 as in 1980, pollution other than carbon dioxide will be a solved problem, coal plants will still be used, most (50% to 95%) of energy will be renewable (with the caveat that "ground-based solar" is a "myth" that can never work, and that wind, tide, and hydro are all forms of solar that, even combined with geothermal thrown in, can't reasonably provide enough energy), and that solar power from satellites is the answer to then-current and future energy needs.

In The Technology Edge, O'Neill makes predictions for the 10 years following the book's publication in 1983. O'Neill says "the book is primarily based on interviews with chief executives". It was written at a time when many Americans were concerned about the impending Japanese dominance of the world. O'Neill says

As an American, I cannot help being angry — not at the Japanese for succeeding, but at the forces of timidity, shortsightedness, greed, laziness and misdirection here in America that have mired us down so badly in recent years, sapped our strength and kept us from equal achievements.

As we will see, opportunities exist now for the opening of whole new industries that can become even greater than those we have lost to the Japanese. Are we to delay and lose those too?

In an interview about the book, O'Neill said

microengineering, robotics, genetic engineering, magnetic flight, family aircraft, and space science. If the U.S. does not compete successfully in these areas, he warns, it will lose the technological and economic leadership it has enjoyed.

This seems like a big miss with both serious false positives and false negatives. O'Neill failed to cite industries that ended up being important to the then-continued U.S. dominance of the world economy, e.g., software, and also predicted that space and flight would be much more important than they turned out to be.

On the specific mechanism, O'Neill also generally misses, e.g., in the book, O'Neill cites the lack of U.S. PhD production and people heading directly into industry as a reason the U.S. was falling behind and would continue to fall behind Japan, but in a number of important industries, like software, a lot of the major economic/business contributions have been made by people going to industry without a PhD. The U.S. didn't need to massively increase PhD production in the decades following 1983 to stay economically competitive.

There's quite a bit of text dedicated to a commonly discussed phenomenon at the time, how Japanese companies were going to wipe the floor with American and European companies because they're able to make and execute long-term plans, unlike American companies. I'll admit that it's a bit of a mystery to me how short-term thinking has worked so well for American companies, at least to date.

Patrick Dixon

Dixon opens with:

The next millennium will witness the greatest challenges to human survival ever in human history, and many of them will face us in the early years of its first century ...

The future has six faces, each of which will have a dramatic effect on all of us in the third millennium ... [Fast, Urban, Tribal, Universal, Radical, Ethical, which spells out FUTURE]

Out of these six faces cascade over 500 key expectations, specific predictions as logical workings-out of these important global trends. These range from inevitable to high probability to lower probability — but still significant enough to require strategic planning and personal preparation.

That's the end of the introduction. Some of these predictions are arguably too early to call since, in places, Dixon writes as if Futurewise is about the entire "third millennium", but Dixon also notes that drastic changes are expected in the first years and decades of the 21st century and these generally have not come to pass, both in the specific cases where Dixon calls out particular timelines and in the cases where Dixon doesn't name a particular timeline. In general, I'm trying to only include predictions where it seems that Dixon is referring to the 2022 timeframe or before, but his general vagueness makes it difficult to make the right call 100% of the time.

The next chapter is titled "Fast" and is about the first of the six "faces" of the future.

This marks the end of the "Fast" chapter. From having skimmed the rest of the book, the hit rate isn't really higher later nor is the style of reasoning any different, so I'm going to avoid doing a prediction-by-prediction grading. Instead, I'll just mention a few highlights (some quite accurate, but mostly not; not included in the prediction accuracy rate since I didn't ensure consistent or random sampling):

Overall accuracy: 8/79 ≈ 10%

Toffler

Intro to Future Shock:

Another reservation has to do with the verb "will." No serious futurist deals in "predictions." These are left for television oracles and newspaper astrologers. ... Yet to enter every appropriate qualification in a book of this kind would be to bury the reader under an avalanche of maybes. Rather than do this, I have taken the liberty of speaking firmly, without hesitation, trusting that the intelligent reader will understand the stylistic problem. The word "will" should always be read as though it were preceded by "probably" or "in my opinion." Similarly, all dates applied to future events need to be taken with a grain of judgment.

[Chapter 1 is about how future shock is going to be a big deal in the future and how we're presently undergoing a revolution]

Despite the disclaimer in the intro, there are very few concrete predictions. The first that I can see is in the middle of chapter two and isn't even really a prediction, but is a statement that very weakly implies world population growth will continue at the same pace or accelerate. Chapter 1 has a lot of vague statements about how severe future shock will be, and then Chapter 2 discusses how the world is changing at an unprecedented rate and cites a population doubling time of eleven years to note how much this must change the world, since it would require the equivalent of a new Tokyo, Hamburg, Rome, and Rangoon in eleven years, illustrating how shockingly rapidly the world is changing. There's a nod to the creation of future subterranean cities, but stated weakly enough that it can't really be called a prediction.

There's a similar implicit prediction that economic growth will continue with a doubling time of fifteen years, meaning that by the time someone is thirty, the amount of stuff (and it's phrased as amount of stuff and not wealth) will have quadrupled and then by the time someone is seventy it will have increased by a factor of thirty two. This is a stronger implicit prediction than the previous one since the phrasing implies this growth rate should continue for at least seventy years and is perhaps the first actual prediction in the book.

Another such prediction appears later in the chapter, on the speed of travel, which took millions of years to reach 100 mph in the 1880s, only fifty-eight years to reach 400 mph in 1938, and then twenty years to double again, and then not much more time before rockets could propel people at 4000 mph and people circled the earth at 18000 mph. Strictly speaking, no prediction is made as to the speed of travel in the future, but since the two chapters are about how this increased rate of change will, in the future, cause future shock, citing examples where exponential growth is expected to level off as reasons the future is going to cause future shock would be silly, and implicit in the citation is the idea that the speed of travel will continue to grow.

Toffler then goes on to cite a series of examples where, at previous times in history, the time between having an idea and applying the idea was large, shrinking as we get closer to the present, where it's very low because "we have, with the passage of time, invented all sorts of social devices to hasten the process".

Through Chapter 4, Toffler continued to avoid making concrete, specific predictions, but also implied that buildings would be more temporary and that, in the United States specifically, there would be an increase in tearing down old buildings (e.g., ten-year-old apartment buildings) to build new ones because new buildings would be so much better than old ones that it wouldn't make sense to live in old buildings, and that schools would move to using temporary buildings that are quickly dismantled after they're no longer necessary, perhaps often using geodesic domes.

Also implied is a general increase in modularity, with parts of buildings being swapped out to allow more rapid changes during the short, 25-year life of a modern building.

Another implied prediction is that everything will be rented instead of owned, with specific examples cited of cars and homes, with an extremely rapid growth in the rate of renting cars rather than owning them continuing through the 70s in the then-near future.

Through Chapter 5, Toffler continued to avoid making specific predictions, but very strongly implies that the amount of travel people will do for mundane tasks such as commuting will hugely increase, making location essentially irrelevant. As with previous implied predictions, this is based on a very rapid increase in what Toffler views as a trend and is implicitly a prediction of the then very near future, citing people who commute 50k miles in a year and 120 miles in a day and citing stats showing that miles traveled have been increasing. When it comes to an actual prediction, Toffler makes the vague comment

among those I have characterized as "the people of the future," commuting, traveling, and regularly relocating one's family have become second nature.

Which, if read very strictly, is technically not a prediction about the future, although it can be implied that people in the future will commute and travel much more.

In a similar implicit prediction, Toffler implies that, in the future, corporations will order highly skilled workers to move to whatever location most benefits the corporation and they'll have no choice but to obey if they want to have a career.

In Chapter 6, in a rare concrete prediction, Toffler writes

When asked "What do you do?" the super-industrial man will label himself not in terms of his present (transient) job, but in terms of his trajectory type, the overall pattern of his work life.

Some obsolete example job types that Toffler presents are "machine operator", "sales clerk", and "computer programmer". Implicit in this section is that career changes will be so rapid and so frequent that the concept of being "a computer programmer" will be meaningless in the future. It's also implied that the half-life of knowledge will be so short in the future that people will no longer accumulate useful knowledge over the course of their career in the future and people, especially in management, shouldn't expect to move up with age and may be expected to move down with age as their knowledge becomes obsolete and they end up in "simpler" jobs.

It's also implied that more people will work for temp agencies, replacing what would previously have been full-time roles. The book is highly U.S. centric and, in the book, this is considered positive for workers (it will give people more flexibility) without mentioning any of the downsides (lack of benefits, etc.). The chapter has some actual explicit predictions about how people will connect to family and friends, but the predictions are vague enough that it's difficult to say if the prediction has been satisfied or not.

In chapter 7, Toffler says that bureaucracies will be replaced by "adhocracies". Where bureaucracies had top-down power and put people into well-defined roles, in adhocracies, roles will change so frequently that people won't get stuck into defined roles. Toffler notes that a concern some people have about the future is that, since organizations will get larger and more powerful, people will feel like cogs, but this concern is unwarranted because adhocracy will replace bureaucracy. This will also mean an end to top-down direction because the rapid pace of innovation in the future won't leave time for any top-down decision making, giving workers power. Furthermore, computers will automate all mundane and routine work, leaving no more need for bureaucracy because bureaucracy will only be needed to control large groups of people doing routine work and has no place in non-routine work. It's implied that "in the next twenty-five to fifty years [we will] participate in the end of bureaucracy". As Toffler was writing in 1970, his timeframe for that prediction is 1995 to 2020.

Chapter 8 takes the theme of everything being quicker and turns it to culture. Toffler predicts that celebrities, politicians, sports stars, famous fictional characters, best selling books, pieces of art, knowledge, etc., will all have much shorter careers and/or durations of relevance in the future. Also, new, widely used, words will be coined more rapidly than in the past.

Chapter 9 takes the theme of everything accelerating and notes that social structures and governments are poised to break down under the pressure of rapid change, as evidenced by unrest in Berlin, New York, Turin, Tokyo, Washington, and Chicago. It's possible this is what Toffler is using to take credit for predicting the fall of the Soviet Union?

Under the subheading "The New Atlantis", Toffler predicts an intense race to own the bottom of the ocean and the associated marine life there, with entire new industries springing up to process the ocean's output. "Aquaculture" will be as important as "agriculture", new textiles, drugs, etc., will come from the ocean. This will be a new frontier, akin to the American frontier, people will colonize the ocean. Toffler says "If all this sounds too far off it is sobering to note that Dr. Walter L. Robb, a scientist at General Electric has already kept a hamster alive under water by enclosing it in a box that is, in effect, an artificial gill--a synthetic membrane that extracts air from the surrounding water while keeping the water out." Toffler gives the timeline for ocean colonization as "long before the arrival of A.D. 2000".

Toffler also predicts control over the weather starting in the 70s, and that "It is clearly only a matter of years" before women are able to birth children "without the discomfort of pregnancy".

I stopped reading at this point because the chapters all seem very similar to each other, applying the same reasoning to different areas and the rate of accuracy of predictions didn't seem likely to increase in later chapters.


  1. I used web.archive.org to pull an older list because the current list of futurists is far too long for people to evaluate. I clicked on an arbitrary time in the past on archive.org and that list seemed to be short enough to evaluate (though, given the length of this post, perhaps that's not really true) and then looked at those futurists. [return]
  2. While there are cases where people can make great predictions or otherwise show off expertise while making "cocktail party idea" level statements because it's possible to have a finely honed intuition without being able to verbalize the intuition, developing that kind of intuition requires taking negative feedback seriously in order to train your intuition, which is the opposite of what we observed with the futurists discussed in this post. [return]
  3. Ballmer is laughing with incredulity when he says this; $500 is too expensive for a phone and it will be the most expensive phone by far; a phone without a keyboard won't appeal to business users and won't be useful for writing emails; you can get "great" Windows Phone devices like the Motorola QPhone for $100, which will do everything (messaging, email, etc.), etc.

    You can see these kinds of futurist-caliber predictions all over the place in big companies. For example, on internal G+ at Google, Steve Yegge made a number of quite accurate predictions about what would happen with various major components of Google, such as Google cloud. If you read comments from people who are fairly senior, many disagreed with Yegge for reasons that I would say were fairly transparently bad at the time and were later proven to be incorrect by events. There's a sense in which you can say this means that what's going to happen isn't so obvious even with the right information, but this really depends on what you mean by obvious.

    A kind of anti-easter egg in Tetlock's Superforecasting is that Tetlock makes the "smart contrarian" case that the Ballmer quote is unjustly attacked since worldwide iPhone marketshare isn't all that high and he also claims that Ballmer is making a fairly measured statement that's been taken out of context, which seems plausible if you read the book and look at the out of context quote Tetlock uses but is obviously untrue if you watch the interview the quote comes from. Tetlock has mentioned that he's not a superforecaster and has basically said that he doesn't have the patience necessary to be one, so I don't hold this against him, but I do find it a bit funny that this bogus Freakonomics-style contrarian "refutation" is in this book that discusses, at great length, how important it is to understand the topic you're discussing.

    [return]
  4. Although this is really a topic for another post, I'll note that longtermists not only often operate with the same level of certainty, but also on the exact same topics, e.g., in 2001, noted longtermist Eliezer Yudkowsky said the following in a document describing Flare, his new programming language:

    A new programming language has to be really good to survive. A new language needs to represent a quantum leap just to be in the game. Well, we're going to be up-front about this: Flare is really good. There are concepts in Flare that have never been seen before. We expect to be able to solve problems in Flare that cannot realistically be solved in any other language. ... Back in the good old days, it may have made sense to write "efficient" programming languages. This, however, is a new age. The age of microwave ovens and instant coffee. The age of six-month-old companies, twenty-two-year-old CEOs and Moore's Law. The age of fiber optics. The age of speed. ... "Efficiency" is the property that determines how much hardware you need, and "scalability" is the property that determines whether you can throw more hardware resources at the problem. In extreme cases, lack of scalability may defeat some problems entirely; for example, any program built around 32-bit pointers may not be able to scale at all past 4GB of memory space. Such a lack of scalability forces programmer efforts to be spent on efficiency - on doing more and more with the mere 4GB of memory available. Had the hardware and software been scalable, however, more RAM could have been bought; this is not necessarily cheap but it is usually cheaper than buying another programmer. ... Scalability also determines how well a program or a language ages with time. Imposing a hard limit of 640K on memory or 4GB on disk drives may not seem absurd when the decision is made, but the inexorable progress of Moore's Law and its corollaries inevitably bumps up against such limits. ... Flare is a language built around the philosophy that it is acceptable to sacrifice efficiency in favor of scalability. What is important is not squeezing every last scrap of performance out of current hardware, but rather preserving the ability to throw hardware at the problem. As long as scalability is preserved, it is also acceptable for Flare to do complex, MIPsucking things in order to make things easier for the programmer. In the dawn days of computing, most computing tasks ran up against the limit of available hardware, and so it was necessary to spend a lot of time on optimizing efficiency just to make computing a bearable experience. Today, most simple programs will run pretty quickly (instantly, from the user's perspective), whether written in a fast language or a slow language. If a program is slow, the limiting factor is likely to be memory bandwidth, disk access, or Internet operations, rather than RAM usage or CPU load. ... Scalability often comes at a cost in efficiency. Writing a program that can be parallelized traditionally comes at a cost in memory barrier instructions and acquisition of synchronization locks. For small N, O(N) or O(N**2) solutions are sometimes faster than the scalable O(C) or O(N) solutions. A two-way linked list allows for constant-time insertion or deletion, but at a cost in RAM, and at the cost of making the list more awkward (O(N) instead of O(C) or O(log N)) for other operations such as indexed lookup. Tracking Flare's two-way references through a two-way linked list maintained on the target burns RAM to maintain the scalability of adding or deleting a reference. Where only ten references exist, an ordinary vector type would be less complicated and just as fast, or faster. 
    Using a two-way linked list adds complication and takes some additional computing power in the smallest case, and buys back the theoretical capability to scale to thousands or millions of references pointing at a single target... though perhaps for such an extreme case, further complication might be necessary.

    As with the other Moore's law predictions of the era, this is not only wrong in retrospect, it was so obviously wrong that undergraduates were taught why this was wrong.

    [return]
  5. My personal experience is that, as large corporations have gotten more powerful, the customer experience has often gotten significantly worse as I'm further removed from a human who feels empowered to do anything to help me when I run into a real issue. And the only reason my experience can be described as merely significantly worse and not much worse is that I have enough Twitter followers that when I run into a bug that makes a major corporation's product stop working for me entirely (which happened twice in the past year), I can post about it on Twitter and it's likely someone will escalate the issue enough that it will get fixed.

    In 2005, when I interacted with corporations, it was likely that I was either directly interacting with someone who could handle whatever issue I had or that I only needed a single level of escalation to get there. And, in the event that the issue wasn't solvable (which never happened to me, but could happen), the market was fragmented enough that I could just go use another company's product or service. More recently, in the two cases where I had to resort to getting support via Twitter, one of the products essentially has no peers, so my ability to use any product or service of that kind would have ended if I wasn't able to find a friend of a friend to help me or if I couldn't craft some kind of viral video / blog post / tweet / etc. In the other case, there are two companies in the space, but one is much larger and offers effective service over a wider area, so I would've lost the ability to use an entire class of product or service in many areas with no recourse other than "going viral". There isn't a simple way to quantify whether or not this effect is "larger than" the improvements which have occurred and if, on balance, consumer experiences have improved or regressed, but there are enough complaints about how widespread this kind of thing is that degraded experiences should at least have some weight in the discussion, and Kurzweil assigns them zero weight.

    [return]
  6. If it turns out that longtermists and other current predictors of the future very heavily rely on the same techniques as futurists past, I may not write up the analysis since it will be quite long and I don't think it's very interesting to write up a very long list of obvious blunders. Per the comment above about how this post would've been more interesting if it focused on business leaders, it's a lot more interesting to write up an analysis if there are some people using reasonable methodologies that can be compared and contrasted.

    Conversely, if people predicting the future don't rely on the techniques discussed here at all, then an analysis informed by futurist methods would be a fairly straightforward negative result that could be a short Twitter thread or a very short post. As Catherine Olsson points out, longtermists draw from a variety of intellectual traditions (and I'm not close enough to longtermist culture to personally have an opinion of the relative weights of these traditions):

    Modern 'longtermism' draws on a handful of intellectual traditions, including historical 'futurist' thinking, as well as other influences ranging from academic philosophy of population ethics to Berkeley rationalist culture.

    To the extent that 'longtermists' today are using similar prediction methods to historical 'futurists' in particular, [this post] bodes poorly for longtermists' ability to anticipate technological developments in the coming decades

    If there's a serious "part 2" to this post, we'll look at this idea and others but, for the reasons mentioned above, there may not be much of a "part 2" to this post.

    [return]
  7. This post by nostalgebraist gives another example of this, where Metaculus uses Brier scores for scoring, just like Tetlock did for his Superforecasting work. This gives it an air of credibility until you look at what's actually being computed, which is not something that's meaningful to take a Brier score over, meaning the result of using this rigorous, Superforecasting-approved, technique is nonsense; exactly the kind of thing McElreath warns about. [return]

In defense of simple architectures

2022-04-06 08:00:00

Wave is a $1.7B company with 70 engineers1 whose product is a CRUD app that adds and subtracts numbers. In keeping with this, our architecture is a standard CRUD app architecture, a Python monolith on top of Postgres. Starting with a simple architecture and solving problems in simple ways where possible has allowed us to scale to this size while engineers mostly focus on work that delivers value to users.

Stackoverflow scaled up a monolith to good effect (2013 architecture / 2016 architecture), eventually getting acquired for $1.8B. If we look at traffic instead of market cap, Stackoverflow is among the top 100 highest traffic sites on the internet (for many other examples of valuable companies that were built on top of monoliths, see the replies to this Twitter thread. We don’t have a lot of web traffic because we’re a mobile app, but Alexa still puts our website in the top 75k even though our website is basically just a way for people to find the app and most people don’t even find the app through our website).

There are some kinds of applications that have demands that would make a simple monolith on top of a boring database a non-starter but, for most kinds of applications, even at top-100 site levels of traffic, computers are fast enough that high-traffic apps can be served with simple architectures, which can generally be created more cheaply and easily than complex architectures.

Despite the unreasonable effectiveness of simple architectures, most press goes to complex architectures. For example, at a recent generalist tech conference, there were six talks on how to build or deal with side effects of complex, microservice-based, architectures and zero on how one might build out a simple monolith. There were more talks on quantum computing (one) than talks on monoliths (zero). Larger conferences are similar; a recent enterprise-oriented conference in SF had a double-digit number of talks on dealing with the complexity of a sophisticated architecture and zero on how to build a simple monolith. Something that was striking to me the last time I attended that conference is how many attendees who worked at enterprises with low-scale applications that could’ve been built with simple architectures had copied the latest and greatest sophisticated techniques that are popular on the conference circuit and HN.

Our architecture is so simple I’m not even going to bother with an architectural diagram. Instead, I’ll discuss a few boring things we do that help us keep things boring.

We’re currently using boring, synchronous, Python, which means that our server processes block while waiting for I/O, like network requests. We previously tried Eventlet, an async framework that would, in theory, let us get more efficiency out of Python, but ran into so many bugs that we decided the CPU and latency cost of waiting for events wasn’t worth the operational pain we had to take on to deal with Eventlet issues. There are other well-known async frameworks for Python, but users of those frameworks also report significant fallout from running them at scale. Using synchronous Python is expensive, in the sense that we pay for CPU that does nothing but wait during network requests, but since we’re only handling billions of requests a month (for now), the cost of this is low even when using a slow language, like Python, and paying retail public cloud prices. The cost of our engineering team completely dominates the cost of the systems we operate2.
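
To put a rough number on "billions of requests a month": the two billion figure below is just an illustrative assumption, not our actual traffic, but it shows why the average request rate is modest by modern hardware standards.

    # Back-of-envelope: assume an illustrative 2 billion requests/month.
    requests_per_month = 2_000_000_000
    seconds_per_month = 30 * 24 * 60 * 60  # ~2.6 million seconds
    average_rps = requests_per_month / seconds_per_month
    print(f"~{average_rps:.0f} requests/second on average")  # ~770
    # Even with a generous peak-to-average ratio, this is a load that a
    # modest fleet of blocking Python workers can absorb.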

Rather than take on the complexity of making our monolith async, we farm out long-running tasks (that we don’t want responses to block on) to a queue.
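
As a minimal sketch of what "farm it out to a queue" looks like (we use Celery, as noted below, but the broker URL, task, and function names here are invented for illustration rather than taken from our codebase):

    from celery import Celery

    # Hypothetical Celery app backed by a RabbitMQ broker.
    app = Celery("tasks", broker="amqp://guest@localhost//")

    @app.task
    def send_receipt_sms(transfer_id: int) -> None:
        # Slow I/O (e.g., calling a telecom API) runs here, in a worker
        # process, outside the request/response cycle.
        ...

    def handle_transfer(transfer_id: int) -> dict:
        # The synchronous request handler does its quick database work,
        # enqueues the slow part, and returns immediately.
        send_receipt_sms.delay(transfer_id)
        return {"status": "ok"}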

A place where we can’t be as boring as we’d like is with our on-prem datacenters. When we were operating solely in Senegal and Côte d'Ivoire, we operated fully in the cloud, but as we expand into Uganda (and more countries in the future), we’re having to split our backend and deploy on-prem to comply with local data residency laws and regulations. That's not exactly a simple operation, but as anyone who's done the same thing with a complex service-oriented architecture knows, this operation is much simpler than it would've been if we had a complex service-oriented architecture.

Another area is with software we’ve had to build (instead of buy). When we started out, we strongly preferred buying software over building it because a team of only a few engineers can’t afford the time cost of building everything. That was the right choice at the time even though the “buy” option generally gives you tools that don’t work. In cases where vendors can’t be convinced to fix showstopping bugs that are critical blockers for us, it does make sense to build more of our own tools and maintain in-house expertise in more areas, in contradiction to the standard advice that a company should only choose to “build” in its core competency. Much of that complexity is complexity that we don’t want to take on, but in some product categories, even after fairly extensive research we haven’t found any vendor that seems likely to provide a product that works for us. To be fair to our vendors, the problem they’d need to solve to deliver a working solution to us is much more complex than the problem we need to solve since our vendors are taking on the complexity of solving a problem for every customer, whereas we only need to solve the problem for one customer, ourselves.

A mistake we made in the first few months of operation that has some cost today was not carefully delimiting the boundaries of database transactions. In Wave’s codebase, the SQLAlchemy database session is a request-global variable; it implicitly begins a new database transaction any time a DB object’s attribute is accessed, and any function in Wave’s codebase can call commit on the session, causing it to commit all pending updates. This makes it difficult to control the time at which database updates occur, which increases our rate of subtle data-integrity bugs, as well as making it harder to lean on the database to build things like idempotency keys or a transactionally-staged job drain. It also increases our risk of accidentally holding open long-running database transactions, which can make schema migrations operationally difficult.
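
Here's a rough sketch of the difference (the Transfer model and fee logic are invented for illustration, and the context manager shown is just one conventional way to make transaction boundaries explicit in SQLAlchemy, not our actual approach):

    from contextlib import contextmanager
    from sqlalchemy import Column, Integer, create_engine
    from sqlalchemy.orm import declarative_base, scoped_session, sessionmaker

    Base = declarative_base()

    class Transfer(Base):  # hypothetical model, purely for illustration
        __tablename__ = "transfers"
        id = Column(Integer, primary_key=True)
        fee = Column(Integer, default=0)

    engine = create_engine("sqlite://")  # in-memory DB so the sketch runs standalone
    Base.metadata.create_all(engine)
    Session = scoped_session(sessionmaker(bind=engine))

    # The problematic pattern: a request-global session that any helper can
    # commit, so it's unclear when updates actually hit the database.
    def apply_fee(transfer):
        transfer.fee = 100
        Session.commit()  # surprise commit buried inside a helper

    # One conventional alternative: open and close the transaction in exactly
    # one place per request, so all updates commit (or roll back) together.
    @contextmanager
    def transaction():
        session = Session()
        try:
            yield session
            session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            Session.remove()

    def handle_request(transfer_id):
        with transaction() as session:
            transfer = session.get(Transfer, transfer_id)
            if transfer is not None:
                transfer.fee = 100  # committed once, at the end of the request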

Some choices that we’re unsure about (in that these are things we’re either thinking about changing, or would recommend to other teams starting from scratch to consider a different approach) were using RabbitMQ (for our purposes, Redis would probably work equally well as a task queue and just using Redis would reduce operational burden), using Celery (which is overcomplicated for our use case and has been implicated in several outages e.g. due to backwards compatibility issues during version upgrades), using SQLAlchemy (which makes it hard for developers to understand what database queries their code is going to emit, leading to various situations that are hard to debug and involve unnecessary operational pain, especially related to the above point about database transaction boundaries), and using Python (which was the right initial choice because of our founding CTO’s technical background, but its concurrency support, performance, and extensive dynamism make us question whether it’s the right choice for a large-scale backend codebase). None of these was a major mistake, and for some (e.g. Python) the downsides are minimal enough that it’s cheaper for us to continue to pay the increased maintenance burden than to invest in migrating to something theoretically better, but if we were starting a similar codebase from scratch today we’d think hard about whether they were the right choice.

Some areas where we’re happy with our choices even though they may not sound like the simplest feasible solution is with our API, where we use GraphQL, with our transport protocols, where we had a custom protocol for a while, and our host management, where we use Kubernetes. For our transport protocols, we used to use a custom protocol that runs on top of UDP, with an SMS and USSD fallback, for the performance reasons described in this talk. With the rollout of HTTP/3, we’ve been able to replace our custom protocol with HTTP/3 and we generally only need USSD for events like the recent internet shutdowns in Mali.

As for using GraphQL, we believe the pros outweigh the cons for us:

Pros:

Cons:

As for Kubernetes, we use Kubernetes because we knew that, if the business was successful (which it has been) and we kept expanding, we’d eventually expand to countries that require us to operate our services in country. The exact regulations vary by country, but we’re already expanding into one major African market that requires we operate our “primary datacenter” in the country and there are others with regulations that, e.g., require us to be able to fail over to a datacenter in the country.

An area where there’s unavoidable complexity for us is with telecom integrations. In theory, we would use a SaaS SMS provider for everything, but the major SaaS SMS provider doesn’t operate everywhere in Africa and the cost of using them everywhere would be prohibitive3. The earlier comment on how the compensation cost of engineers dominates the cost of our systems wouldn’t be true if we used a SaaS SMS provider for all of our SMS needs; the team that provides telecom integrations pays for itself many times over.

By keeping our application architecture as simple as possible, we can spend our complexity (and headcount) budget in places where there’s complexity that it benefits our business to take on. Taking the idea of doing things as simply as possible unless there’s a strong reason to add complexity has allowed us to build a fairly large business with not all that many engineers despite running an African finance business, which is generally believed to be a tough business to get into; we’ll discuss this in a future post (one of our earliest and most helpful advisers, who gave us advice that was critical to Wave’s success, initially suggested that Wave was a bad business idea and that the founders should pick another one because he foresaw so many potential difficulties).

Thanks to Ben Kuhn, Sierra Rotimi-Williams, June Seif, Kamal Marhubi, Ruthie Byers, Lincoln Quirk, Calum Ball, John Hergenroeder, Bill Mill, Sophia Wisdom, and Finbarr Timbers for comments/corrections/discussion.


  1. If you want to compute a ratio, we had closer to 40 engineers when we last fundraised and were valued at $1.7B. [return]
  2. There are business models for which this wouldn't be true, e.g., if we were an ad-supported social media company, the level of traffic we'd need to support our company as it grows would be large enough that we'd incur a significant financial cost if we didn't spend a significant fraction of our engineering time on optimization and cost reduction work. But, as a company that charges real money for a significant fraction of interactions with an app, our computational load per unit of revenue is very low compared to a social media company and it's likely that this will be a minor concern for us until we're well over an order of magnitude larger than we are now; it's not even clear that this would be a major concern if we were two orders of magnitude larger, although it would definitely be a concern at three orders of magnitude growth. [return]
  3. Despite the classic advice about how one shouldn’t compete on price, we (among many other things) do compete on price and therefore must care about costs. We’ve driven down the cost of mobile money in Africa and our competitors have had to slash their prices to match our prices, which we view as a positive value for the world. [return]

Why is it so hard to buy things that work well?

2022-03-14 08:00:00

There's a cocktail party version of the efficient markets hypothesis I frequently hear that's basically, "markets enforce efficiency, so it's not possible that a company can have some major inefficiency and survive". We've previously discussed Marc Andreessen's quote that tech hiring can't be inefficient here and here:

Let's launch right into it. I think the critique that Silicon Valley companies are deliberately, systematically discriminatory is incorrect, and there are two reasons to believe that that's the case. ... No. 2, our companies are desperate for talent. Desperate. Our companies are dying for talent. They're like lying on the beach gasping because they can't get enough talented people in for these jobs. The motivation to go find talent wherever it is unbelievably high.

Variants of this idea that I frequently hear engineers and VCs repeat involve companies being efficient and/or products being basically as good as possible because, if it were possible for them to be better, someone would've outcompeted them and done it already1.

There's a vague plausibility to that kind of statement, which is why it's a debate I've often heard come up in casual conversation, where one person will point out some obvious company inefficiency or product error and someone else will respond that, if it's so obvious, someone at the company would have fixed the issue or another company would've come along and won based on being more efficient or better. Talking purely abstractly, it's hard to settle the debate, but things are clearer if we look at some specifics, as in the two examples above about hiring, where we can observe that, whatever abstract arguments people make, inefficiencies persisted for decades.

When it comes to buying products and services, at a personal level, most people I know who've checked the work of people they've hired for things like home renovation or accounting have found grievous errors in the work. Although it's possible to find people who don't do shoddy work, it's generally difficult for someone who isn't an expert in the field to determine if someone is going to do shoddy work in the field. You can try to get better quality by paying more, but once you get out of the very bottom end of the market, it's frequently unclear how to trade money for quality, e.g., my friends and colleagues who've gone with large, brand name, accounting firms have paid much more than people who go with small, local, accountants and gotten a higher error rate; as a strategy, trying expensive local accountants hasn't really fared much better. The good accountants are typically somewhat expensive, but they're generally not charging the highest rates and only a small percentage of somewhat expensive accountants are good.

More generally, in many markets, consumers are uninformed and it's fairly difficult to figure out which products are even half decent, let alone good. When people happen to choose a product or service that's right for them, it's often for the wrong reasons. For example, in my social circles, there have been two waves of people migrating from iPhones to Android phones over the past few years. Both waves happened due to Apple PR snafus which caused a lot of people to think that iPhones were terrible at something when, in fact, they were better at that thing than Android phones. Luckily, iPhones aren't strictly superior to Android phones and many people who switched got a device that was better for them because they were previously using an iPhone due to good Apple PR, causing their errors to cancel out. But, when people are mostly making decisions off of marketing and PR and don't have access to good information, there's no particular reason to think that a product being generally better or even strictly superior will result in that product winning and the worse product losing. In capital markets, it doesn't take all that many informed participants for some form of the efficient market hypothesis to hold, ensuring that "prices reflect all available information". It's a truism that published results about market inefficiencies stop being true the moment they're published because people exploit the inefficiency until it disappears. But with the job market examples, even though firms can take advantage of mispriced labor, as Greenspan famously did before becoming Chairman of the Fed, inefficiencies can persist:

Townsend-Greenspan was unusual for an economics firm in that the men worked for the women (we had about twenty-five employees in all). My hiring of women economists was not motivated by women's liberation. It just made great business sense. I valued men and women equally, and found that because other employers did not, good women economists were less expensive than men. Hiring women . . . gave Townsend-Greenspan higher-quality work for the same money . . .

But as we also saw, individual firms exploiting mispriced labor have a limited demand for labor and inefficiencies can persist for decades because the firms that are acting on "all available information" don't buy enough labor to move the price of mispriced people to where it would be if most or all firms were acting rationally.

In the abstract, it seems that, with products and services, inefficiencies should also be able to persist for a long time since, similarly, there also isn't a mechanism that allows actors in the system to exploit the inefficiency in a way that directly converts money into more money, and sometimes there isn't really even a mechanism to make almost any money at all. For example, if you observe that it's silly for people to move from iPhones to Android phones because they think that Apple is engaging in nefarious planned obsolescence when Android devices generally become obsolete more quickly, due to a combination of iPhones getting updates for longer and iPhones being faster at every price point they compete at, allowing the phone to be used on bloated sites for longer, you can't really make money off of this observation. This is unlike a mispriced asset that you can buy derivatives of to make money (in expectation).

A common suggestion to the problem of not knowing what product or service is good is to ask an expert in the field or a credentialed person, but this often fails as well. For example, a friend of mine had trouble sleeping because his window air conditioner was loud and would wake him up when it turned on. He asked a trusted friend of his who works on air conditioners if this could be improved by getting a newer air conditioner and his friend said "no; air conditioners are basically all the same". But any consumer who's compared items with motors in them would immediately know that this is false. Engineers have gotten much better at producing quieter devices when holding power and cost constant. My friend eventually bought a newer, quieter, air conditioner, which solved his sleep problem, but he had the problem for longer than he needed to because he assumed that someone whose job it is to work on air conditioners would give him non-terrible advice about air conditioners. If my friend were an expert on air conditioners or had compared the noise levels of otherwise comparable consumer products over time, he could've figured out that he shouldn't trust his friend, but if he had that level of expertise, he wouldn't have needed advice in the first place.

So far, we've looked at the difficulty of getting the right product or service at a personal level, but this problem also exists at the firm level and is often worse because the markets tend to be thinner, with fewer products available as well as opaque, "call us" pricing. Some commonly repeated advice is that firms should focus on their "core competencies" and outsource everything else (e.g., Joel Spolsky, Gene Kim, Will Larson, Camille Fournier, etc., all say this), but if we look at mid-sized tech companies, we can see that they often need to have in-house expertise that's far outside what anyone would consider their core competency unless, e.g., every social media company has kernel expertise as a core competency. In principle, firms can outsource this kind of work, but people I know who've relied on outsourcing, e.g., kernel expertise to consultants or application engineers on a support contract, have been very unhappy with the results compared to what they can get by hiring dedicated engineers, both in absolute terms (support frequently doesn't come up with a satisfactory resolution in weeks or months, even when it's an issue a good engineer could solve in days) and for the money (despite engineers being expensive, large support contracts can often cost more than an engineer while delivering worse service than an engineer).

This problem exists not only for support but also for products a company could buy instead of build. For example, Ben Kuhn, the CTO of Wave, has a Twitter thread about some of the issues we've run into at Wave, with a couple of followups. Ben now believes that one of the big mistakes he made as CTO was not putting much more effort into vendor selection, even when the decision appeared to be a slam dunk, and more strongly considering moving many systems to custom in-house versions sooner. Even after selecting the consensus best product in the space from the leading (as in largest and most respected) firm, and using the main offering the company has, the product often not only doesn't work but, by design, can't work.

For example, we tried "buy" instead of "build" for a product that syncs data from Postgres to Snowflake. Syncing from Postgres is the main offering (as in the offering with the most customers) from a leading data sync company, and we found that it would lose data, duplicate data, and corrupt data. After digging into it, it turns out that the product has a design that, among other issues, relies on the data source being able to seek backwards on its changelog. But Postgres throws changelogs away once they're consumed, so the Postgres data source can't support this operation. When their product attempts to do this and the operation fails, we end up with the sync getting "stuck", which requires manual intervention from the vendor's operator and/or results in data loss. Since our data is still on Postgres, it's possible to recover from this by doing a full resync, but the data sync product tops out at 5MB/s for reasons that appear to be unknown to them, so a full resync can take days even on databases that aren't all that large. Resyncs will also silently drop and corrupt data, so multiple cycles of full resyncs followed by data integrity checks are sometimes necessary to recover from data corruption, which can take weeks. Despite being widely recommended and the leading product in the space, the product has a number of major design flaws that mean that it literally cannot work.
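
The Postgres side of this is easy to see with logical decoding, which is how change-data-capture products typically consume the changelog: once a consumer reads changes from a replication slot with the "get" variant, Postgres is free to recycle that WAL and there's no interface for seeking backwards to re-read it. Here's a small sketch (this assumes a local Postgres with wal_level=logical, replication privileges, and the psycopg2 driver; the slot name and connection string are made up, and it illustrates Postgres's behavior, not the vendor's internals):

    import psycopg2

    # Assumes a local Postgres with wal_level = logical and replication privileges.
    conn = psycopg2.connect("dbname=test")
    conn.autocommit = True
    cur = conn.cursor()

    # Create a logical replication slot using the built-in test_decoding plugin.
    cur.execute("SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding')")

    cur.execute("CREATE TABLE IF NOT EXISTS t (x int)")
    cur.execute("INSERT INTO t VALUES (1)")

    # peek: read pending changes without consuming them; can be repeated.
    cur.execute("SELECT data FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL)")
    print(cur.fetchall())

    # get: read and consume the changes. The slot's position advances and the
    # WAL behind it becomes eligible for removal; there is no way to seek
    # backwards, so re-running this returns nothing for rows already consumed.
    cur.execute("SELECT data FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL)")
    print(cur.fetchall())

    cur.execute("SELECT pg_drop_replication_slot('demo_slot')")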

This isn't so different from Mongo or other products that had fundamental design flaws that caused severe data loss, with the main difference being that, in most areas, there isn't a Kyle Kingsbury who spends years publishing tests on various products in the field, patiently responding to bogus claims about correctness until the PR backlash caused companies in the field to start taking correctness seriously. Without that pressure, most software products basically don't work, hence the Twitter threads from Ben, above, where he notes that the "buy" solutions you might want to choose mostly don't work2. Of course, at our scale, there are many things we're not going to build any time soon, like CPUs, but, for many things where the received wisdom is to "buy", "build" seems like a reasonable option. This is even true for larger companies and building CPUs. Fifteen years ago, high-performance (as in, non-embedded level of performance) CPUs were a canonical example of something it would be considered bonkers to build in-house, absurd for even the largest software companies, but Apple and Amazon have been able to produce best-in-class CPUs on the dimensions they're optimizing for, for predictable reasons3.

This isn't just an issue that impacts tech companies; we see this across many different industries. For example, any company that wants to mail items to customers has to either implement shipping themselves or deal with the fallout of having unreliable shipping. As a user, whether or not packages get shipped to you depends a lot on where you live and what kind of building you live in.

When I've lived in a house, packages have usually arrived regardless of the shipper (although they've often arrived late). But, since moving into apartment buildings, some buildings just don't get deliveries from certain delivery services. Once, I lived in a building where the postal service didn't deliver mail properly and I didn't get a lot of mail (although I frequently got mail addressed to other people in the building as well as people elsewhere). More commonly, UPS and Fedex usually won't attempt to deliver and will just put a bunch of notices up on the building door for all the packages they didn't deliver, where the notice falsely indicates that the person wasn't home and correctly indicates that the person has to go to some pick-up location to get the package.

For a while, I lived in a city where Amazon used 3rd-party commercial courier services to do last-mile shipping for same-day delivery. The services they used were famous for marking things as delivered without delivering the item for days, making "same day" shipping slower than next day or even two day shipping. Once, I naively contacted Amazon support because my package had been marked as delivered but wasn't delivered. Support, using a standard script supplied to them by Amazon, told me that I should contact them again three days after the package was marked as delivered because couriers often mark packages as delivered without delivering them, but they often deliver the package within a few days. Amazon knew that the courier service they were using didn't really even try to deliver packages4 promptly and the only short-term mitigation available to them was to tell support to tell people that they shouldn't expect that packages have arrived when they've been marked as delivered.

Amazon eventually solved this problem by having their own delivery people or using, by commercial shipping standards, an extremely expensive service (as Apple has done for same-day delivery)5. At scale, there's no commercial service you can pay for that will reliably attempt to deliver packages. If you want a service that actually works, you're generally on the hook for building it yourself, just like in the software world. My local grocery store tried to outsource this to DoorDash. I've tried delivery 3 times from my grocery store and my groceries have shown up 2 out of 3 times, which is well below what most people would consider an acceptable hit rate for grocery delivery. Having to build instead of buy to get reliability is a huge drag on productivity, especially for smaller companies (e.g., it's not possible for small shops that want to compete with Amazon and mail products to customers to have reliable delivery since they can't build out their own delivery service).

The amount of waste generated by the inability to farm out services is staggering and I've seen it everywhere I've worked. An example from another industry: when I worked at a small chip startup, we had in-house capability to do end-to-end chip processing (with the exception of owning our own fabs), which is unusual for a small chip startup. When the first wafer of a new design came off of a fab, we'd have the wafer flown to us on a flight, at which point someone would use a wafer saw to cut the wafer into individual chips so we could start testing ASAP. This was often considered absurd in the same way that it would be considered absurd for a small software startup to manage its own on-prem hardware. After all, the wafer saw and the expertise necessary to go from a wafer to a working chip will be idle over 99% of the time. Having full-time equipment and expertise that you use less than 1% of the time is a classic example of the kind of thing you should outsource, but if you price out having people competent to do this plus having the equipment available to do it, even at fairly low volumes, it's cheaper to do it in-house even if the equipment and expertise for it are idle 99% of the time. More importantly, you'll get much better service (faster turnaround) in house, letting you ship at a higher cadence. I've both worked at companies that have tried to contract this kind of thing out as well as talked with many people who've done that and you get slower, less reliable, service at a higher cost.

Likewise with chip software tooling; despite it being standard to outsource tooling to large EDA vendors, we got a lot of mileage out of using our own custom tools, generally created or maintained by one person, e.g., while I was there, most simulator cycles were run on a custom simulator that was maintained by one person, which saved millions a year in simulator costs (standard pricing for a simulator at the time was a few thousand dollars per license per year and we had a farm of about a thousand simulation machines). You might think that, if a single person can create or maintain a tool that's worth millions of dollars a year to the company, our competitors would do the same thing, just like you might think that if you can ship faster and at a lower cost by hiring a person who knows how to crack a wafer open, our competitors would do that, but they mostly didn't.

Joel Spolsky has an old post where he says:

“Find the dependencies — and eliminate them.” When you're working on a really, really good team with great programmers, everybody else's code, frankly, is bug-infested garbage, and nobody else knows how to ship on time.

We had a similar attitude, although I'd say that we were a bit more humble. We didn't think that everyone else was producing garbage, but we also didn't assume that we couldn't produce something comparable to what we could buy for a tenth of the cost. From talking to folks at some competitors, there was a pretty big cultural difference between how we operated and how they operated. It simply didn't occur to them that they didn't have to buy into the standard American business logic that you should focus on your core competencies, and that they could think through whether or not it makes sense to do something in-house on the merits of the particular thing instead of outsourcing their thinking to a pithy saying.

I once watched, from the inside, a company undergo this cultural shift. A few people in leadership decided that the company should focus on its core competencies, which meant abandoning custom software for infrastructure. This resulted in quite a few large migrations from custom internal software to SaaS solutions and open source software. If you watched the discussions on "why" various projects should or shouldn't migrate, there were a few unusually unreasonable people who tried to reason through particular cases on the merits of each case (in a post on pushing back against orders from the top, Yossi Kreinin calls these people insane employees; I'm going to refer to the same concept in this post, but instead call people who do this unusually unreasonable). But, for the most part, people bought the party line and pushed for a migration regardless of the specifics.

The thing that I thought was interesting was that leadership didn't tell particular teams they had to migrate and there weren't really negative consequences for teams where an "unusually unreasonable person" pushed back in order to keep running an existing system for reasonable reasons. Instead, people mostly bought into the idea and tried to justify migrations for vaguely plausible sounding reasons that weren't connected to reality, resulting in funny outcomes like moving to an open source system "to save money" when the new system was quite obviously less efficient6 and, predictably, required much higher capex and opex. The cost savings was supposed to come from shrinking the team, but the increase in operational cost dominated the change in the cost of the team and the complexity of operating the system meant that the team size increased instead of decreasing. There were a number of cases where it really did make sense to migrate, but the stated reasons for migration tended to be unrelated or weakly related to the reasons it actually made sense to migrate. Once people absorbed the idea that the company should focus on core competencies, the migrations were driven by the cultural idea and not any technical reasons.

The pervasiveness of decisions like the above, technical decisions made without serious technical consideration, is a major reason that the selection pressure on companies to make good products is so weak. There is some pressure, but it's noisy enough that successful companies often route around making a product that works, like in the Mongo example from above, where Mongo's decision to loudly repeat demonstrably bogus performance claims and make demonstrably false correctness claims was, from a business standpoint, superior to focusing on actual correctness and performance; by focusing their resources where it mattered for the business, they managed to outcompete companies that made the mistake of devoting serious resources to performance and correctness.

Yossi's post about how an unusually unreasonable person can have outsized impact in a dimension they value at their firm also applies to impact outside of a firm. Kyle Kingsbury, mentioned above, is an example of this. At the rates that I've heard Jepsen is charging now, Kyle can bring in what a senior developer at BigCo does (actually senior, not someone with the title "senior"), but that was after years of working long hours at below market rates on an uncertain endeavour, refuting FUD from his critics (if you read the replies to the linked posts or, worse yet, the actual tickets where he's involved in discussions with developers, the replies to Kyle were a constant stream of nonsense for many years, including people working for vendors feeling like he has it out for them in particular, casting aspersions on his character7, and generally trashing him). I have a deep respect for people who are willing to push on issues like this despite the system being aligned against them but, my respect notwithstanding, basically no one is going to do that. A system that requires someone like Kyle to take a stand before successful firms will put effort into correctness instead of correctness marketing is going to produce a lot of products that are good at marketing correctness without really having decent correctness properties (such as the data sync product mentioned in this post, whose website repeatedly mentions how reliable and safe the syncing product is despite having a design that is fundamentally broken).

It's also true at the firm level that it often takes an unusually unreasonable firm to produce a really great product instead of just one that's marketed as great, e.g., Volvo, the one car manufacturer that seemed to try to produce a level of structural safety beyond what could be demonstrated by IIHS tests, fared so poorly as a business that it's been forced to move upmarket and become a niche, luxury, automaker since safety isn't something consumers are really interested in despite car accidents being a leading cause of death and a significant source of life expectancy loss. And it's not clear that Volvo will be able to persist in being an unreasonable firm since they weren't able to survive as an independent automaker. When Ford acquired Volvo, Ford started moving Volvos to the shared Ford C1 platform, which didn't fare particularly well in crash tests. Since Geely has acquired Volvo, it's too early to tell for sure if they'll maintain Volvo's commitment to designing for real-world crash data and not just crash data that gets reported in benchmarks. If Geely declines to continue Volvo's commitment to structural safety, it may not be possible to buy a modern car that's designed to be safe.

Most markets are like this, except that there was never an unreasonable firm like Volvo in the first place. On unreasonable employees, Yossi says

Who can, and sometimes does, un-rot the fish from the bottom? An insane employee. Someone who finds the forks, crashes, etc. a personal offence, and will repeatedly risk annoying management by fighting to stop these things. Especially someone who spends their own political capital, hard earned doing things management truly values, on doing work they don't truly value – such a person can keep fighting for a long time. Some people manage to make a career out of it by persisting until management truly changes their mind and rewards them. Whatever the odds of that, the average person cannot comprehend the motivation of someone attempting such a feat.

It's rare that people are willing to expend a significant amount of personal capital to do the right thing, whatever that means to someone, but it's even rarer that the leadership of a firm will make that choice and spend down the firm's capital to do the right thing.

Economists have a term for cases where information asymmetry means that buyers can't tell the difference between good products and "lemons", "a market for lemons", like the car market (where the term lemons comes from), or both sides of the hiring market. In economic discourse, there's a debate over whether cars are a market for lemons at all for a variety of reasons (lemon laws, which allow people to return bad cars, don't appear to have changed how the market operates, very few modern cars are lemons when that's defined as a vehicle with serious reliability problems, etc.). But looking at whether or not people occasionally buy a defective car is missing the forest for the trees. There's maybe one car manufacturer that really seriously tries to make a structurally safe car beyond what standards bodies test (and word on the street is that they skimp on the increasingly important software testing side of things) because consumers can't tell the difference between a more or less safe car beyond the level a few standards bodies test to. That's a market for lemons, as is nearly every other consumer and B2B market.

Appendix: culture

Something I find interesting about American society is how many people think that someone who gets the raw end of a deal because they failed to protect themselves against every contingency "deserves" what happened (orgs that want to be highly effective often avoid this by having a "blameless" culture, but very few people have exposure to such a culture).

Some places I've seen this recently:

If you read these kinds of discussions, you'll often see people claiming "that's just how the world is" and going further and saying that there is no other way the world could be, so anyone who isn't prepared for that is an idiot.

Going back to the laptop theft example, anyone who's traveled, or even read about other cultures, can observe that the things that North Americans think are basically immutable consequences of a large-scale society are arbitrary. For example, if you leave your bag and laptop on a table at a cafe in Korea and come back hours later, the bag and laptop are overwhelmingly likely to be there; I've heard this is true in Japan as well. While it's rude to take up a table like that, you're not likely to have your bag and laptop stolen.

And, in fact, if you tweak the context slightly, this is basically true in America. It's not much harder to walk into an empty house and steal things out of the house (it's fairly easy to learn how to pick locks and even easier to just break a window) than it is to steal things out of a cafe. And yet, in most neighbourhoods in America, people are rarely burglarized and when someone posts about being burglarized, they're not excoriated for being a moron for not having kept an eye on their house. Instead, people are mostly sympathetic. It's considered normal to have unattended property stolen in public spaces and not in private spaces, but that's more of a cultural distinction than a technical distinction.

There's a related set of stories Avery Pennarun tells about the culture shock of being an American in Korea. One of them is about some online ordering service you can use that's sort of like Amazon. With Amazon, when you order something, you get a box with multiple bar/QR/other codes on it and, when you open it up, there's another box inside that has at least one other code on it. Of course the other box needs the barcode because it's being shipped through some facility at-scale where no one knows what the box is or where it needs to go and the inner box also had to go through some other kind of process and it also needs to be able to be scanned by a checkout machine if the item is sold at a retailer. Inside the inner box is the item. If you need to return the item, you put the item back into its barcoded box and then put that box into the shipping box and then slap another barcode onto the shipping box and then mail it out.

So, in Korea, there's some service like Amazon where you can order an item and, an hour or two later, you'll hear a knock at your door. When you get to the door, you'll see an unlabeled box or bag and the item is in the unlabeled container. If you want to return the item, you "tell" the app that you want to return the item, put it back into its container, put it in front of your door, and they'll take it back. After seeing this shipping setup, which is wildly different from what you see in the U.S., he asked someone "how is it possible that they don't lose track of which box is which?". The answer he got was, "why would they lose track of which box is which?". His other stories have a similar feel, where he describes something quite alien, asks a local how things can work in this alien way, and the local, who can't imagine things working any other way, responds with "why would X not work?"

As with the laptop in cafe example, a lot of Avery's stories come down to how there are completely different shared cultural expectations around how people and organizations can work.

Another example of this is with covid. Many of my friends have spent most of the last couple of years in Asian countries like Vietnam or Taiwan, which have had much lower covid rates, so much so that they were barely locked down at all. My friends in those countries were basically able to live normal lives, as if covid didn't exist at all (at least until the latest variants, at which point they were vaccinated and at relatively low risk for the most serious outcomes), while taking basically zero risk of getting covid.

In most western countries, initial public opinion among many people was that locking down was pointless and there was nothing we could do to prevent an explosion of covid. Multiple engineers I know, who understand exponential growth and knew what the implications were, continued normal activities before lockdown and got and (probably) spread covid. When lockdowns were implemented, there was tremendous pressure to lift them as early as possible, resulting in something resembling the "adaptive response" diagram from this post. Since then, many people (I have a project tallying up public opinion on this that I'm not sure I'll ever prioritize enough to complete) have changed their opinion to "having ever locked down was stupid, we were always going to end up with endemic covid, all of this economic damage was pointless". If we look at in-person retail sales data or restaurant data, we can easily see that many people were voluntarily limiting their activities before and after lockdowns in the first year or so of the pandemic when the virus was in broad circulation.

Meanwhile, in some Asian countries, like Taiwan and Vietnam, people mostly complied with lockdowns when they were instituted, which means that they were able to squash covid in the country when outbreaks happened until relatively recently, when covid mutated into forms that spread much more easily and people's tolerance for covid risk went way up due to vaccinations. Of course, covid kept getting reintroduced into countries that were able to squash it because other countries were not, in large part due to the self-fulfilling belief that it would be impossible to squash covid.

Coming back to when it makes sense to bring something in-house, even in cases where it superficially sounds like it shouldn't, because the expertise is 99% idle or a single person would have to be able to build software that a single firm would pay millions of dollars a year for, much of this comes down to whether or not you're in a culture where you can trust another firm's promise. If you operate in a society where it's expected that other firms will push you to the letter of the law with respect to whatever contract you've negotiated, it's frequently not worth the effort to negotiate a contract that would give you service even half as good as you'd get from someone in house. If you look at how these contracts end up being worded, companies often try to sneak in terms that make the contract meaningless, and even when you manage to stamp out all of that, legally enforcing the contract is expensive and, in the cases I know of where companies regularly violated their agreement for their support SLA (just for example), the resolution was to terminate the contract rather than pursue legal action because the cost of legal action wouldn't be worth anything that could be gained.

If you can't trust other firms, you frequently don't have a choice with respect to bringing things in house if you want them to work.

Although this is really a topic for another post, I'll note that lack of trust that exists across companies can also hamstring companies when it exists internally. As we discussed previously, a lot of larger scale brokenness also comes out of the cultural expectations within organizations. A specific example of this that leads to pervasive organizational problems is lack of trust within the organization. For example, a while back, I was griping to a director that a VP broke a promise and that we were losing a lot of people for similar reasons. The director's response was "there's no way the VP made a promise". When I asked for clarification, the clarification was "unless you get it in a contract, it wasn't a promise", i.e., the rate at which VPs at the company lie is high enough that a verbal commitment from a VP is worthless; only a legally binding commitment that allows you to take them to court has any meaning.

Of course, that's absurd, in that no one could operate at a BigCo while going around and asking for contracts for all their promises since they'd immediately be considered some kind of hyperbureaucratic weirdo. But let's take the spirit of the comment seriously: only trust people close to you. That's good advice at the company I worked for but, unfortunately for the company, the implications are similar to the inter-firm example, where we noted that a norm where you need to litigate the letter of the law is expensive enough that firms often bring expertise in house to avoid having to deal with the details. In the intra-firm case, you'll often see teams and orgs "empire build" because they know that, at least at the management level, they can't trust anyone outside their fiefdom.

While this intra-firm lack of trust tends to be less costly than the inter-firm lack of trust since there are better levers to get action on an organization that's the cause of a major blocker, it's still fairly costly. Virtually all of the VPs and BigCo tech execs I've talked to are so steeped in the culture they're embedded in that they can't conceive of an alternative, but there isn't an inherent reason that organizations have to work like that. I've worked at two companies where people actually trust leadership and leadership does generally follow through on commitments even when you can't take them to court, including my current employer, Wave. But, at the other companies, the shared expectation that leadership cannot and should not be trusted "causes" the people who end up in leadership roles to be untrustworthy, which results in the inefficiencies we've just discussed.

People often think that having a high degree of internal distrust is inevitable as a company scales, but people I've talked to who were in upper management or fairly close to the top of Intel and Google said that the companies had an extended time period where leadership enforced trustworthiness and that stamping out dishonesty and "bad politics" was a major reason the company was so successful, under Andy Grove and Eric Schmidt, respectively. When the person at the top changed and a new person who didn't enforce honesty came in, the standard cultural norms that you see at the upper levels of most big companies seeped in, but that wasn't inevitable.

When I talk to people who haven't been exposed to BigCo leadership culture and haven't seen how decisions are actually made, they often find the decision making processes to be unbelievable in much the same way that people who are steeped in BigCo leadership culture find the idea that a large company could operate any other way to be unbelievable.

It's often difficult to see how absurd a system is from the inside. Another perspective on this is that Americans often find Japanese universities and the work practices of Japanese engineering firms absurd, though often not as absurd as the promotion policies in Korean chaebols, which are famously nepotistic, e.g., Chung Mong-yong is the CEO of Hyundai Sungwoo because he's the son of Chung Soon-yung, who was the head of Hyundai Sungwoo because he was the younger brother of Chung Ju-yung, the founder of Hyundai Group (essentially the top-level Hyundai corporation), etc. But Japanese and Korean engineering firms are not, in general, less efficient than American engineering firms outside of the software industry despite practices that seem absurdly inefficient to American eyes. American firms didn't lose their dominance in multiple industries while being more efficient; if anything, market inefficiencies allowed them to hang on to marketshare much longer than you would naively expect if you just looked at the technical merit of their products.

There are offsetting inefficiencies in American firms that are just as absurd as effectively having familial succession of company leadership in Korean chaebols. It's just that the inefficiencies that come out of American cultural practices seem to be immutable facts about the world to people inside the system. But when you look at firms that have completely different cultures, it becomes clear that cultural norms aren't a law of nature.

Appendix: downsides of build

Of course, building instead of buying isn't a panacea. I've frequently seen internal designs that are just as broken as the data sync product described in this post. In general, when you see a design like that, a decent number of people explained why the design can never work during the design phase and were ignored. Although "build" gives you a lot more control than "buy" and gives you better odds of a product that works because you can influence the design, a dysfunctional team in a dysfunctional org can quite easily make products that don't work.

There's a Steve Jobs quote that's about companies that also applies to teams:

It turns out the same thing can happen in technology companies that get monopolies, like IBM or Xerox. If you were a product person at IBM or Xerox, so you make a better copier or computer. So what? When you have monopoly market share, the company's not any more successful.

So the people that can make the company more successful are sales and marketing people, and they end up running the companies. And the product people get driven out of the decision making forums, and the companies forget what it means to make great products. The product sensibility and the product genius that brought them to that monopolistic position gets rotted out by people running these companies that have no conception of a good product versus a bad product.

They have no conception of the craftsmanship that's required to take a good idea and turn it into a good product. And they really have no feeling in their hearts, usually, about wanting to really help the customers.

For "efficiency" reasons, some large companies try to avoid duplicate effort and kill projects if they seem too similar to another project, giving the team that owns the canonical verison of a product a monopoly. If the company doesn't have a culture of trying to do the right thing, this has the same problems that Steve Jobs discusses, but at the team and org level instead of the company level.

The workaround a team I was on used was to basically re-implement a parallel stack of things we relied on that didn't work. But this was only possible because leadership didn't enforce basically anything. Ironically, this was despite their best efforts — leadership made a number of major attempts to impose top-down control, but they didn't understand how to influence an organization, so the attempts failed. Had leadership been successful, the company would've been significantly worse off. There are upsides to effective top-down direction when leadership has good plans, but that wasn't really on the table, so it's actually better that leadership didn't know how to execute.

Thanks to Fabian Giesen, Yossi Kreinen, Peter Bhat Harkins, Ben Kuhn, Laurie Tratt, John Hergenroeder, Tao L., @softminus, Justin Blank, @deadalnix, Dan Lew, @ollyrobot, Sophia Wisdom, Elizabeth Van Nostrand, Kevin Downey, and @PapuaHardyNet for comments/corrections/discussion.


  1. To some, that position is so absurd that it's not believable that anyone would hold that position (in response to my first post that featured the Andreessen quote, above, a number of people told me that it was an exaggerated straw man, which is impossible for a quote, let alone one that sums up a position I've heard quite a few times), but to others, it's an immutable fact about the world. [return]
  2. On the flip side, if we think about things from the vendor side of things, there's little incentive to produce working products since the combination of the fog of war plus making false claims about a product working seems to be roughly as good as making a working product (at least until someone like Kyle Kingsbury comes along, which never happens in most industries), and it's much cheaper.

    And, as Fabian Giesen points out, when vendors actually want to produce good or working products, the fog of war also makes that difficult:

    But producers have a dual problem, which is that all the signal you get from consumers is sporadic, infrequent and highly selected direct communication, as well as a continuous signal of how sales look over time, which is in general very hard to map back to why sales went up or down.

    You hear directly from people who are either very unhappy or very happy, and you might hear second-hand info from your salespeople, but often that's pure noise. E.g. with RAD products over the years a few times we had a prospective customer say, "well we would license it but we really need X" and we didn't have X. And if we heard that 2 or 3 times from different customers, we'd implement X and get back to them a few months later. More often than not, they'd then ask for Y next, and it would become clear over time that they just didn't want to license for some other reason and saying "we need X, it's a deal-breaker for us" for a couple choices of X was just how they chose to get out of the eval without sounding rude or whatever.

    In my experience that's a pretty thorny problem in general, once you spin something out or buy something you're crossing org boundaries and lose most of the ways you otherwise have to cut through the BS and figure out what's actually going on. And whatever communication does happen is often forced to go through a very noisy, low-bandwidth, low-fidelity, high-latency channel.

    [return]
  3. Note that even though it was somewhat predictable that a CPU design team at Apple or Amazon that was well funded had a good chance of being able to produce a best-in-class CPU (e.g., see this 2013 comment about the effectiveness of Apple's team and this 2015 comment about other mobile vendors) that would be a major advantage for their firm, this doesn't mean that the same team should've been expected to succeed if they tried to make a standalone business. In fact, Apple was able to buy their core team cheaply because the team, after many years at DEC and then successfully founding SiByte, founded PA Semi, which basically failed as a business. Similarly, Amazon's initial big silicon hires were from Annapurna (also a failed business that was up for sale because it couldn't survive independently) and Smooth Stone (a startup that failed so badly that it didn't even need to be acquired and people could be picked up individually). Even when there's an obvious market opportunity, factors like network effects, high fixed costs, up front capital expenditures, the ability of incumbent players to use market power to suppress new competitors, etc., can and often do prevent anyone from taking the opportunity. Even though we can now clearly see that there were large opportunities available for the taking, there's every reason to believe that, based on the fates of many other CPU startups to date, an independent startup that attempted to implement the same ideas wouldn't have been nearly as successful and would most likely have gone bankrupt or taken a low offer relative to the company's value due to the company's poor business prospects.

    Also, before Amazon started shipping ARM server chips, the most promising ARM server chip, which had pre-orders from at least one major tech company, was killed because it was on the wrong side of an internal political battle.

    The chip situation isn't so different from the motivating example we looked at in our last post, baseball scouting, where many people observed that baseball teams were ignoring simple statistics they could use to their advantage. But, none of the people observing that were in a position to run a baseball team for decades, allowing the market opportunity to persist for decades.

    [return]
  4. Something that amuses me is how some package delivery services appear to apply relatively little effort to make sure that someone even made an attempt to deliver the package. When packages are marked delivered, there's generally a note about how it was delivered, which is frequently quite obviously wrong for the building, e.g., "left with receptionist" for a building with no receptionist or "left on porch" for an office building with no porch and a receptionist who was there during the alleged delivery time. You could imagine services would, like Amazon, request a photo along with "proof of delivery" or perhaps use GPS to check that the driver was plausibly at least in the same neighborhood as the building at the time of delivery, but they generally don't seem to do that?

    I'd guess that a lot of the fake deliveries come from having some kind of quota, one that's difficult or impossible to achieve, combined with weak attempts at verifying that a delivery was done or even attempted.

    [return]
  5. When I say they solved it, I mean that Amazon delivery drivers actually try to deliver the package maybe 95% of the time to the apartment buildings I've lived in, vs. about 25% for UPS and Fedex and much lower for USPS and Canada Post, if we're talking about big packages and not letters. [return]
  6. Very fittingly for this post, I saw an external discussion on this exact thing where someone commented that it must've been quite expensive for the company to switch to the new system due to its known inefficiencies.

    In true cocktail party efficient markets hypothesis form, an internet commenter replied that the company wouldn't have done it if it was inefficient and therefore it must not have been as inefficient as the first commenter thought.

    I suspect I spent more time looking at software TCO than anyone else at the company and the system under discussion was notable for having one of the largest increases in cost of any system at the company without a concomitant increase in load. Unfortunately, the assumption that competition results in good internal decisions is just as false as the assumption that competition results in good external decisions.

    [return]
  7. Note that if you click the link but don't click through to the main article, the person defending Kyle made the original quote seem more benign than it really is out of politeness because he elided the bit where the former Redis developer advocate (now "VP of community" for Zig) said that Jepsen is "ultimately not that different from other tech companies, and thus well deserving of boogers and cum". [return]

Misidentifying talent

2022-02-21 08:00:00

Here are some notes from talent scouts:

  • Recruit A:
    • ... will be a real specimen with chance to have a Dave Parker body. Facially looks like Leon Wagner. Good body flexibility. Very large hands.
  • Recruit B:
    • Outstanding physical specimen – big athletic frame with broad shoulders and long, solid arms and leg. Good bounce in his step and above avg body control. Good strong face.
  • Recruit C:
    • Hi butt, longish arms & legs, leanish torso, young colt
    • [different scout]: Wiry loose good agility with good face
    • [another scout]: Athletic looking body, loose, rangy, slightly bow legged.

Out of context, you might think they were scouting actors or models, but these are baseball players ("A" is Lloyd Moseby, "B" is Jim Abbott, and "C" is Derek Jeter), ones that were quite good (Lloyd Moseby was arguably only a very good player for perhaps four years, but that makes him extraordinary compared to most players who are scouted). If you read other baseball scouting reports, you'll see a lot of comments about how someone has a "good face", who they look like, what their butt looks like, etc.

Basically everyone wants to hire talented folks. But even in baseball, where returns to hiring talent are obvious and high and which is the most easily quantified major U.S. sport, people made fairly obvious blunders for a century due to relying on incorrectly honed gut feelings that relied heavily on unconscious as well as conscious biases. Later, we'll look at what baseball hiring means for other fields, but first, let's look at how players who didn't really pan out ended up with similar scouting reports (programmers who don't care about sports can think of this as equivalent to interview feedback) as future superstars, such as the following comments on Adam Eaton, who was a poor player by pro standards despite being considered one of the hottest prospects (potential hires) of his generation:

  • Scout 1: Medium frame/compact/firm. A very good athlete / shows quick "cat-like" reactions. Excellent overall body strength. Medium hands / medium length arms / w strong forearms ... Player is a tough competitor. This guy has some old fashioned bull-dog in his make-up.
  • Scout 2: Good body with frame to develop. Long arms and big hands. Narrow face. Has sideburns and wears hat military style. Slope shoulders. Strong inlegs ... Also played basketball. Good athlete .... Attitude is excellent. Can't see him breaking down. One of the top HS pitchers in the country
  • Scout 3: 6'1"-6'2" 180 solid upper and lower half. Room to pack another 15 without hurting

On the flip side, scouts would also pan players who would later turn out to be great based on their physical appearance, such as these scouts who were concerned about Albert Pujols's weight:

  • Scout 1: Heavy, bulky body. Extra (weight) on lower half. Future (weight) problem. Aggressive hitter with mistake HR power. Tends to be a hacker.
  • Scout 2: Good bat (speed) with very strong hands. Competes well and battles at the plate. Contact seems fair. Swing gets a little long at times. Will over pull. He did not hit the ball hard to CF or RF. Weight will become an issue in time.

Pujols ended up becoming one of the best baseball players of all time (currently ranked 32nd by WAR). His weight wasn't a problem, but if you read scouting reports on other great players who were heavy or short, they were frequently underrated. Of course, baseball scouting reports didn't only look at people's appearances, but scouts were generally highly biased by what they thought an athlete should look like.

Because using stats in baseball has "won" (top teams all employ stables of statisticians nowadays) and "old school" folks don't want to admit this, we often see people saying that using stats doesn't really result in different outcomes than we used to get. But this is so untrue that the examples people give are generally self-refuting. For example, here's what Sports Illustrated had to say on the matter:

Media and Internet draft prognosticators love to play up the “scrappy little battler” aspect with Madrigal, claiming that modern sabermetrics helps scouts include smaller players that were earlier overlooked. Of course, that is hogwash. A players [sic] abilities dictate his appeal to scouts—not height or bulk—and smaller, shorter players have always been a staple of baseball-from Mel Ott to Joe Morgan to Kirby Puckett to Jose Altuve.

These are curious examples to use in support of scouting since Kirby Puckett was famously overlooked by scouts despite putting up statistically dominant performances and was only able to become a baseball player through random happenstance, when the assistant director of the Twins farm system went to watch his own son play in a baseball game and saw Kirby Puckett in the same game, which led to the Twins drafting Kirby Puckett, who carried the franchise for a decade.

Joe Morgan was also famously overlooked and only managed to become a professional baseball player through random happenstance. Morgan put up statistically dominant numbers in high school, but was ignored due to his height. Because he wasn't drafted by a pro team, he went to Oakland City College, where he once again put up great numbers that were ignored. The reason a team noticed him was a combination of two coincidences. First, a new baseball team was created and that new team needed to fill a team and the associated farm system, which meant that they needed a lot of players. Second, that new baseball team needed to hire scouts and hired Bill Wight (who wasn't previously working as a scout) as a scout. Wight became known for not having the same appearance bias as nearly every other scout and was made fun of for signing "funny looking" baseball players. Bill convinced the new baseball team to "hire" quite a few overlooked players, including Joe Morgan.

Mel Ott was also famously overlooked and only managed to become a professional baseball player through happenstance. He was so dominant in high school that he played for adult semi-pro teams in his spare time. However, when he graduated, pro baseball teams didn't want him because he was too small, so he took a job at a lumber company and played for the company team. The owner of the lumber company was impressed by his baseball skills and, luckily for Ott, the owner of the lumber company was business partners and friends with the owner of a baseball team and effectively got Ott a position on a pro baseball team, resulting in the 20th best baseball career of all time as ranked by WAR1. Most short baseball players probably didn't get a random lucky break; for every one who did, there are likely many who didn't. If we look at how many nearly-ignored-but-lucky players put up numbers that made them all-time greats, it seems likely that the vast majority of the potentially greatest players of all time who played amateur or semi-pro baseball were ignored and did not play professional baseball (if this seems implausible, when reading the upcoming sections on chess, go, and shogi, consider what would happen if you removed all of the players who don't look like they should be great based on what people think makes someone cognitively skilled at major tech companies, and then look at what fraction of all-time-greats remain).

Deciding who to "hire" for a baseball team was a high stakes decision with many millions of dollars (in 2022 dollars) on the line, but rather than attempt to seriously quantify productivity, teams decided who to draft (hire) based on all sorts of irrelevant factors. Like any major sport, baseball productivity is much easier to quantify than in most real-world endeavors since the game is much simpler than "real" problems are. And, among major U.S. sports, baseball is the easiest sport to quantify, but this didn't stop baseball teams from spending a century overindexing on visually obvious criteria such as height and race.

I was reminded of this the other day when I saw a thread on Twitter where a very successful person talks about how they got started, saying that they were able to talk their way into an elite institution despite being unqualified, and uses this story to conclude that elite gatekeepers are basically just scouting for talent and that you just need to show people that you have talent:

One college related example from my life is that I managed to get into CMU with awful grades and awful SAT scores (I had the flu when I took the test :/)

I spent a month learning everything about CMU's CS department, then drove there and talked to professors directly. When I first showed up at the campus, the entrance office asked my GPA and SAT, then asked me to leave. But I managed to talk to one professor, who sent me to their boss, recursively till I was talking to the vice president of the school. He asked me why I'm good enough to go to CMU and I said "I'm not sure I am. All these other kids are really smart. I can leave now" and he interrupted me and reminded me how much agency it took to get into that room.

He gave me a handwritten acceptance letter on the spot ... I think one secret, at least when it comes to gatekeepers, is that they're usually just looking for high agency and talent.

I've heard this kind of story from other successful people, who tend to come to bimodal conclusions on what it all means. Some conclude that the world correctly recognized their talent and that this is how the world works; talent gets recognized and rewarded. Others conclude that the world is fairly random with respect to talent being rewarded and that they got lucky to get rewarded for their talent when many other people with similar talents who used similar strategies were passed over2.

Another time I was reminded of old baseball scouting reports was when I heard about how a friend of mine who's now an engineering professor at a top Canadian university got there. Let's call her Jane. When Jane was an undergrad at the university she's now a professor at, she was sometimes helpfully asked "are you lost?" when she was on campus. Sometimes this was because, as a woman, she didn't look like she was in the right place when she was in an engineering building. Other times, it was because she looked like and talked like someone from rural Canada. Once, a security guard thought she was a homeless person who had wandered onto campus. After a few years, she picked up the right clothes and mannerisms to pass as "the right kind of person", with help from her college friends, who explained to her how one is supposed to talk and dress, but when she was younger, people's first impression was that she was an admin assistant, and now their first impression is that she's a professor's wife because they don't expect a woman to be a professor in her department. She's been fairly successful, but it's taken a lot more work than it would've for someone who looked the part.

On whether or not, in her case, her gatekeepers were just looking for agency and talent, she once failed a civil engineering exam because she'd never heard of a "corn dog" and also barely passed an intro programming class she took where the professor announced that anyone who didn't already know how to program was going to fail.

The corn dog exam failure was because there was a question on a civil engineering exam where students were supposed to design a corn dog dispenser. My friend had never heard of a corn dog and asked the professor what a corn dog was. The professor didn't believe that she didn't know what a corn dog was and berated her in front of the entire class for asking a question that clearly couldn't be serious. Not knowing what a corn dog was, she designed something that put corn inside a hot dog and dispensed a hot dog with corn inside, which failed because that's not what a corn dog is.

It turns out the gatekeepers for civil engineering and programming were not, in fact, just looking for agency and were instead looking for someone who came from the right background. I suspect this is not so different from the CMU professor who admitted a promising student on the spot, it just happens that a lot of people pattern match "smart teenage boy with a story about why their grades and SAT scores are bad" to "promising potential prodigy" and "girl from rural Canada with the top grade in her high school class who hasn't really used a computer before and dresses like a poor person from rural Canada because she's paying for college while raising her younger brother because their parents basically abandoned both of them" to "homeless person who doesn't belong in engineering".

Another thing that reminded me of how funny baseball scouting reports are is a conversation I had with Ben Kuhn a while back.

Me: it's weird how tall so many of the men at my level (senior staff engineer) are at big tech companies. In recent memory, I think I've only been in a meeting with one man who's shorter than me at that level or above. I'm only 1" shorter than U.S. average! And the guy who's shorter than me has worked remotely for at least a decade, so I don't know if people really register his height. And people seem to be even taller on the management track. If I look at the VPs I've been in meetings with, they must all be at least 6' tall.
Ben: Maybe I could be a VP at a big tech company. I'm 6' tall!
Me: Oh, I guess I didn't know how tall 6' tall is. The VPs I'm in meetings with are noticeably taller than you. They're probably at least 6'2"?
Ben: Wow, that's really tall for a minimum. 6'2" is 96%-ile for U.S. adult male

When I've discussed this with successful people who work in big companies of various sorts (tech companies, consulting companies, etc.), men who would be considered tall by normal standards, 6' or 6'1", tell me that they're frequently the shortest man in the room during important meetings. 6'1" is just below the median height of a baseball player. There's something a bit odd about height seeming more correlated to success as a consultant or a programmer than in baseball, where height directly conveys an advantage. One possible explanation would be a halo effect, where positive associations about tall or authoritative seeming people contribute to their success.

When I've seen this discussed online, someone will point out that this is because height and cognitive performance are correlated. But if we look at the literature on IQ, the correlation isn't strong enough to explain something like this. We can also observe this if we look at fields where people's mental acuity is directly tested by something other than an IQ test, such as in chess, where most top players are around average height, with some outliers in both directions. Even without looking at the data in detail, this should be expected because the correlation between height and IQ is weak, with much of the correlation due to the relationship at the low end3, and the correlation between IQ and performance in various mental tasks is also weak (some people will say that it's strong by social science standards, but that's very weak in terms of actual explanatory power even when looking at the population level and it's even weaker at the individual level). And then if we look at chess in particular, we can see that the correlation is weak, as expected.

Since the correlation is weak, and there are many more people around average height than not, we should expect that most top chess players are around average height. If we look at the most dominant chess players in recent history, Carlsen, Anand, and Kasparov, they're 5'8", 5'8", and 5'9", respectively (if you look at different sources, they'll claim heights of plus or minus a couple inches, but still with a pretty normal range; people often exaggerate heights; if you look at people who try to do real comparisons either via photos or in person, measured heights are often lower than what people claim their own height is4).

It's a bit more difficult to find heights of go and shogi players, but it seems like the absolute top modern players from this list I could find heights for (Lee Sedol, Yoshiharu Habu) are roughly in the normal range, with there being some outliers in both directions among elite players who aren't among the best of all time, as with chess.

If it were the case that height or other factors in appearance were very strongly correlated with mental performance, we would expect to see a much stronger correlation between height and performance in activities that relatively directly measure mental performance, like chess, than we do between height and career success, but it's the other way around, which seems to indicate that the halo effect from height is stronger than any underlying benefits that are correlated with height.

If we look at activities where there's a fair amount of gatekeeping before people are allowed to really show their skills but where performance can be measured fairly accurately and where hiring better employees has an immediate, measurable, direct impact on company performance, such as baseball and hockey, we can see that people went with their gut instinct over data for decades after there were public discussions about how data-driven approaches found large holes in people's intuition.

If we then look at programming, where it's somewhere between extremely difficult and impossible to accurately measure individual performance and the impact of individual performance on company success is much less direct than in sports, how accurate should we expect talent assessment to be?

The pessimistic view is that it seems implausible that we should expect that talent assessment is better than in sports, where it took decades of there being fairly accurate and rigorous public write-ups of performance assessments for companies to take talent assessment seriously. With programming, talent assessment isn't even far enough along that anyone can write up accurate evaluations of people across the industry, so we haven't even started the decades long process of companies fighting to keep evaluating people based on personal opinions instead of accurate measurements.

Jobs have something equivalent to old school baseball scouting reports at multiple levels. At the hiring stage, there are multiple levels of filters that encode people's biases. A classic study on this is Marianne Bertrand and Sendhil Mullainathan's paper, which found that "white sounding" names on resumes got more callbacks for interviews than "black sounding" names and that having a "white sounding" name on the resume increased the returns to having better credentials on the resume. Since then, many variants of this study have been done, e.g., resumes with white sounding names do better than resumes with Asian sounding names, professors with white sounding names on their CVs are evaluated as having better interpersonal skills than professors with black and Asian sounding names on their CVs, etc.

The literature on promotions and leveling is much weaker, but I and other folks who are in highly selected environments that effectively require multiple rounds of screening, each against more and more highly selected folks, such as VPs, senior (as in "senior staff"+) ICs, professors at elite universities, etc., have observed that filtering on height is as severe as or more severe than in baseball but less severe than in basketball.

That's curious when, in mental endeavors where "promotion" is directly determined by performance, such as in chess, height appears to only be very weakly correlated with success. A major issue in the literature on this is that, in general, social scientists look at averages. In a lot of the studies, they simply produce a correlation coefficient. If you're lucky, they may produce a graph where, for each height, they produce an average of something or other. That's the simplest thing to do but this only provides a very coarse understanding of what's going on.

Because I like knowing how things tick, including organizations and people's opinions, I've (informally, verbally) polled a lot of engineers about what they thought about other engineers. What I found was that there was a lot of clustering of opinions, resulting in clusters of folks that had rough agreement about who did excellent work. Within each cluster, people would often disagree about the ranking of engineers, but they would generally agree on who was "good to excellent".

One cluster was (in my opinion; this could, of course, also just be my own biases) people who were looking at the output people produced and were judging people based on that. Another cluster was of people who were looking at some combination of height and confidence and were judging people based on that. This one was a mystery to me for a long time (I've been asking people questions like this and collating the data out of habit, long before I had the idea to write this post and, until I recognized the pattern, I found it odd that so many people who have good technical judgment, as evidenced by their ability to do good work and make comments showing good technical judgment, highly evaluated so many people who so frequently said blatantly incorrect things and produced poorly working or even non-working systems). Another cluster was around credentials, such as what school someone went to or what the person was leveled at or what prestigious companies they'd worked for. People could have judgment from multiple clusters, e.g., some folks would praise both people who did excellent technical work as well as people who are tall and confident. At higher levels, where it becomes more difficult to judge people's work, relatively fewer people based their judgment on people's output.

When I did this evaluation collating exercise at the startup I worked at, there was basically only one cluster and it was based on people's output, with fairly broad consensus about who the top engineers were, but I haven't seen that at any of the large companies I've worked for. I'm not going to say that means evaluation at that startup was fair (perhaps all of us were falling prey to the same biases), but at least we weren't falling prey to the most obvious biases.

Back to big companies, if we look at what it would take to reform the promotion system, it seems difficult to do because many individual engineers are biased. Some companies have committees handle promotions in order to reduce bias, but the major inputs to the system still have strong biases. The committee uses, as input, recommendations from people, many of whom let those biases have more weight than their technical judgment. Even if we, hypothetically, introduced a system that identified whose judgments were highly correlated with factors that aren't directly relevant to performance and gave those recommendations no weight, people's opinions often limit the work that someone can do. A complaint I've heard from some folks who are junior is that they can't get promoted because their work doesn't fulfill promo criteria. When they ask to be allowed to do work that could get them promoted, they're told they're too junior to do that kind of work. They're generally stuck at their level until they find a manager who believes in their potential enough to give them work that could possibly result in a promo if they did a good job. Another factor that interacts with this is that it's easier to transfer to a team where high-impact work is available if you're doing well and/or have high "promo velocity", i.e., are getting promoted frequently, and harder if you're doing poorly or even just have low promo velocity and aren't doing particularly poorly. At higher levels, it's uncommon to not be able to do high-impact work, but it's also very difficult to separate out the impact of individual performance and biases because a lot of performance is about who you can influence, which is going to involve trying to influence people who are biased if you need to do it at scale, which you generally do to get promoted at higher levels. The nested, multi-level, impact of bias makes it difficult to change the system in a way that would remove the impact of bias.

Although it's easy to be pessimistic when looking at the system as a whole, it's also easy to be optimistic when looking at what one can do as an individual. It's pretty easy to do what Bill Wight (the scout known for recommending "funny looking" baseball players) did and ignore what other people incorrectly think is important5. I worked for a company that did this, which had, by far, the best engineering team of any company I've ever worked for. They did this by ignoring the criteria other companies cared about, e.g., hiring people from non-elite schools instead of focusing on pedigree, not ruling people out for not having practiced solving abstract problems on a whiteboard that people don't solve in practice at work, not having cultural fit criteria that weren't related to job performance (they did care that people were self-directed and would function effectively when given a high degree of independence), etc.6

Thanks to Reforge - Engineering Programs and Flatirons Development for helping to make this post possible by sponsoring me at the Major Sponsor tier.

Also, thanks to Peter Bhat Harkins, Yossi Kreinin, Pam Wolf, Laurie Tratt, Leah Hanson, Kate Meyer, Heath Borders, Leo T M, Valentin Hartmann, Sam El-Borai, Vaibhav Sagar, Nat Welch, Michael Malis, Ori Berstein, Sophia Wisdom, and Malte Skarupke for comments/corrections/discussion.

Appendix: other factors

This post used height as a running example because it's something that's both easily observed to be correlated with success in men and has been studied across a number of fields. I would guess that social class markers / mannerisms, as in the Jane example from this post, have at least as much impact. For example, a number of people have pointed out to me that the tall, successful, people they're surrounded by say things with very high confidence (often incorrect things, but said confidently) and also have mannerisms that convey confidence and authority.

Other physical factors also seem to have a large impact. There's a fairly large literature on how much the halo effect causes people who are generally attractive to be rated more highly on a variety of dimensions, e.g., morality. There's a famous ask metafilter (reddit before there was reddit) answer to a question that's something like "how can you tell someone is bad?" and the most favorited answer (I hope for ironic reasons, although the answerer seemed genuine) is that they have bad teeth. Of course, in the U.S., having bad teeth is a marker of childhood financial poverty, not impoverished moral character. And, of course, gender is another dimension that people appear to filter on for reasons unrelated to talent or competence.

Another is just random luck. To go back to the baseball example, one of the few negative scouting reports on Chipper Jones came from a scout who said

Was not aggressive w/bat. Did not drive ball from either side. Displayed non-chalant attitude at all times. He was a disappointment to me. In the 8 games he managed to collect only 1 hit and hit very few balls well. Showed slap-type swing from L.side . . . 2 av. tools

Another scout, who saw him on more typical days, correctly noted

Definite ML prospect . . . ML tools or better in all areas . . . due to outstanding instincts, ability, and knowledge of game. Superstar potential.

Another similarly noted:

This boy has all the tools. Has good power and good basic approach at the plate with bat speed. Excellent make up and work-habits. Best prospect in Florida in the past 7 years I have been scouting . . . This boy must be considered for our [1st round draft] pick. Does everything well and with ease.

There's a lot of variance in performance. If you judge performance by watching someone for a short period of time, you're going to get wildly different judgements depending on when you watch them.

If you read the blind orchestra audition study that everybody cites, the study itself seems poor quality and unconvincing, but it also seems true that blind auditions were concomitant with an increase in orchestras hiring people who didn't look like what people expected musicians to look like. Blind auditions, where possible, seem like something good to try.

As noted previously, a professor remarked that doing hiring over zoom accidentally made height much less noticeable than normal and resulted in at least one university department hiring a number of professors who are markedly less tall than professors who were previously hired.

Me on how tech interviews don't even act as an effective filter for the main thing they nominally filter for.

Me on how prestige-focused tech hiring is.

@ArtiKel on Cowen and Gross's book on talent and on funding people over projects. A question I've had for a long time is whether the less-mainstream programs that convey prestige via some kind of talent selection process (Thiel Fellowship, grants from folks like Tyler Cowen, Patrick Collison, Scott Alexander, etc.) are less biased than traditional selection processes or just differently biased. The book doesn't appear to really answer this question, but it's food for thought. And BTW, I view these alternative processes as highly valuable even if they're not better and, actually, even if they're somewhat worse, because their existence gives the world a wider portfolio of options for talent spotting. But, even so, I would like to know if the alternative processes are better than traditional processes.

Alexey Guzey on where talent comes from.

An anonymous person on talent misallocation.

Thomas Ptacek on actually attempting to look at relevant signals when hiring in tech.

Me on the use of sleight of hand in an analogy meant to explain the importance of IQ and talent, where the sleight of hand is designed to make it seem like IQ is more important than it actually is.

Jessica Nordell on trans experiences demonstrating differences between how men and women are treated.

The Moneyball book, of course. Although, for the real nerdy details, I'd recommend reading the old baseballthinkfactory archives from back when the site was called "baseball primer". Fans were, in real time, calling out who would be successful, and generally doing so with greater success than baseball teams of the era. The site died off as baseball teams started taking stats seriously, leaving fan analysis in the dust since teams have access to both much better fine-grained data and more time to spend on serious analysis than hobbyists, but it was interesting to watch hobbyists completely dominate the profession using basic data analysis techniques.


  1. Jose Altuve comes from the modern era of statistics-driven decision making and therefore cannot be a counterexample. [return]
  2. There's a similar bimodal split when I see discussions among people who are on the other side of the table and choose who gets to join an elite institution vs. not. Some people are utterly convinced that their judgment is basically perfect ("I just know", etc.), and some people think that making judgment calls on people is a noisy process and you, at best, get weak signal. [return]
  3. Estimates range from 0 to 0.3, with Teasdale et al. finding that the correlation decreased over time (speculated to be due to better nutrition) and Teasdale et al. finding that the correlation was significantly stronger than on average in the bottom tail (bottom 2% of height) and significantly weaker than on average at the top tail (top 2% of height), indicating that much of the overall correlation comes from factors that cause both reduced height and IQ.

    In general, for a correlation coefficient of x, it will explain x^2 of the variance. So even if the correlation were not weaker at the high end and we had a correlation coefficient of 0.3, that would only explain 0.3^2 = 0.09 of the variance, i.e., 1 - 0.09 = 0.91 would be explained by other factors.

    [return]
  4. When I did online dating, I frequently had people tell me that I must be taller than I am because they're so used to other people lying about their heights on dating profiles that they associated my height with a larger number than the real number. [return]
  5. On the other side of the table, what one can do when being assessed, I've noticed that, at work, unless people are familiar with my work, they generally ignore me in group interactions, like meetings. Historically, things that have worked for me and gotten people to stop ignoring me were doing an unreasonably large amount of high-impact work in a short period of time (while not working long hours), often solving a problem that people thought was impossible to solve in the timeframe, which made it very difficult for people to not notice my work; another was having a person who appears more authoritative than me get the attention of the room and ask people to listen to me; and also finding groups (teams or orgs) that care more about the idea than the source of the idea. More recently, some things that have worked are writing this blog and using mediums where a lot of the cues that people use as proxies for competence aren't there (slack, and to a lesser extent, video calls).

    In some cases, the pandemic has accidentally caused this to happen in some dimensions. For example, a friend of mine mentioned to me that their university department did video interviews during the pandemic and, for the first time, hired a number of professors who weren't strikingly tall.

    [return]
  6. When at a company that has biases in hiring and promo, it's still possible to go scouting for talent in a way that's independent of the company's normal criteria. One method that's worked well for me is to hire interns, since the hiring criteria for interns tends to be less strict. Once someone is hired as an intern, if their work is great and you know how to sell it, it's easy to get them hired full-time.

    For example, at Twitter, I hired two interns to my team. One, as an intern, wrote the kernel patch that solved the container throttling problem (at the margin, worth hundreds of millions of dollars a year) and has gone on to do great, high-impact, work as a full-time employee. The other, as an intern, built out across-the-fleet profiling, a problem many full-time staff+ engineers had wanted to solve but that no one had solved and is joining Twitter as a full-time employee this fall. In both cases, the person was overlooked by other companies for silly reasons. In the former case, there was a funny combination of reasons other companies weren't interested in hiring them for a job that utilized their skillset, including location / time zone (Australia). From talking to them, they clearly had deep knowledge about computer performance that would be very rare even in an engineer with a decade of "systems" experience. There were jobs available to them in Australia, but teams doing performance work at the other big tech companies weren't really interested in taking on an intern in Australia. For the kind of expertise this person had, I was happy to shift my schedule to a bit late for a while until they ramped up, and it turned out that they were highly independent and didn't really need guidance to ramp up (we talked a bit about problems they could work on, including the aforementioned container throttling problem, and then they came back with some proposed approaches to solve the problem and then solved the problem). In the latter case, they were a student who was very early in their university studies. The most desirable employers often want students who have more classwork under their belt, so we were able to hire them without much competition. Waiting until a student has a lot of classes under their belt might be a good strategy on average, but this particular intern candidate had written some code that was good for someone with that level of experience and they'd shown a lot of initiative (they reverse engineered the server protocol for a dying game in order to reimplement a server so that they could fix issues that were killing the game), which is a much stronger positive signal than you'll get out of interviewing almost any 3rd year student who's looking for an internship.

    Of course, you can't always get signal on a valuable skill, but if you're actively scouting for people, you don't need to always get signal. If you occasionally get a reliable signal and can hire people who you have good signal on who are underrated, that's still valuable! For Twitter, in three intern seasons, I hired two interns, the first of whom already made "staff" and the second of whom should get there very quickly based on their skills as well as the impact of their work. In terms of ROI, spending maybe 30 hours a year on the lookout for folks who had very obvious signals indicating they were likely to be highly effective was one of the most valuable things I did for the company. The ROI would go way down if the industry as a whole ever started using effective signals when hiring but, for the reasons discussed in the body of this post, I expect progress to be slow enough that we don't really see the amount of change that would make this kind of work low ROI in my lifetime.

    [return]

A decade of major cache incidents at Twitter

2022-02-02 08:00:00

This was co-authored with Yao Yue

This is a collection of information on severe (SEV-0 or SEV-1, the most severe incident classifications) incidents at Twitter that were at least partially attributed to cache from the time Twitter started using its current incident tracking JIRA (2012) to date (2022), with one bonus incident from before 2012. Not including the bonus incident, there were 6 SEV-0s and 6 SEV-1s that were at least partially attributed to cache in the incident tracker, along with 38 less severe incidents that aren't discussed in this post.

There are a couple reasons we want to write this down. First, historical knowledge about what happens at tech companies is lost at a fairly high rate and we think it's nice to preserve some of it. Second, we think it can be useful to look at incidents and reliability from a specific angle, putting all of the information into one place, because that can sometimes make some patterns very obvious.

On knowledge loss, when we've seen viral Twitter threads or other viral stories about what happened at some tech company, when we look into what happened, the most widely spread stories are usually quite wrong, generally for banal reasons. One reason is that outrageously exaggerated stories are more likely to go viral, so those are the ones that tend to be remembered. Another is that there's a cottage industry of former directors / VPs who tell self-aggrandizing stories about all the great things they did that, to put it mildly, frequently distort the truth (although there's nothing stopping ICs from doing this, the most spread false stories we see tend to come from people on the management track). In both cases, there's a kind of Gresham's law of stories in play, where incorrect stories tend to win out over correct stories.

And even when making a genuine attempt to try to understand what happened, it turns out that knowledge is lost fairly quickly. For this and other incident analysis projects we've done, links to documents and tickets from the past few years tend to work (90%+ chance), but older links are less likely to work, with the rate getting pretty close to 0% by the time we're looking at things from 2012. Sometimes, people have things squirreled away in locked down documents, emails, etc. but those will often link to things that are now completely dead, and figuring out what happened requires talking to a bunch of people who will, due to the nature of human memory, give you inconsistent stories that you need to piece together1.

On looking at things from a specific angle, while looking at failures broadly and classifying and collating all failures is useful, it's also useful to drill down into certain classes of failures. For example, Rebecca Isaacs and Dan Luu did an (internal, non-public) analysis of Twitter failover tests (from 2018 to 2020) which found a number of things that led to operational changes. In some sense, there was no new information in the analysis since the information we got all came from various documents that already existed, but putting it into one place made a number of patterns obvious that weren't obvious when looking at incidents one at a time across multiple years.

This document shouldn't cause any changes at Twitter since looking at what patterns exist in cache incidents over time and what should be done about that has already been done, but collecting these into one place may still be useful to people outside of Twitter.

As for why we might want to look at cache failures (as opposed to failures in other systems), cache is relatively commonly implicated in major failures, as illustrated by this comment Yao made during an internal Twitter War Stories session (referring to the dark ages of Twitter, in operational terms):

Every single incident so far has at least mentioned cache. In fact, for a long time, cache was probably the #1 source of bringing the site down for a while.

In my first six months, every time I restarted a cache server, it was a SEV-0 by today's standards. On a good day, you might have 95% Success Rate (SR) [for external requests to the site] if I restarted one cache ...

Also, the vast majority of Twitter cache is (a fork of) memcached2, which is widely used elsewhere, making the knowledge more generally applicable than if we discussed a fully custom Twitter system.

More generally, caches are a nice source of relatively clean real-world examples of common distributed systems failure modes because of how simple caches are. Conceptually, a cache server is a high-throughput, low-latency RPC server plus a library that manages data, such as memory and/or disk and key value indices. For in memory caches, the data management side should be able to easily outpace the RPC side (a naive in-memory key-value library should be able to hit millions of QPS per core, whereas a naive RPC server that doesn't use userspace networking, batching and/or pipelining, etc. will have problems getting to 1/10th that level of performance). Because of the simplicity of everything outside of the RPC stack, cache can be thought of as an approximation of nearly pure RPC workloads, which are frequently important in heavily service-oriented architectures.
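To make the division of labor concrete, here's a toy sketch of the "data management" half (in Python, which isn't what real caches are written in; the class and numbers are illustrative, not any real cache's code). Even this naive, interpreted version typically does a few million operations per second on one core, which is why the RPC/networking side, not the key-value side, is where most of the engineering effort and most of the failure modes live.

import time

class NaiveKVStore:
    """A naive in-memory key-value store: the easy half of a cache server."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

if __name__ == "__main__":
    store = NaiveKVStore()
    n = 1_000_000
    start = time.perf_counter()
    for i in range(n):
        store.set(i % 10_000, i)
        store.get(i % 10_000)
    elapsed = time.perf_counter() - start
    # One set plus one get per iteration; typically prints a few million ops/sec.
    print(f"{2 * n / elapsed / 1e6:.1f}M ops/sec on one core")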

When scale and performance are concerns, cache will frequently use sharded clusters, which then subject cache to the constraints and pitfalls of distributed systems (but with less emphasis on synchronization issues than with some other workloads, such as strongly consistent distributed databases, due to the emphasis on performance). Also, by the nature of distributed systems, users of cache will be exposed to these failure modes and be vulnerable to or possibly implicated in failures caused by the cascading impact of some kinds of distributed systems failures.

Cache failure modes are also interesting because, when cache is used to serve a significant fraction of requests or fraction of data, cache outages or even degradation can easily cause a total outage because an architecture designed with cache performance in mind will not (and should not) have backing DB store performance that's sufficient to keep the site up.

Compared to most workloads, cache is more sensitive to performance anomalies below it in the stack (e.g., kernel, firmware, hardware, etc.) because it tends to have relatively high-volume and low-latency SLOs (because the point of cache is that it's fast) and it spends (barring things like userspace networking) a lot of time in kernel (~80% as a ballpark for Twitter memcached running normal kernel networking). Also, because cache servers often run a small number of threads, cache is relatively sensitive to being starved by other workloads sharing the same underlying resources (CPU, memory, disk, etc.). The high volume and low latency SLOs worsen positive feedback loops that lead to a "death spiral", a classic distributed systems failure mode.

When we look at the incidents below, we'll see that most aren't really due to errors in the logic of cache, but rather, some kind of anomaly that causes an insufficiently mitigated positive feedback loop that becomes a runaway feedback loop.

So, when reading the incidents below, it may be helpful to read them with an eye towards how cache interacts with things above cache in the stack that call caches and things below cache in the stack that cache interacts with. Something else to look for is how frequently a major incident occurred due to an incompletely applied fix for an earlier incident or because something that was considered a serious operational issue by an engineer wasn't prioritized. These were both common themes in the analysis Rebecca Isaacs and Dan Luu did on causes of failover test failures as well.

2011-08 (SEV-0)

For a few months, a significant fraction of user-initiated changes (such as username, screen name, and password) would get reverted. There was continued risk of this for a couple more years.

Background

At the time, the Rails app had single threaded workers, managed by a single master that did health checks, redeploys, etc. If a worker got stuck for 30 seconds, the master would kill the worker and restart it.

Teams were running on bare metal, without the benefit of a cluster manager like mesos or kubernetes. Teams had full ownership of the hardware and were responsible for kernel upgrades, etc.

The algorithm for deciding which shard a key would land on involved a hash. If a node went away, the keys that previously hashed to that node would end up getting hashed to other nodes. Each worker had a client that made its own independent routing decisions to figure out which cache shard to talk to, which means that each worker made independent decisions as to which cache nodes were live and where keys should live. If a client thinks that a host isn't "good" anymore, that host is said to be ejected.
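A toy sketch of why independent, per-client routing decisions lead to the same key living on multiple shards (illustrative Python, not the actual client; the hash scheme and host names are made up):

import hashlib

class CacheClient:
    """Each client keeps its own private view of which cache hosts are live."""

    def __init__(self, hosts):
        self.hosts = list(hosts)
        self.ejected = set()   # hosts this particular client has ejected

    def shard_for(self, key):
        live = [h for h in self.hosts if h not in self.ejected]
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return live[h % len(live)]

    def eject(self, host):
        self.ejected.add(host)

hosts = ["cache1", "cache2", "cache3", "cache4"]
client_a, client_b = CacheClient(hosts), CacheClient(hosts)

# Client A sees a transient error talking to cache2 and ejects it; client B doesn't.
client_a.eject("cache2")

key = "user:12345"
print(client_a.shard_for(key), client_b.shard_for(key))
# The two clients can now disagree about which shard owns a key, so reads and
# writes for the same user can land on different shards, leaving multiple
# (possibly stale) copies of the user cached at once.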

Incident

On Nov 8, a user changed their name from [old name] to [new name]. One week later, their username reverted to [old name].

Between Nov 8th and early December, tens of these tickets were filed by support agents. Twitter didn't have the instrumentation to tell where things were going wrong, so the first two weeks of investigation was mostly getting metrics into the rails app to understand where the issue was coming from. Each change needed to be coordinated with the deploy team, which would take at least two hours. After the rails app was sufficiently instrumented, all signs pointed to cache as the source of the problem. The full set of changes needed to really determine if cache was at fault took another week or two, which included adding metrics to track cache inconsistency, cache exception paths, and host ejection.

After adding instrumentation, an engineer made the following comment on a JIRA ticket in early December:

I turned on code today to allow us to see the extent to which users in cache are out of sync with users in the database, at the point where we write the user in cache back to the database. The number is roughly 0.2% ... Checked 150 popular users on Twitter to see how many caches they were in (should be at most one). Most of them were on at least two, with some on as many as six.

The first fix was to avoid writing stale data back to the DB. However, that didn't address the issue of having multiple copies of the same data in different cache shards. The second fix, intended to reduce the number of times keys appeared in multiple locations, was to retry multiple times before ejecting a host. The idea is that, if a host is really permanently down, that will trigger an alert, but alerts for dead hosts weren't firing, so the errors that were causing host ejections should be transient and therefore, if a client keeps retrying, it should be able to find a key "where it's supposed to be". And then, to prevent flapping keys from hosts having many transient errors, the time that ejected hosts were kept ejected was increased.

This change was tested on one cache and then rolled out to other caches. Rolling out the change to all caches immediately caused the site to go down because ejections still occurred and the longer ejection time caused the backend to get stressed. At the time, the backend was MySQL, which, as configured, could take an arbitrarily long amount of time to return a request under high load. This caused workers to take an arbitrarily long time to return results, which caused the master to kill workers, which took down the site when this happened at scale since not enough workers were available to serve requests.

After rolling back the second fix, users could still see stale data since, even though stale data wasn't being written back to the DB, cache updates could happen to a key in one location and then a client could read a stale, cached, copy of that key in another location. Another mitigation that was deployed was to move the user data cache from a high utilization cluster to a low utilization cluster.

After debugging further, it was determined that retrying could address ejections occurring due to "random" causes of tail latency, but there was still a high rate of ejections coming from some kind of non-random cause. From looking at metrics, it was observed that there was sometimes a high rate of packet loss and that this was correlated with incoming packet rate but not bandwidth usage. Looking at the host during times of high packet rate and packet loss showed that CPU0 was spending 65% to 70% of time handling soft IRQs, indicating that the packet loss was likely coming from CPU0 not being able to keep up with the packet arrival rate.

The fix for this was to set IRQ affinity to spread incoming packet processing across all of the physical cores on the box. After deploying the fix, packet loss and cache inconsistency was observed on the new cluster that user data was moved to but not the old cluster.
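For reference, the mechanics of this kind of fix look roughly like the sketch below, which round-robins a NIC's interrupts across CPUs by writing masks to /proc/irq/<n>/smp_affinity. This requires root, the interface name ("eth0") is an assumption, and exactly how IRQs show up in /proc/interrupts varies by driver, so treat this as an illustration rather than a drop-in script; in practice this would be baked into standardized host configuration (the "standardized settings for cache hosts" that comes up again in the 2014-01 incident below).

import os
import re

def nic_irqs(interface="eth0"):
    """Find IRQ numbers whose /proc/interrupts line mentions the NIC."""
    irqs = []
    with open("/proc/interrupts") as f:
        for line in f:
            if interface in line:
                m = re.match(r"\s*(\d+):", line)
                if m:
                    irqs.append(int(m.group(1)))
    return irqs

def spread_irqs(irqs, num_cpus=None):
    """Round-robin IRQs across CPUs by writing a one-CPU hex mask per IRQ.

    Masks for machines with more than 32 CPUs need comma-separated 32-bit
    groups, which this sketch doesn't handle.
    """
    num_cpus = num_cpus or os.cpu_count()
    for i, irq in enumerate(irqs):
        mask = 1 << (i % num_cpus)
        with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
            f.write(f"{mask:x}\n")

if __name__ == "__main__":
    spread_irqs(nic_irqs("eth0"))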

At this point, it's late December. Looking at other clusters, it was observed that some other clusters also had packet loss. Looking more closely, the packet loss was happening every 20 hours and 40 minutes on some specific machines. All machines that had this issue were a particular hardware SKU with a particular BIOS version (the latest version; machines from that SKU with earlier BIOS versions were fine). It turned out that hosts with this BIOS version were triggering the BMC to run a very expensive health check every 20 hours and 40 minutes which interrupted the kernel for the duration, preventing any packets from being processed, causing packet drops.

It turned out that someone from the kernel team had noticed this exact issue about six months earlier and had tried to push a kernel config change that would fix the issue (increasing the packet ring buffer size so that transient issues wouldn't cause the packet drops when the buffer overflowed). Although that ticket was marked resolved, the fix was never widely rolled out for reasons that are unclear.

A quick mitigation that was deployed was to stagger host reboot times so that clusters didn't have coordinated packet drops across the entire cluster at the same time.

Because the BMC version needs to match the BIOS version and the BMC couldn't be rolled back, it wasn't possible to fix the issue by rolling back the BIOS. In order to roll the BMC and BIOS forward, the HWENG team had to do emergency testing/qualification of those, which was done as quickly as possible, at which point the BIOS fix was rolled out and the packet loss went away.

The total time for everything combined was about two months.

However, this wasn't a complete fix since the host ejection behavior was still unchanged and any random issue that caused one or more clients but not all clients to eject a cache shard would still result in inconsistency. Fixing that required changing cache architectures, which couldn't be quickly done (that took about two years).

Mitigations / fixes:

Lessons learned:

2012-07 (SEV-1)

Non-personalized trends didn't show up for ~10% of users, who got an empty trends box, for about 10 hours.

An update to the rails app was deployed, after which the trends cache stopped returning results. This only impacted non-personalized trends because those were served directly from rails (personalized trends were served from a separate service).

Two hours in, it was determined that this was due to segfaults in the daemon that refreshes the trends cache, which was due to running out of memory. The reason this happened was that the deployed change added a Thrift field to the Trend object, which increased the trends cache refresh daemon memory usage beyond the limit.

There was an alert on the trends cache daemon failing, but it only checked for the daemon starting a run successfully, not for it finishing a run successfully.

Mitigations / fixes:

Lessons learned

2012-07 (SEV-0)

This was one of the more externally well-known Twitter incidents because this one resulted in the public error page showing, with no images or CSS:

Twitter is currently down for <%= reason %>

We expect to be back in <%= deadline %>

The site was significantly impacted for about four hours.

The information on this one is a bit sketchy since records from this time are highly incomplete (the JIRA ticket for this notes, "This incident was heavily Post-Mortemed and reviewed. Closing incident ticket.", but written documentation on the incident has mostly been lost).

The trigger for this incident was power loss in two rows of racks. In terms of the impact on cache, 48 hosts lost power and were restarted when power came back up, one hour later. 37 of those hosts had their caches fail to come back up because a directory that a script expected to exist wasn't mounted on those hosts. "Manually" fixing the layouts on those hosts took 30 minutes and caches came back up shortly afterwards.

The directory wasn't actually necessary for running a cache server, at least as they were run at Twitter at the time. However, there was a script that checked for the existence of the directory on startup that was not concurrently updated when the directory was removed from the layout setup script a month earlier.

Something else that increased debugging time was that /proc wasn't mounted properly on hosts when they came back up. Although that wasn't the issue, it was unusual and it took some time to determine that it wasn't part of the incident and was an independent non-urgent issue to be fixed.

If the rest of the site were operating perfectly, the cache issue above wouldn't have caused such a severe incident, but a number of other issues in combination caused a total site outage that lasted for an extended period of time.

Some other issues were:

Cache mitigations / fixes:

Other mitigations / fixes (highly incomplete):

Lessons learned:

2013-01 (SEV-0)

Site outage for 3h30m

An increase in load (AFAIK, normal for the day, not an outlier load spike) caused a tail latency increase on cache. The tail latency increase on cache was caused by IRQ affinities not being set on new cache hosts, which caused elevated queue lengths and therefore elevated latency.

Increased cache latency along with the design of tweet service using cache caused shards of the service using cache to enter a GC death spiral (more latency -> more outstanding requests -> more GC pressure -> more load on the shard -> more latency), which then caused increased load on remaining shards.

At the time, the tweet service cache and user data cache were colocated onto the same boxes, with 1 shard of tweet service cache and 2 shards of user data cache per box. Tweet service cache added the new hosts without incident. User data cache then gradually added the new hosts over the course of an evening, also initially without incident. But when morning peak traffic arrived (peak traffic is in the morning because that's close to both Asian and U.S. peak usage times, with Asian countries generally seeing peak usage outside of "9-5" work hours and U.S. peak usage during work hours), that triggered the IRQ affinity issue. Tweet service was much more impacted by the IRQ affinity issue than the user data service.

Mitigations / fixes:

2013-09 (SEV-1)

Overall site success rate dropped to 92% in one datacenter. Users were impacted for about 15 minutes.

The timeline service lost access to about 75% of one of the caches it uses. The cache team made a serverset change for that cache and the timeline service wasn't using the recommended mechanism to consume the cache serverset path and didn't "know" which servers were cache servers.

Mitigations / fixes:

2014-01 (SEV-0)

The site went down in one datacenter, impacting users whose requests went to that datacenter for 20 minutes.

The tweet service started sending elevated load to caches. A then-recent change removed the cap on the number of connections that could be made to caches. At the time, when caches hit around ~160k connections, they would fail to accept new connections. This caused the monitoring service to be unable to connect to cache shards, which caused the monitoring service to restart cache shards, causing an outage.

In the months before the outage, there were five tickets describing various ingredients for the outage.

In one ticket, a follow-up to a less serious incident caused by a combination of bad C-state configs and SMIs, it was noted that caches stopped accepting connections at ~160k connections. An engineer debugged the issue in detail, figured out what was going on, and suggested a number of possible paths to mitigating the issue.

One ingredient is that, especially when cache is highly loaded, the cache process may not have accepted the connection even though the kernel has already established the TCP connection.

The client doesn't "know" that the connection isn't really open to the cache and will send a request and wait for a response. Finagle may open multiple connections if it "thinks" that more concurrency is needed. After 150ms, the request will time out. If the queue is long on the cache side, this is likely to be before the cache has even attempted to do anything about the request.

After the timeout, Finagle will try again and open another connection, causing the cache shard to become more overloaded each time this happens.

On the client side, each of these requests causes a lot of allocations, causing a lot of GC pressure.

At the time, settings allowed for 5 requests before marking a node as unavailable for 30 seconds, with 16 connection parallelism and each client attempting to connect to 3 servers. When all those numbers were multiplied out by the number of shards, that allowed the tweet service to hit the limits of what cache can handle before connections stop being accepted.

On the cache side, there was one dispatcher thread and N worker threads. The dispatcher thread would call listen and accept and then put work onto queues for worker threads. By default, the backlog length was 1024. When accept failed due to an fd limit, the dispatcher thread set backlog to 0 in listen and ignored all events coming to listening fds. Backlog got reset to normal and connections were accepted again when a connection was closed, freeing up an fd.
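In pseudocode-ish Python (an illustration of the behavior described above, not twemcache's actual C implementation), the dispatcher's accept loop behaves roughly like this; the port number and worker count are made up, and the worker threads that drain the queues and signal fd_freed when they close a connection are omitted:

import errno
import queue
import socket
import threading

NUM_WORKERS = 4
work_queues = [queue.Queue() for _ in range(NUM_WORKERS)]
fd_freed = threading.Event()  # set by a worker whenever it closes a connection

def dispatcher(listen_sock):
    accepting = True
    next_worker = 0
    while True:
        if not accepting:
            fd_freed.wait()              # block until a worker frees an fd
            fd_freed.clear()
            listen_sock.listen(1024)     # restore the normal backlog
            accepting = True
        try:
            conn, _ = listen_sock.accept()
        except OSError as e:
            if e.errno in (errno.EMFILE, errno.ENFILE):
                listen_sock.listen(0)    # shrink the backlog to 0
                accepting = False        # ignore the listening socket for now
                continue
            raise
        work_queues[next_worker].put(conn)   # round-robin to worker threads
        next_worker = (next_worker + 1) % NUM_WORKERS

if __name__ == "__main__":
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("0.0.0.0", 11211))
    sock.listen(1024)
    dispatcher(sock)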

Before the major incident, it was observed that after the number of connections gets "too high", connections start getting rejected. After a period of time, the backpressure caused by rejected connections would allow caches to recover.

Another ingredient to the issue was that, on one hardware SKU, there were OOMs when the system ran out of 32kB pages under high cache load, which would increase load to caches that didn't OOM. This was fixed by a Twitter kernel engineer in

commit 96c7a2ff21501691587e1ae969b83cbec8b78e08
Author: Eric W. Biederman <[email protected]>
Date:   Mon Feb 10 14:25:41 2014 -0800

    fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem
    
    Recently due to a spike in connections per second memcached on 3
    separate boxes triggered the OOM killer from accept.  At the time the
    OOM killer was triggered there was 4GB out of 36GB free in zone 1.  The
    problem was that alloc_fdtable was allocating an order 3 page (32KiB) to
    hold a bitmap, and there was sufficient fragmentation that the largest
    page available was 8KiB.
    
    I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious
    but I do agree that order 3 allocations are very likely to succeed.
    
    There are always pathologies where order > 0 allocations can fail when
    there are copious amounts of free memory available.  Using the pigeon
    hole principle it is easy to show that it requires 1 page more than 50%
    of the pages being free to guarantee an order 1 (8KiB) allocation will
    succeed, 1 page more than 75% of the pages being free to guarantee an
    order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of
    the pages being free to guarantee an order 3 allocate will succeed.
    
    A server churning memory with a lot of small requests and replies like
    memcached is a common case that if anything can will skew the odds
    against large pages being available.
    
    Therefore let's not give external applications a practical way to kill
    linux server applications, and specify __GFP_NORETRY to the kmalloc in
    alloc_fdmem.  Unless I am misreading the code and by the time the code
    reaches should_alloc_retry in __alloc_pages_slowpath (where
    __GFP_NORETRY becomes signification).  We have already tried everything
    reasonable to allocate a page and the only thing left to do is wait.  So
    not waiting and falling back to vmalloc immediately seems like the
    reasonable thing to do even if there wasn't a chance of triggering the
    OOM killer.
    
    Signed-off-by: "Eric W. Biederman" <[email protected]>
    Cc: Eric Dumazet <[email protected]>
    Acked-by: David Rientjes <[email protected]>
    Cc: Cong Wang <[email protected]>
    Cc: <[email protected]>
    Signed-off-by: Andrew Morton <[email protected]>
    Signed-off-by: Linus Torvalds <[email protected]>

and is another example of why companies the size of Twitter get value out of having a kernel team.

Another ticket noted the importance of having standardized settings for cache hosts for things like IRQ affinity, C-states, turbo boost, NIC bonding, and firmware version, which was a follow up to another ticket noting that the tweet service sometimes saw elevated latency on some hosts, which was ultimately determined to be due to increased SMIs after a kernel upgrade impacting one hardware SKU type due to some interactions between the kernel and the firmware version.

Cache Mitigations / fixes:

Tests with these mitigations indicated that, even without fixes to clients to prevent clients from "trying to" overwhelm caches, these prevented cache from falling over under conditions similar to the incident.

Tweet service Mitigations / fixes:

Lessons learned:

2014-03 (SEV-0)

A tweet from Ellen was retweeted very frequently during the Oscars, which resulted in search going down for about 25 minutes as well as a site outage that prevented many users from being able to use the site.

This incident had a lot of moving parts. From a cache standpoint, this was another example of caches becoming overloaded due to badly behaved clients.

It's similar to the 2014-01 incident we looked at, except that the cache-side mitigations put in place for that incident weren't sufficient because the "attacking" clients picked more aggressive values than were used by the tweet service during the 2014-01 incident and, by this time, some caches were running in containerized environments on shared mesos, which made them vulnerable to throttling death spirals.

The major fix to this direct problem was to add pipelining to the Finagle memcached client, allowing most clients to get adequate throughput with only 1 or 2 connections, reducing the probability of clients hammering caches until they fall over.
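The idea behind pipelining, sketched here against memcached's text protocol, is to write many requests onto a single connection before reading any responses, so one or two connections can keep a shard busy instead of needing a connection (and a retry storm) per outstanding request. This is not the Finagle implementation; it assumes a memcached-compatible server on localhost:11211 and cuts corners on response parsing.

import socket

def pipelined_get(keys, host="localhost", port=11211):
    """Send all get requests up front; the server answers them in order."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(b"".join(f"get {k}\r\n".encode() for k in keys))
        buf = b""
        # A real client parses VALUE lines and byte counts; counting the
        # "END\r\n" terminators (one per get request) is enough for a sketch.
        while buf.count(b"END\r\n") < len(keys):
            chunk = sock.recv(65536)
            if not chunk:
                break
            buf += chunk
        return buf

if __name__ == "__main__":
    print(pipelined_get(["user:1", "user:2", "user:3"]))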

For other services, there were close to 50 fixes put into place across many services. Some major themes for the fixes were:

2016-01 (SEV-0)

SMAP, a former Japanese boy band that became a popular adult J-pop group as well as the hosts of a variety show that was frequently the #1 watched show in Japan, held a conference to falsely deny rumors they were going to break up. This resulted in an outage in one datacenter that impacted users routed to that datacenter for ~20 minutes, until that DC was failed away from. It took about six hours for services in the impacted DC to recover.

The tweet service in one DC had a load spike, which caused 39 cache shard hosts to OOM kill processes on those hosts. The cluster manager didn't automatically remove the dead nodes from the server set because there were too many dead nodes (it will automatically remove nodes if a few fail, but if too many fail, this change is not automated due to the possibility of exacerbating some kind of catastrophic failure with an automated action since removing nodes from a cache server set can cause traffic spikes to persistent storage). When cache oncalls manually cleaned up the dead nodes, the service that should have restarted them failed to do so because a puppet change had accidentally removed cache related configs for the service that would normally restart the nodes. Once the bad puppet commit was reverted, the cache shards came back up, but these initially came back too slowly and then later came back too quickly, causing recovery of tweet service success rate to take an extended period of time.

The cache shard hosts were OOM killed because too much kernel socket buffer memory was allocated.

The initial fix for this was to limit TCP buffer size on hosts to 4 GB, but this failed a stress test. It was determined that memory fragmentation on hosts with high uptime (2 years) was the reason for the failure, and the mitigation was to reboot hosts more frequently to clean up fragmentation.

Mitigations / fixes:

2016-02 (SEV-1)

This was the failed stress test from the 2016-01 SEV-0 mentioned above. This mildly degraded success rate to the site for a few minutes until the stress test was terminated.

2016-07 (SEV-1)

A planned migration of user data cache from dedicated hosts to Mesos led to significant service degradation in one datacenter and then minor degradation in another datacenter. Some existing users were impacted and basically all new user signups failed for about half an hour.

115 new cache instances were added to a serverset as quickly as the cluster manager could add them, reducing cache hit rates. The cache cluster manager was expected to add 1 shard every 20 minutes, but the configuration change accidentally changed the minimum cache cluster size, which "forced" the cluster manager to add the nodes as quickly as it could.

Adding so many nodes at once reduced user data cache hit rate from the normal 99.8% to 84%. In order to stop this from getting worse, operators killed the cluster manager to prevent it from adding more nodes to the serverset and then redeployed the cluster manager in its previous state to restore the old configuration, which immediately improved user data cache hit rate.

During the time period cache hit rate was degraded, the backing DB saw a traffic spike that caused long GC pauses. This caused user data service requests that missed cache to have a 0% success rate when querying the backing DB.
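As a back-of-the-envelope check on why this hit the DB so hard: going from a 99.8% to an 84% hit rate means the miss rate, i.e., the fraction of requests that fall through to the backing DB, goes from 0.2% to 16%, roughly an 80x increase in DB load at a constant request rate.

normal_miss = 1 - 0.998    # 0.2% of requests normally hit the backing DB
degraded_miss = 1 - 0.84   # 16% of requests hit the backing DB during the incident
print(f"DB load multiplier: {degraded_miss / normal_miss:.0f}x")   # ~80x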

Although there was rate limiting in place to prevent overloading the backing DB, the thresholds were too high to trigger. In order to recover the backing DB, operators did a rolling restart and deployed strict rate limits. Because one datacenter had been failed away from due to the above, traffic in another datacenter was elevated and the strict rate limit was hit there. This caused mildly reduced success rate in the user data service because requests were getting rejected by the strict rate limit, which is why this incident also impacted a datacenter that wasn't impacted by the original cache outage.

Mitigations / fixes:

2018-04 (SEV-0)

A planned test datacenter failover caused a partial site outage for about 1 hour. Degraded success rate was noticed 1 minute into the failover. The failover test was immediately reverted, but it took most of an hour for the site to fully recover.

The initial site degradation came from increased error rates in the user data service, which was caused by cache hot keys. There was a mechanism intended to cache hot keys, which sampled 1% of events (with sampling being used in order to reduce overhead, the idea being that if a key is hot, it should be noticed even with sampling) and put sampled keys into a FIFO queue with a hash map to count how often each key appears in the queue.
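A toy version of that mechanism (illustrative Python; the window size, threshold, and names are made up, not Twitter's values) looks something like this. The 1% sample is what bites in the next paragraph: a key needs on the order of a hundred requests per sampled observation, so a key with very large, expensive values can saturate a shard before it accumulates enough samples to cross the threshold.

import random
from collections import Counter, deque

class HotKeyDetector:
    """Sample ~1% of requests; count sampled keys over a fixed-size FIFO window."""

    def __init__(self, sample_rate=0.01, window_size=1000, threshold=50):
        self.sample_rate = sample_rate
        self.window = deque()     # FIFO of sampled keys
        self.window_size = window_size
        self.counts = Counter()   # key -> occurrences currently in the window
        self.threshold = threshold

    def observe(self, key):
        """Returns True when a sampled request pushes the key over the hot threshold."""
        if random.random() >= self.sample_rate:
            return False          # request not sampled
        self.window.append(key)
        self.counts[key] += 1
        if len(self.window) > self.window_size:
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]
        return self.counts[key] >= self.threshold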

Although this mechanism worked for previous high load events, there were some instances where it didn't work as well as intended (though it wasn't a root cause in an incident) when values were large, because with large (and therefore expensive) values, the 1% sampling rate wouldn't allow the cache to "notice" a hot key quickly enough. The original hot key detection logic was designed for tweet service cache, where the largest keys were about 5KB. This same logic was then used for other caches, where keys can be much larger. User data cache wasn't a design consideration for hot keys because, at the time hot key promotion was designed, the items that would've been the hottest user data keys were served from an in-process cache, so the user data cache wasn't having hot key issues.

The large key issue was exacerbated by the use of FNV1-32 for key hashing, which ignores the least significant byte. The data set that was causing a problem had a lot of its variance inside the last byte, so the use of FNV1-32 caused all of the keys with large values to be stored on a small number of cache shards. There were suggestions to migrate off of FNV1-32 at least as far back as 2014 for this exact reason and a more modern hash function was added to a utility library, but some cache owners chose not to migrate.

Because the hot key promotion logic didn't trigger, traffic to the hot cache shards saturated NIC bandwidth to the shards that had hot keys and were using 1Gb NICs (Twitter hardware is generally heterogeneous unless someone ensures that clusters only have specific characteristics; although many cache hosts had 10Gb NICs, many also had 1Gb NICs).

Fixes / mitigations:

2018-06 (SEV-1)

During a test data center failover, success rate for some kinds of actions dropped to ~50% until the test failover was aborted, about four minutes later.

From a cache standpoint, the issue was that tweet service cache shards were able to handle much less traffic than expected (about 50% as much traffic) based on load tests that weren't representative of real traffic, resulting in the tweet service cache being under provisioned. Among the things that made the load test setup unrealistic were:

Also, a reason for degraded cache performance was that, once a minute, container-based performance counter collection was run for ten seconds, which was fairly expensive because many more counters were being collected than there are hardware counters, requiring the kernel to do expensive operations to switch out which counters are being collected.

During the window when performance counters were collected, the degraded performance increased latency enough that cache shards were unable to complete their work before hitting container throttling limits, degrading latency to the point that tweet service requests would time out. As configured, after 12 consecutive failures to a single cache node, tweet service clients would mark the node as dead for 30 seconds and stop issuing requests to it, causing the node to get no traffic for 30 seconds as clients independently made the decision to mark the node as dead. This caused request rates to the backing DB to increase past the request rate quota, causing requests to get rejected at the DB, increasing the failure rate of the tweet service.
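The client-side marking behavior described above, as a toy sketch (illustrative, not Finagle's actual failure accrual implementation): each client independently counts consecutive failures per node and stops sending it traffic for a fixed window once the threshold is hit, which is how every client can abandon a node at roughly the same time and push its traffic to the backing DB.

import time

class NodeHealth:
    """Per-client, per-node view: 12 consecutive failures => dead for 30 seconds."""

    def __init__(self, failure_threshold=12, dead_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.dead_seconds = dead_seconds
        self.consecutive_failures = 0
        self.dead_until = 0.0

    def is_dead(self):
        return time.monotonic() < self.dead_until

    def record_success(self):
        self.consecutive_failures = 0

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.dead_until = time.monotonic() + self.dead_seconds
            self.consecutive_failures = 0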

Mitigations / fixes:

Thanks to Reforge - Engineering Programs and Flatirons Development for helping to make this post possible by sponsoring me at the Major Sponsor tier.

Also, thanks to Michael Leinartas, Tao L., Michael Motherwell, Jonathan Riechhold, Stephan Zuercher, Justin Blank, Jamie Brandon, John Hergenroeder, and Ben Kuhn for comments/corrections/discussion.

Appendix: Pelikan cache

Pelikan was created to address issues we saw when operating memcached and Redis at scale. This document explains some of the motivations for Pelikan. The modularity / ease of modification has allowed us to discover novel cache innovations, such as a new eviction algorithm that addresses the problems we ran into with existing eviction algorithms.

With respect to the kinds of things discussed in this post, Pelikan has had more predictable performance, better median performance, and better performance in the tail than our existing caches when we've tested it in production, which means we get better reliability and more capacity at a lower cost.


  1. That knowledge decays at a high rate isn't unique to Twitter. In fact, of all the companies I've worked at as a full-time employee, I think Twitter is the best at preserving knowledge. The chip company I worked at, Centaur, basically didn't believe in written documentation other than having comprehensive bug reports, so many kinds of knowledge became lost very quickly. Microsoft was almost as bad since, by default, documents were locked down and fairly need-to-know, so basically nobody other than perhaps a few folks with extremely broad permissions would even be able to dig through old docs to understand how things had come about.

    Google was a lot like Twitter is now in the early days, but as the company grew and fears about legal actions grew, especially after multiple embarrassing incidents when execs stated their intention to take unethical and illegal actions, things became more locked down, like Microsoft.

    [return]
  2. There's also some use of a Redis fork, but the average case performance is significantly worse and the performance in the tail is relatively worse than the average case performance. Also, it has higher operational burden at scale directly due to its design, which limits its use for us. [return]

Cocktail party ideas

2022-02-02 08:00:00

You don't have to be at a party to see this phenomenon in action, but there's a curious thing I regularly see at parties in social circles where people value intelligence and cleverness without similarly valuing on-the-ground knowledge or intellectual rigor. People often discuss the standard trendy topics (some recent ones I've observed at multiple parties are how to build a competitor to Google search and how to solve the problem of high transit construction costs) and explain why people working in the field today are doing it wrong and then explain how they would do it instead. I occasionally have good conversations that fit that pattern (with people with very deep expertise in the field who've been working on changing the field for years), but the more common pattern is that someone with cocktail-party level knowledge of a field will give their ideas on how the field can be fixed.

Asking people why they think their solutions would solve valuable problems in the field has become a hobby of mine when I'm at parties where this kind of superficial pseudo-technical discussion dominates the party. What I've found when I've asked for details is that, in areas where I have some knowledge, people generally don't know what sub-problems need to be solved to solve the problem they're trying to address, making their solution hopeless. After having done this many times, my opinion is that the root cause of this is generally that many people who have a superficial understanding of a topic assume that the topic is as complex as their understanding of it instead of realizing that only knowing a bit about a topic means that they're missing an understanding of its full complexity.

Since I often attend parties with programmers, this means I often hear programmers retelling their cocktail-party level understanding of another field (the search engine example above notwithstanding). If you want a sample of similar comments online, you can often see these when programmers discuss "trad" engineering fields. An example I enjoyed was this Twitter thread where Hillel Wayne discussed how programmers without knowledge of trad engineering often have incorrect ideas about what trad engineering is like, where many of the responses are from programmers with little to no knowledge of trad engineering who then reply to Hillel with their misconceptions. When Hillel completed his crossover project, where he interviewed people who've worked in a trad engineering field as well as in software, he got even more such comments. Even when people are warned that naive conceptions of a field are likely to be incorrect, many can't help themselves and they'll immediately reply with their opinions about a field they know basically nothing about.

Anyway, in the crossover project, Hillel compared the perceptions of people who'd actually worked in multiple fields to pop-programmer perceptions of trad engineering. One of the many examples of this that Hillel gives is when people talk about bridge building, where he notes that programmers say things like

The predictability of a true engineer’s world is an enviable thing. But ours is a world always in flux, where the laws of physics change weekly. If we did not quickly adapt to the unforeseen, the only foreseeable event would be our own destruction.

and

No one thinks about moving the starting or ending point of the bridge midway through construction.

But Hillel interviewed a civil engineer who said that they had to move a bridge! Of course, civil engineers don't move bridges as frequently as programmers deal with changes in software but, if you talk to actual, working, civil engineers, many of them frequently deal with changing requirements after a job has started, which isn't fundamentally different from what programmers have to deal with at their jobs. People who've worked in both fields or at least talk to people in the other field tend to think the concerns faced by engineers in both fields are complex, but people with a cocktail-party level of understanding of the field often claim that the field they're not in is simple, unlike their field.

A line I often hear from programmers is that programming is like "having to build a plane while it's flying", implicitly making the case that programming is harder than designing and building a plane since people who design and build planes can do so before the plane is flying1. But, of course, someone who designs airplanes could just as easily say "gosh, my job would be very easy if I could build planes with 4 9s of uptime and my plane were allowed to crash and kill all of the passengers for 1 minute every week". Of course, the constraints on different types of projects and different fields make different things hard, but people often seem to have a hard time seeing constraints other fields have that their field doesn't. One might think that understanding that their own field is more complex than an outsider might naively think would help people understand that other fields may also have hidden complexity, but that doesn't generally seem to be the case.

If we look at the rest of the statement Hillel was quoting (which is from the top & accepted answer to a stack exchange question), the author goes on to say:

It's much easier to make accurate projections when you know in advance exactly what you're being asked to project rather than making guesses and dealing with constant changes.

The vast majority of bridges are using extremely tried and true materials, architectures, and techniques. A Roman engineer could be transported two thousand years into the future and generally recognize what was going on at a modern construction site. There would be differences, of course, but you're still building arches for load balancing, you're still using many of the same materials, etc. Most software that is being built, on the other hand . . .

This is typical of the kind of error people make when they're discussing cocktail-party ideas. Programmers legitimately gripe when clueless execs who haven't been programmers for a decade request unreasonable changes to a project that's in progress, but this is not so different from (and is actually more likely to be reasonable than) what happens when politicians who've never been civil engineers require project changes on large scale civil engineering projects. It's plausible that, on average, programming projects have more frequent or larger changes than civil engineering projects, but I'd guess that the intra-field variance is at least as large as the inter-field variance.

And, of course, only someone who hasn't done serious engineering work in the physical world could say something like "The predictability of a true engineer’s world is an enviable thing. But ours is a world always in flux, where the laws of physics change weekly", thinking that the (relative) fixity of physical laws means that physical work is predictable. When I worked as a hardware engineer, a large fraction of the effort and complexity of my projects went into dealing with physical uncertainty and civil engineering is no different (if anything, the tools civil engineers have to deal with physical uncertainty on large scale projects are much worse, resulting in a larger degree of uncertainty and a reduced ability to prevent delays due to uncertainty).

If we look at how Roman engineering or even engineering from 300 years ago differs from modern engineering, a major source of differences is our much better understanding of uncertainty that comes from the physical world. It didn't used to be shocking when a structure failed not too long after being built without any kind of unusual conditions or stimulus (e.g., building collapse, or train accident due to incorrectly constructed rail). This is now rare enough that it's major news if it happens in the U.S. or Canada and this understanding also lets us build gigantic structures in areas where it would have been previously considered difficult or impossible to build moderate-sized structures.

For example, if you look at a large-scale construction project in the Vancouver area that's sitting on the delta (Delta, Richmond, much of the land going out towards Hope), it's only relatively recently that we discovered the knowledge necessary to build some large scale structures (e.g., tall-ish buildings) reliably on that kind of ground, which is one of the many parts of modern civil engineering a Roman engineer wouldn't understand. A lot of this comes from a field called geotechnical engineering, a sub-field of civil engineering (alternately, arguably its own field and also arguably a subfield of geological engineering) that involves the ground, i.e., soil mechanics, rock mechanics, geology, hydrology, and so on and so forth. One fundamental piece of geotechnical engineering is the idea that you can apply mechanics to reason about soil. The first known application of mechanics to soils, a fundamental part of geotechnical engineering, was in 1773 and geotechnical engineering as it's thought of today is generally said to have started in 1925. While Roman engineers did a lot of impressive work, the mental models they were operating with precluded understanding much of modern civil engineering.

Naturally, for this knowledge to have been able to change what we can build, it must change how we build. If we look at what a construction site on compressible Vancouver delta soils that uses this modern knowledge looks like, by wall clock time, it mostly looks like someone put a pile of sand on the construction site (preload). While a Roman engineer would know what a pile of sand is, they wouldn't know how someone figured out how much sand was needed and how long it needed to be there (in some cases, Romans would use piles or rafts where we would use preload today, but in many cases, they had no answer to the problems preload solves today).

Geotechnical engineering and the resultant pile of sand (preload) is one of tens of sub-fields where you'd need expertise when doing a modern, large scale, civil engineering project that a Roman engineer would need a fair amount of education to really understand.

Coming back to cocktail party solutions I hear, one common set of solutions is how to fix high construction costs and slow construction. There's a set of trendy ideas that people throw around about why things are so expensive, why projects took longer than projected, etc. Sometimes, these comments are similar to what I hear from practicing engineers that are involved in the projects but, more often than not, the reasons are pretty different. When the reasons are the same, it seems that they must be correct by coincidence since they don't seem to understand the body of knowledge necessary to reason through the engineering tradeoffs2.

Of course, like cocktail party theorists, civil engineers with expertise in the field also think that modern construction is wasteful, but the reasons they come up with are often quite different from what I hear at parties3. It's easy to come up with cocktail party solutions to problems by not understanding the problem, assuming the problem is artificially simple, and then coming up with a solution to the imagined problem. It's harder to understand the tradeoffs in play among the tens of interacting engineering sub-fields required to do large scale construction projects and have an actually relevant discussion of what the tradeoffs should be and how one might motivate engineers and policy makers to shift where the tradeoffs land.

A widely cited study on the general phenomenon of people having wildly oversimplified and incorrect models of how things work is this study by Rebecca Lawson on people's understanding of how bicycles work, which notes:

Recent research has suggested that people often overestimate their ability to explain how things function. Rozenblit and Keil (2002) found that people overrated their understanding of complicated phenomena. This illusion of explanatory depth was not merely due to general overconfidence; it was specific to the understanding of causally complex systems, such as artifacts (crossbows, sewing machines, microchips) and natural phenomena (tides, rainbows), relative to other knowledge domains, such as facts (names of capital cities), procedures (baking cakes), or narratives (movie plots).

And

It would be unsurprising if nonexperts had failed to explain the intricacies of how gears work or why the angle of the front forks of a bicycle is critical. Indeed, even physicists disagree about seemingly simple issues, such as why bicycles are stable (Jones, 1970; Kirshner, 1980) and how they steer (Fajans, 2000). What is striking about the present results is that so many people have virtually no knowledge of how bicycles function.​​

In "experiment 2" in the study, people were asked to draw a working bicycle and focus on the mechanisms that make the bicycle work (as opposed to making the drawing look nice) and 60 of the 94 participants had at least one gross error that caused the drawing to not even resemble a working bicycle. If we look at a large-scale real-world civil engineering project, a single relevant subfield, like geotechnical engineering, contains many orders of magnitude more complexity than a bicycle and it's pretty safe to guess that, to the nearest percent, zero percent of lay people (or Roman engineers) could roughly sketch out what the relevant moving parts are.

For a non-civil engineering example, Jamie Brandon quotes this excerpt from Jim Manzi's Uncontrolled, which is a refutation of a "clever" nugget that I've frequently heard trotted out at parties:

The paradox of choice is a widely told folktale about a single experiment in which putting more kinds of jam on a supermarket display resulted in less purchases. The given explanation is that choice is stressful and so some people, facing too many possible jams, will just bounce out entirely and go home without jam. This experiment is constantly cited in news and media, usually with descriptions like "scientists have discovered that choice is bad for you". But if you go to a large supermarket you will see approximately 12 million varieties of jam. Have they not heard of the jam experiment? Jim Manzi relates in Uncontrolled:

First, note that all of the inference is built on the purchase of a grand total of thirty-five jars of jam. Second, note that if the results of the jam experiment were valid and applicable with the kind of generality required to be relevant as the basis for economic or social policy, it would imply that many stores could eliminate 75 percent of their products and cause sales to increase by 900 percent. That would be a fairly astounding result and indicates that there may be a problem with the measurement.

... the researchers in the original experiment themselves were careful about their explicit claims of generalizability, and significant effort has been devoted to the exact question of finding conditions under which choice overload occurs consistently, but popularizers telescoped the conclusions derived from one coupon-plus-display promotion in one store on two Saturdays, up through assertions about the impact of product selection for jam for this store, to the impact of product selection for jam for all grocery stores in America, to claims about the impact of product selection for all retail products of any kind in every store, ultimately to fairly grandiose claims about the benefits of choice to society. But as we saw, testing this kind of claim in fifty experiments in different situations throws a lot of cold water on the assertion.

As a practical business example, even a simplification of the causal mechanism that comprises a useful forward prediction rule is unlikely to be much like 'Renaming QwikMart stores to FastMart will cause sales to rise,' but will instead tend to be more like 'Renaming QwikMart stores to FastMart in high-income neighborhoods on high-traffic roads will cause sales to rise, as long as the store is closed for painting for no more than two days.' It is extremely unlikely that we would know all of the possible hidden conditionals before beginning testing, and be able to design and execute one test that discovers such a condition-laden rule.

Further, these causal relationships themselves can frequently change. For example, we discover that a specific sales promotion drives a net gain in profit versus no promotion in a test, but next year when a huge number of changes occurs - our competitors have innovated with new promotions, the overall economy has deteriorated, consumer traffic has shifted somewhat from malls to strip centers, and so on - this rule no longer holds true. To extend the prior metaphor, we are finding our way through our dark room by bumping our shins into furniture, while unobserved gremlins keep moving the furniture around on us. For these reasons, it is not enough to run an experiment, find a causal relationship, and assume that it is widely applicable. We must run tests and then measure the actual predictiveness of the rules developed from these tests in actual implementation.

So far, we've discussed examples of people with no background in a field explaining how a field works or should work, but the error of taking a high-level view and incorrectly assuming that things are simple also happens when people step back and have a high-level view of their own field that's disconnected from the details. For example, back when I worked at Centaur and we'd not yet shipped a dual core chip, a nearly graduated PhD student in computer architecture from a top school asked me, "why don't you just staple two cores together to make a dual core chip like Intel and AMD? That's an easy win".

At that time, we'd already been working on going from single core to multi core for more than one year. Making a single core chip multi-core or even multi-processor capable with decent performance requires significant additional complexity to the cache and memory hierarchy, the most logically complex part of the chip. As a rough estimate, I would guess that taking a chip designed for single-core use and making it multi-processor capable at least doubles the amount of testing/verification effort required to produce a working chip (and the majority of the design effort that goes into a chip is on testing/verification).

More generally, a computer architect is only as good as their understanding of the tradeoffs their decisions impact. Great ones have a strong understanding of the underlying fields they must interact with. A common reason that a computer architect will make a bad decision is that they have a cocktail party level understanding of the fields that are one or two levels below computer architecture. An example of a bad decision that's occurred multiple times in industry is when a working computer architect decides to add SMT to a chip because it's basically a free win. You pay a few percent extra area and get perhaps 20% better performance. I know of multiple attempts to do this that completely failed for predictable reasons because the architect failed to account for the complexity and verification cost of adding SMT. Adding SMT adds much more complexity than adding a second core because the logic has to be plumbed through everything and it causes an explosion in the complexity of verifying the chip for the same reason. Intel famously added SMT to the P4 and did not enable it in the first generation it shipped in because it was too complex to verify in a single generation and had critical, showstopping, bugs. With the years of time they had to shake the bugs out on one generation of architecture, they fixed their SMT implementation and shipped it in the next generation of chips. This happened again when they migrated to the Core architecture and added SMT to that. A working computer architect should know that this happened twice to Intel, implying that verifying an SMT implementation is hard, and yet there have been multiple instances where someone had a cocktail party level of understanding of the complexity of SMT and suggested adding it to a design that did not have the verification budget to ever ship a working chip with SMT.

And, of course, this isn't really unique to computer architecture. I used the dual core example because it's one that happens to currently be top-of-mind for me, but I can think of tens of similar examples off the top of my head and I'm pretty sure I could write up a few hundred examples if I spent a few days thinking about similar examples. People working in a field still have to be very careful to avoid having an incorrect, too abstract, view of the world that elides details and draws comically wrong inferences or conclusions as a result. When people outside a field explain how things should work, their explanations are generally even worse than someone in the field who missed a critical consideration and they generally present crank ideas.

Bringing together the Roman engineering example and the CPU example, going from 1 core to 2 (and, in general, going from 1 to 2, as in 1 datacenter to 2 datacenters or a monolith to a distributed system) is something every practitioner should understand is hard, even if some don't. Somewhat relatedly, if someone showed off a 4 THz processor that had 1000x the performance of a 4 GHz processor, that's something any practitioner should recognize as alien technology that they definitely do not understand. Only a lay person with no knowledge of the field could reasonably think to themselves, "it's just a processor running at 1000x the clock speed; an engineer who can make a 4 GHz processor would basically understand how a 4 THz processor with 1000x the performance works". We are so far from being able to scale up performance by 1000x by running chips 1000x faster that doing so would require many fundamental breakthroughs in technology and, most likely, the creation of entirely new fields that contain more engineering knowledge than exists in the world today. Similarly, only a lay person could look at Roman engineering and modern civil engineering and think "Romans built things and we build things that are just bigger and more varied; a Roman engineer should be able to understand how we build things today because the things are just bigger". Geotechnical engineering alone contains more engineering knowledge than existed in all engineering fields combined in the Roman era and it's only one of the new fields that had to be invented to allow building structures like we can build today.

Of course, I don't expect random programmers to understand geotechnical engineering, but I would hope that someone who's making a comparison between programming and civil engineering would at least have some knowledge of civil engineering and not just assume that the amount of knowledge that exists in the field is roughly equal to their knowledge of the field when they know basically nothing about the field.

Although I seem to try a lot harder than most folks to avoid falling into the trap of thinking something is simple because I don't understand it, I still fall prey to this all the time and the best things I've come up with to prevent this, while better than nothing, are not reliable.

One part of this is that I've tried to cultivate noticing "the feeling of glossing over something without really understanding it". I think of this as analogous to (and perhaps it's actually the same thing as) something that's become trendy over the past twenty years, paying attention to how emotions feel in your body and understanding your emotional state by noticing feelings in your body, e.g., a certain flavor of tight feeling in a specific muscle is a sure sign that I'm angry.

There's a specific feeling I get in my body when I have a fuzzy, high-level, view of something and am mentally glossing over it. I can easily miss it if I'm not paying attention and I suspect I can also miss it when I gloss over something in a way where the non-conscious part of the brain that generates the feeling doesn't even know that I'm glossing over something. Although noticing this feeling is inherently unreliable, I think that everything else I might do that's self contained to check my own reasoning fundamentally relies on the same mechanism (e.g., if I have a checklist to try to determine if I haven't glossed over something when I'm reasoning about a topic, some part of that process will still rely on feeling or intuition). I do try to postmortem cases where I missed the feeling to figure out what happened, and that's basically how I figured out that I have a feeling associated with this error in the first place (I thought about what led up to this class of mistake in the past and noticed that I have a feeling that's generally associated with it), but that's never going to be perfect or even very good.

Another component is doing what I think of as "checking inputs into my head". When I was in high school, I noticed that a pretty large fraction of the "obviously wrong" things I said came from letting incorrect information into my head. I didn't and still don't have a good, cheap, way to tag a piece of information with how reliable it is, so I find it much easier to either fact-check or discard information on consumption.

Another thing I try to do is get feedback, which is unreliable and also intractable in the general case since the speed of getting feedback is so much slower than the speed of thought that slowing down general thought to the speed of feedback would result in having relatively few thoughts4.

Although, unlike in some areas, there's no mechanical, systematic, set of steps that can be taught that will solve the problem, I do think this is something that can be practiced and improved and there are some fields where similar skills are taught (often implicitly). For example, when discussing the prerequisites for an advanced or graduate level textbook, it's not uncommon to see a book say something like "Self contained. No prerequisites other than mathematical maturity". This is a shorthand way of saying "This book doesn't require you to know any particular mathematical knowledge that a high school student wouldn't have picked up, but you do need to have ironed out a kind of fuzzy thinking that almost every untrained person has when it comes to interpreting and understanding mathematical statements". Someone with a math degree will have a bunch of explicit knowledge in their head about things like the Cauchy-Schwarz inequality and the Bolzano-Weierstrass theorem, but the important stuff for being able to understand the book isn't the explicit knowledge, but the general way one thinks about math.

Although there isn't really a term for the equivalent of mathematical maturity in other fields, e.g., people don't generally refer to "systems design maturity" as something people look for in systems design interviews, the analogous skill exists even though it doesn't have a name. And likewise for just thinking about topics where one isn't a trained expert, like a non-civil engineer thinking about why a construction project cost what it did and took as long as it did, a sort of general maturity of thought5.

Thanks to Reforge - Engineering Programs and Flatirons Development for helping to make this post possible by sponsoring me at the Major Sponsor tier.

Also, thanks to Pam Wolf, Ben Kuhn, Yossi Kreinin, Fabian Giesen, Laurence Tratt, Danny Lynch, Justin Blank, A. Cody Schuffelen, Michael Camilleri, and Anonymous for comments/corrections/discussion.

An anonymous blog reader gave this example of their own battle with cocktail party ideas:

Your most recent post struck a chord with me (again!), as I have recently learned that I know basically nothing about making things cold, even though I've been a low-temperature physicist for nigh on 10 years, now. Although I knew the broad strokes of cooling, and roughly how a dilution refrigerator works, I didn't appreciate the sheer challenge of keeping things at milliKelvin (mK) temperatures. I am the sole physicist on my team, which otherwise consists of mechanical engineers. We have found that basically every nanowatt of dissipation at the mK level matters, as does every surface-surface contact, every material choice, and so on.

Indeed, we can say that the physics of thermal transport at mK temperatures is well understood, and we can write laws governing the heat transfer as a function of temperature in such systems. They are usually written as P = aT^n. We know that different classes of transport have different exponents, n, and those exponents are well known. Of course, as you might expect, the difference between having 'hot' qubits vs qubits at the base temperature of the dilution refrigerator (30 mK) is entirely wrapped up in the details of exactly what value of the pre-factor a happens to be in our specific systems. This parameter can be guessed, usually to within a factor of 10, sometimes to within a factor of 2. But really, to ensure that we're able to keep our qubits cold, we need to measure those pre-factors. Things like type of fastener (4-40 screw vs M4 bolt), number of fasteners, material choice (gold? copper?), and geometry all play a huge role in the actual performance of the system. Oh also, it turns out n changes wildly as you take a metal from its normal state to its superconducting state. Fun!

We have spent over a year carefully modeling our cryogenic systems, and in the process have discovered massive misconceptions held by people with 15-20 years of experience doing low-temperature measurements. We've discovered material choices and design decisions that would've been deemed insane had any actual thermal modeling been done to verify these designs.

The funny thing is, this was mostly fine if we wanted to reproduce the results of academic labs, which mostly favored simpler experiment design, but just doesn't work as we leave the academic world behind and design towards our own purposes.

P.S. Quantum computing also seems to suffer from the idea that controlling 100 qubits (IBM is at 127) is not that different from 1,000 or 1,000,000. I used to think that it was just PR bullshit and the people at these companies responsible for scaling were fully aware of how insanely difficult this would be, but after my own experience and reading your post, I'm a little worried that most of them don't truly appreciate the titanic struggle ahead for us.

This is just a long-winded way of saying that I have held cocktail party ideas about a field in which I have a PhD and am ostensibly an expert, so your post was very timely for me. I like to use your writing as a springboard to think about how to be better, which has been very difficult. It's hard to define what a good physicist is or does, but I'm sure that trying harder to identify and grapple with the limits of my own knowledge seems like a good thing to do.

For a broader and higher-level discussion of clear thinking, see Julia Galef's Scout Mindset:

WHEN YOU THINK of someone with excellent judgment, what traits come to mind? Maybe you think of things like intelligence, cleverness, courage, or patience. Those are all admirable virtues, but there’s one trait that belongs at the top of the list that is so overlooked, it doesn’t even have an official name.

So I’ve given it one. I call it scout mindset: the motivation to see things as they are, not as you wish they were.

Scout mindset is what allows you to recognize when you are wrong, to seek out your blind spots, to test your assumptions and change course. It’s what prompts you to honestly ask yourself questions like “Was I at fault in that argument?” or “Is this risk worth it?” or “How would I react if someone from the other political party did the same thing?” As the late physicist Richard Feynman once said, “The first principle is that you must not fool yourself—and you are the easiest person to fool.”

As a tool to improve thought, the book has a number of chapters that give concrete checks that one can try, which makes it more (or at least more easily) actionable than this post, which merely suggests that you figure out what it feels like when you're glossing over something. But I don't think that the ideas in the book are a substitute for this post, in that the self-checks the book suggests don't directly attack the problem discussed in this post.

In one chapter, Galef suggests leaning into confusion (e.g., if some seemingly contradictory information gives rise to a feeling of confusion), which I agree with. I would add that there are a lot of other feelings that are useful to observe that don't really have a good name. When it comes to evaluating ideas, some that I try to note, beside the already mentioned "the feeling that I'm glossing over important details", are "the feeling that a certain approach is likely to pay off if pursued", "the feeling that an approach is really fraught/dangerous", "the feeling that there's critical missing information", "the feeling that something is really wrong", along with similar feelings that don't have great names.

For a discussion of how the movie Don't Look Up promotes the idea that the world is simple and we can easily find cocktail party solutions to problems, see this post by Scott Alexander.

Also, John Salvatier notes that reality has a surprising amount of detail.


  1. Another one I commonly hear is that, unlike trad engineers, programmers do things that have never been done before [return]
  2. Discussions about construction delays similarly ignore geotechnical reasons for delays. As with the above, I'm using geotechnical as an example of a sub-field that explains many delays because it's something I happen to be familiar with, not because it's the most important thing, but it is a major cause of delays and, on many kinds of projects, the largest cause of delays.

    Going back to our example that a Roman engineer might, at best, superficially understand, the reason that we pile dirt onto the ground before building is that much of Vancouver has poor geotechnical conditions for building large structures. The ground is soft and will get unevenly squished down over time if something heavy is built on top of it. The sand is there as a weight, to pre-squish the ground.

    As described in the paragraph above, this sounds straightforward. Unfortunately, it's anything but. As it happens, I've been spending a lot of time driving around with a geophysics engineer (a field that's related to but quite distinct from geotechnical engineering). When we drive over a funny bump or dip in the road, she can generally point out the geotechnical issue or politically motivated decision to ignore the geotechnical engineer's guidance that caused the bump to come into existence. The thing I find interesting about this is that, even though the level of de-risking done for civil engineering projects is generally much higher than is done for the electrical engineering projects I've worked on, where in turn it's much higher than on any software project I've worked on, enough "bugs" still make it into "production" that you can see tens or hundreds of mistakes in a day if you drive around, are knowledgeable, and pay attention.

    Fundamentally, the issue is that humanity does not have the technology to understand the ground at anything resembling a reasonable cost for physically large projects, like major highways. One tool that we have is to image the ground with ground penetrating radar, but this results in highly underdetermined output. Another tool we have is to use something like a core drill or soil auger, which is basically digging down into the ground to see what's there. This also has inherently underdetermined output because we only get to see what's going on exactly where we drilled and the ground sometimes has large spatial variation in its composition that's not obvious from looking at it from the surface. A common example is when there's an unmapped remnant creek bed, which can easily "dodge" the locations where soil is sampled. Other tools also exist, but they, similarly, leave the engineer with an incomplete and uncertain view of the world when used under practical financial constraints.

    When I listen to cocktail party discussions of why a construction project took so long and compare it to what civil engineers tell me caused the delay, the cocktail party discussion almost always exclusively discusses reasons that civil engineers tell me are incorrect. There are many reasons for delays and "unexpected geotechnical conditions" are a common one. Civil engineers are in a bind here since drilling cores is time consuming and expensive and people get mad when they see that the ground is dug up and no "real work" is happening (and likewise when preload is applied — "why aren't they working on the highway?"), which creates pressure on politicians which indirectly results in timelines that don't allow sufficient time to understand geotechnical conditions. This sometimes results in a geotechnical surprise during a project (typically phrased as "unforeseen geotechnical conditions" in technical reports), which can result in major parts of a project having to switch to slower and more expensive techniques or, even worse, can necessitate a part of a project being redone, resulting in cost and schedule overruns.

    I've never heard a cocktail party discussion that discusses geotechnical reasons for project delays. Instead, people talk about high-level reasons that sound plausible to a lay person but are completely fabricated and disconnected from reality. But if you want to discuss how things can be built more quickly and cheaply, "progress studies", etc., this cannot be reasonably done without having some understanding of the geotechnical tradeoffs that are in play (as well as the tradeoffs from other civil engineering fields we haven't discussed).

    [return]
  3. One thing we could do to keep costs under control is to do less geotechnical work and ignore geotechnical surprises up to some risk bound. Today, some of the "amount of work" done is determined by regulations and much of it is determined by case law, which gives a rough idea of what work needs to be done to avoid legal liability in case of various bad outcomes, such as a building collapse.

    If, instead of using case law and risk of liability to determine how much geotechnical derisking should be done, we compute this based on QALYs per dollar, at the margin, we seem to spend a very large amount of money on geotechnical derisking compared to many other interventions.

    This is not just true of geotechnical work and is also true of other fields in civil engineering, e.g., builders in places like the U.S. and Canada do much more slump testing than is done in some countries that have a much faster pace of construction, which reduces the risk of a building's untimely demise. It would be both scandalous and a serious liability problem if a building collapsed because the builders of the building didn't do slump testing when they would've in the U.S. or Canada, but buildings usually don't collapse even when builders don't do as much slump testing as tends to be done in the U.S. and Canada.

    Countries that don't build to standards roughly as rigorous as U.S. or Canadian standards sometimes have fairly recently built structures collapse in ways that would be considered shocking in the U.S. and Canada, but the number of lives saved per dollar is very small compared to other places the money could be spent. Whether or not we should change this with a policy decision is a more relevant discussion to building costs and timelines than the fabricated reasons I hear in cocktail party discussions of construction costs, but I've never heard this or other concrete reasons for project cost brought up outside of civil engineering circles.

    Even if we just confine ourselves to work that's related to civil engineering, as opposed to taking a broader, more EA-minded approach and looking at QALYs for all possible interventions, when it comes to the tradeoff between resources spent on derisking during construction vs. resources spent on derisking on an ongoing basis (inspections, maintenance, etc.), the relative resource levels weren't determined by a process that should be expected to produce anywhere near an optimal outcome.

    [return]
  4. Some people suggest that writing is a good intermediate step that's quicker than getting external feedback while being more reliable than just thinking about something, but I find writing too slow to be usable as a way to clarify ideas and, after working on identifying when I'm having fuzzy thoughts, I find that trying to think through an idea to be more reliable as well as faster. [return]
  5. One part of this that I think is underrated by people who have a self-image of "being smart" is where book learning and thinking about something is sufficient vs. where on-the-ground knowledge of the topic is necessary.

    A fast reader can read the texts one reads for most technical degrees in maybe 40-100 hours. For a slow reader, that could take much longer, but it's still not really that much time. There are some aspects of problems where this is sufficient to understand the problem and come up with good, reasonable, solutions. And there are some aspects of problems where this is woefully insufficient and thousands of hours of applied effort are required to really be able to properly understand what's going on.

    [return]

The container throttling problem

2021-12-18 08:00:00

This is an excerpt from an internal document David Mackey and I co-authored in April 2019. The document is excerpted since much of the original doc was about comparing possible approaches to increasing efficiency at Twitter, which is mostly information that's meaningless outside of Twitter without a large amount of additional explanation/context.

At Twitter, most CPU bound services start falling over at around 50% reserved container CPU utilization and almost all services start falling over at not much more CPU utilization even though CPU bound services should, theoretically, be able to get higher CPU utilizations. Because load isn't, in general, evenly balanced across shards and the shard-level degradation in performance is so severe when we exceed 50% CPU utilization, this makes the practical limit much lower than 50% even during peak load events.

This document will describe potential solutions to this problem. We'll start with describing why we should expect this problem given how services are configured and how the Linux scheduler we're using works. We'll then look into case studies on how we can fix this with config tuning for specific services, which can result in a 1.5x to 2x increase in capacity, which can translate into $[redacted]M/yr to $[redacted]M/yr in savings for large services. While this is worth doing and we might get back $[redacted]M/yr to $[redacted]M/yr in TCO by doing this for large services, manually fixing services one at a time isn't really scalable, so we'll also look at how we can make changes that can recapture some of the value for most services.

The problem, in theory

Almost all services at Twitter run on Linux with the CFS scheduler, using CFS bandwidth control quota for isolation, with default parameters. The intention is to allow different services to be colocated on the same boxes without having one service's runaway CPU usage impact other services and to prevent services on empty boxes from taking all of the CPU on the box, resulting in unpredictable performance, which service owners found difficult to reason about before we enabled quotas. The quota mechanism limits the amortized CPU usage of each container, but it doesn't limit how many cores the job can use at any given moment. Instead, if a job "wants to" use more cores than its quota allows during a quota timeslice, it will use those extra cores for a short period of time and then get throttled, i.e., basically get put to sleep, in order to keep its amortized core usage below the quota, which is disastrous for tail latency1.
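
To make the mechanics concrete, here's a minimal sketch of the quota arithmetic. The numbers are illustrative assumptions (a 20 core reservation and the default 100ms period), not measurements from any real service:

    # Illustrative numbers only: a 20 core quota and the default 100ms CFS period.
    PERIOD_MS = 100
    QUOTA_CORES = 20
    QUOTA_MS = QUOTA_CORES * PERIOD_MS  # 2000ms of CPU time available per 100ms period

    def throttled_ms_per_period(runnable_threads):
        """If this many threads run flat out, how long is the container asleep each period?"""
        if runnable_threads <= QUOTA_CORES:
            return 0.0  # never exhausts the quota, never throttled
        exhausted_at = QUOTA_MS / runnable_threads  # wall-clock ms until the quota is gone
        return PERIOD_MS - exhausted_at             # throttled for the rest of the period

    # 64 runnable threads against a 20 core quota: the quota is gone after ~31ms
    # and the entire container sleeps for the remaining ~69ms of the period,
    # which is where the tail latency spikes come from.
    print(throttled_ms_per_period(64))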

Since the vast majority of services at Twitter use thread pools that are much larger than their mesos core reservation, when jobs have heavy load, they end up requesting and then using more cores than their reservation and then throttling. This causes services that are provisioned based on load test numbers or observed latency under load to over provision CPU to avoid violating their SLOs. They either have to ask for more CPUs per shard than they actually need or they have to increase the number of shards they use.
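
As a sketch of what sizing to the reservation rather than to the host could look like, here's a snippet that reads the container's CFS quota instead of the host's logical core count. The cgroup v1 paths below are the common defaults and are an assumption about the environment (cgroup v2 exposes the same information via cpu.max):

    import os

    def container_cpu_limit():
        """Best-effort read of the container's CPU allocation, in cores.
        Falls back to the host's logical core count if no quota is set."""
        try:
            with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
                quota_us = int(f.read())
            with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
                period_us = int(f.read())
            if quota_us > 0:
                return quota_us / period_us
        except OSError:
            pass
        return os.cpu_count()

    # On a 96 logical core host with a 20 core reservation, "2x logical cores"
    # gives a 192 thread pool; sizing from the quota gives 20.
    pool_size = max(1, int(container_cpu_limit()))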

An old example of this problem was the JVM Garbage Collector. Prior to work on the JVM to make the JVM container aware, each JVM would default the GC parallel thread pool size to the number of cores on the machine. During a GC, all these GC threads would run simultaneously, exhausting the CPU quota rapidly, causing throttling. The resulting effect would be that a subsecond stop-the-world GC pause could take many seconds of wallclock time to complete. While the GC issue has been fixed, the issue still exists at the application level for virtually all services that run on mesos.

The problem, in practice [case study]

As a case study, let's look at service-1, the largest and most expensive service at Twitter.

Below is the CPU utilization histogram for this service just as it starts failing its load test, i.e., when it's just above the peak load the service can handle before it violates its SLO. The x-axis is the number of CPUs used at a given point in time and the y-axis is (relative) time spent at that utilization. The service is provisioned for 20 cores and we can see that the utilization is mostly significantly under that, even when running at nearly peak possible load:

Histogram for service with 20 CPU quota showing that average utilization is much lower but peak utilization is significantly higher when the service is overloaded and violates its SLO

The problem is the little bars above 20. These spikes caused the job to use up its CPU quota and then get throttled, which caused latency to drastically increase, which is why the SLO was violated even though average utilization is about 8 cores, or 40% of quota. One thing to note is that the sampling period for this graph was 10ms and the quota period is 100ms, so it's technically possible to see an excursion above 20 in this graph without throttling, but on average, if we see a lot of excursions, especially way above 20, we'll likely get throttling.

After reducing the thread pool sizes to avoid using too many cores and then throttling, we got the following CPU utilization histogram under a load test:

Histogram for the same service with a 20 CPU quota after reducing thread pool sizes, under a load test

This is at 1.6x the load (request rate) of the previous histogram. In that case, the load test harness was unable to increase load enough to determine peak load for service-1 because the service was able to handle so much load before failure that the service that's feeding it during the load test couldn't keep up and send more load (although that's fixable, I didn't have the proper permissions to quickly fix it). [later testing showed that the service was able to handle about 2x the capacity after tweaking the thread pool sizes]

This case study isn't an isolated example — Andy Wilcox has looked at the same thing for service-2 and found similar gains in performance under load for similar reasons.

For services that are concerned about latency, we can take the gains as reduced latency instead of reduced cost. For service-1, if we leave the provisioned capacity the same instead of cutting it by 2x, we see a 20% reduction in latency.

The gains for doing this for individual large services are significant (in the case of service-1, it's [mid 7 figures per year] for the service and [low 8 figures per year] including services that are clones of it), but tuning every service by hand isn't scalable. That raises the question: how many services are impacted?

Thread usage across the fleet

If we look at the number of active threads vs. number of reserved cores for moderate sized services (>= 100 shards), we see that almost all services have many more threads that want to execute than reserved cores. It's not uncommon to see tens of runnable threads per reserved core. This makes the service-1 example, above, look relatively tame, at 1.5 to 2 runnable threads per reserved core under load.

If we look at where these threads are coming from, it's common to see that a program has multiple thread pools where each thread pool is sized to either twice the number of reserved cores or twice the number of logical cores on the host machine. Both inside and outside of Twitter, it's common to see advice that thread pool size should be 2x the number of logical cores on the machine. This advice probably comes from a workload like picking how many threads to use for something like a gcc compile, where we don't want to have idle resources when we could have something to do. Since threads will sometimes get blocked and have nothing to do, going to 2x can increase throughput over 1x by decreasing the odds that any core is ever idle, and 2x is a nice, round, number.

However, there are a few problems with applying this to Twitter applications:

  1. Most applications have multiple, competing, thread pools
  2. Exceeding the reserved core limit is extremely bad
  3. Having extra threads working on computations can increase latency

The "we should provision 2x the number of logical cores" model assumes that we have only one main thread pool doing all of the work and that there's little to no downside to having threads that could do work sit and do nothing and that we have a throughput oriented workload where we don't care about the deadline of any particular unit of work.

With the CFS scheduler, threads that have active work that are above the core reservation won't do nothing, they'll get scheduled and run, but this will cause throttling, which negatively impacts tail latency.

Potential Solutions

Given that we see something that looks similar to our case study on many services and that it's difficult to push performance fixes to a lot of services (because service owners aren't really incentivized to take performance improvements), what can we do to address this problem across the fleet and not just on a few handpicked large services? We're going to look at a list of potential solutions and then discuss each one in more detail, below.

Better defaults for cross-fleet threadpools

Potential impact: some small gains in efficiency
Advantages: much less work than any comprehensive solution, can be done in parallel with more comprehensive solutions and will still yield some benefit (due to reduced lock contention and context switches) if other solutions are in place.
Downsides: doesn't solve most of the problem.

Many defaults are too large. Netty default threadpool size is 2x the reserved cores. In some parts of [an org], they use a library that spins up eventbus and allocates a threadpool that's 2x the number of logical cores on the host (resulting in [over 100] eventbus threads) when 1-2 threads is sufficient for most of their eventbus use cases.

Adjusting these default sizes won't fix the problem, but it will reduce the impact of the problem and this should be much less work than the solutions below, so this can be done while we work on a more comprehensive solution.

Negotiating ThreadPool sizes via a shared library (API)

[this section was written by Vladimir Kostyukov]

Potential impact: can mostly mitigate the problem for most services.
Advantages: quite straightforward to design and implement; possible to make it first-class in Finagle/Finatra.
Downsides: Requires service-owners to opt-in explicitly (adopt a new API for constructing thread-pools).

CSL’s util library has a package that bridges in some integration points between an application and a JVM (util-jvm), which could be a good place to host a new API for negotiating the sizes of the thread pools required by the application.

The look and feel of such an API is effectively dictated by how granular the negotiation needs to be. Simply negotiating a total number of allowed threads allocated per process, while being easy to implement, doesn’t allow distinguishing between application and IO threads. Introducing a notion of QoS for threads in the thread pool (i.e., “IO thread; cannot block”, “App thread; can block”), on the other hand, could make the negotiation fine grained.

CFS Period Tuning

Potential impact: small reduction in tail latencies by shrinking the length of the time period before the process group’s CFS runtime quota is refreshed.
Advantages: relatively straightforward change requiring minimal code changes.
Downsides: comes at increased scheduler overhead costs that may offset the benefits and does not address the core issue of parallelism exhausting quota. May result in more total throttling.

To limit CPU usage, CFS operates over a time window known as the CFS period. Processes in a scheduling group take time from the CFS quota assigned to the cgroup and this quota is consumed over the cfs_period_us in CFS bandwidth slices. By shrinking the CFS period, the worst case time between quota exhaustion causing throttling and the process group being able to run again is reduced proportionately. Taking the default values of a CFS bandwidth slice of 5ms and CFS period of 100ms, in the worst case, a highly parallel application could exhaust all of its quota in the first bandwidth slice leaving 95ms of throttled time before any thread could be scheduled again.
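
Mechanically, shrinking the period means rewriting the cgroup's cpu.cfs_period_us while scaling cpu.cfs_quota_us so the allocation (quota/period) stays the same. The sketch below assumes a cgroup v1 layout and uses a hypothetical container path:

    CGROUP = "/sys/fs/cgroup/cpu/mesos/some-container"  # hypothetical path

    def set_cfs_period(new_period_us):
        """Change the CFS period while keeping the core allocation the same.
        Assumes a quota is actually set (cpu.cfs_quota_us != -1)."""
        with open(f"{CGROUP}/cpu.cfs_period_us") as f:
            old_period_us = int(f.read())
        with open(f"{CGROUP}/cpu.cfs_quota_us") as f:
            old_quota_us = int(f.read())
        cores = old_quota_us / old_period_us
        with open(f"{CGROUP}/cpu.cfs_period_us", "w") as f:
            f.write(str(new_period_us))
        with open(f"{CGROUP}/cpu.cfs_quota_us", "w") as f:
            f.write(str(int(cores * new_period_us)))

    # With the default 5ms bandwidth slice, going from a 100ms period to a 20ms
    # period shrinks the worst-case throttled window from ~95ms to ~15ms, at the
    # cost of more frequent quota refills (more scheduler overhead).
    # set_cfs_period(20_000)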

It's possible that total throttling would increase because the scheduled time over 100ms might not exceed the threshold even though there are (for example) 5ms bursts that exceed the threshold.

CFS Bandwidth Slice Tuning

Potential impact: small reduction in tail latencies by allowing applications to make better use of the allocated quota.
Advantages: relatively straightforward change requiring minimal code changes.
Downsides: comes at increased scheduler overhead costs that may offset the benefits and does not address the core issue of parallelism exhausting quota.

When CFS goes to schedule a process it will transfer run-time between a global pool and a CPU-local pool to reduce global accounting pressure on large systems. The amount transferred each time is called the "slice". A larger bandwidth slice is more efficient from the scheduler’s perspective but a smaller bandwidth slice allows for more fine grained execution. In debugging issues in [link to internal JIRA ticket], it was determined that if a scheduled process fails to consume its entire bandwidth slice (the default slice size is 5ms) because it has completed execution or blocked on another process, this time is lost to the process group, reducing its ability to consume all available resources it has requested.

The overhead of tuning this value is expected to be minimal, but should be measured. Additionally, it is likely not a one size fits all tunable, but exposing this to the user as a tunable has been rejected in the past in Mesos. Determining a heuristic for tuning this value and providing a per application way to set it may prove infeasible.

Other Scheduler Tunings

Potential Impact: small reduction in tail latencies and reduced throttling.
Advantages: relatively straightforward change requiring minimal code changes.
Downsides: comes at potentially increased scheduler overhead costs that may offset the benefits and does not address the core issue of parallelism exhausting quota.

The kernel has numerous auto-scaling and auto-grouping features whose impact on scheduling performance and throttling is currently unknown. kernel.sched_tunable_scaling can adjust kernel.sched_latency_ns underneath our understanding of its value. kernel.sched_min_granularity_ns and kernel.sched_wakeup_granularity_ns can be tuned to allow for preempting sooner, allowing better resource sharing and minimizing delays. kernel.sched_autogroup_enabled may currently not respect kernel.sched_latency_ns, leading to more throttling challenges and scheduling inefficiencies. These tunables have not been investigated significantly and the impact of tuning them is unknown.
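
For reference, a quick way to see what a host is currently running with, assuming a kernel that exposes these tunables under /proc/sys/kernel/ (newer kernels move some of them elsewhere):

    TUNABLES = [
        "sched_tunable_scaling",
        "sched_latency_ns",
        "sched_min_granularity_ns",
        "sched_wakeup_granularity_ns",
        "sched_autogroup_enabled",
    ]

    for name in TUNABLES:
        try:
            with open(f"/proc/sys/kernel/{name}") as f:
                print(f"kernel.{name} = {f.read().strip()}")
        except FileNotFoundError:
            print(f"kernel.{name} is not exposed on this kernel")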

CFS Scheduler Improvements

Potential impact: better overall cpu resource utilization and minimized throttling due to CFS inefficiencies.
Advantages: improvements are transparent to userspace.
Downsides: the CFS scheduler is complex so there is a large risk to the success of the changes and upstream reception to certain types of modifications may be challenging.

How the CFS scheduler deals with unused slack time from the CFS bandwidth slice has been shown to be ineffective. The kernel team has a patch (https://lore.kernel.org/patchwork/patch/907450/) that returns this unused time to the global pool for other processes to use, ensuring better overall system resource utilization. There are some additional avenues to explore that could provide further enhancements. Another of many recent discussions in this area, which fell out of a k8s throttling issue (https://github.com/kubernetes/kubernetes/issues/67577), is https://lkml.org/lkml/2019/3/18/706.

Additionally, CFS may lose efficiency due to bugs such as [link to internal JIRA ticket] and http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf. However, we haven't spent much time looking at the CFS performance for Twitter’s particular use cases. A closer look at CFS may find ways to improve efficiency.

Another change which has more upside and downside potential would be to use a scheduler other than CFS.

CPU Pinning and Isolation

Potential impact: removes the concept of throttling from the system by making the application developer’s mental model of a CPU map to a physical one.
Advantages: simplified understanding from application developer’s perspective, scheduler imposed throttling is no longer a concept an application contends with, improved cache efficiency, much less resource interference resulting in more deterministic performance.
Disadvantages: greater operational complexity, oversubscription is much more complicated, significant changes to current operating environment

The fundamental issue that allows throttling to occur is that a heavily threaded application can have more threads executing in parallel than the “number of CPUs” it requested resulting in an early exhaustion of available runtime. By restricting the number of threads executing simultaneously to the number of CPUs an application requested there is now a 1:1 mapping and an application’s process group is free to consume the logical CPU thread unimpeded by the scheduler. Additionally, by dedicating a CPU thread rather than a bandwidth slice to the application, the application is now able to take full advantage of CPU caching benefits without having to contend with other applications being scheduled on the same CPU thread while it is throttled or context switched away.
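
In production this is done with the cpuset cgroup (which is what the k8s CPU Manager, discussed below, drives), but the core idea can be illustrated at the process level; the 20 core reservation and the choice of which cores to pin to are assumptions for the example:

    import os

    RESERVED_CORES = 20  # hypothetical reservation for this container

    # Pin the current process to exactly as many CPUs as it reserved. A real
    # implementation would pick cores with NUMA/SMT topology in mind and would
    # use the cpuset cgroup so the constraint applies to the whole container.
    available = sorted(os.sched_getaffinity(0))
    pinned = set(available[:RESERVED_CORES])
    os.sched_setaffinity(0, pinned)

    # With a 1:1 mapping between reserved cores and usable CPUs, the process can
    # never have more threads executing at once than it paid for, so quota-based
    # throttling never needs to kick in.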

In Mesos, implementing CPU pinning has proven to be quite difficult. However, in k8s there is existing hope in the form of a project from Intel known as the k8s CPU Manager. The CPU Manager was added as an alpha feature to k8s in 1.8 and has been enabled as a beta feature since 1.10. It has somewhat stalled in beta as few people seem to be using it but the core functionality is present. The performance improvements promoted by the CPU Manager project are significant as shown in examples such as https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/ and https://builders.intel.com/docs/networkbuilders/cpu-pin-and-isolation-in-kubernetes-app-note.pdf. While these benchmarks should be looked at with some skepticism, they do provide promising hope for exploring this avenue. A cursory inspection of the project highlights a few areas where work may still be needed but it is already in a usable state for validating the approach. Underneath, the k8s CPU Manager leverages the cpuset cgroup functionality that is present in the kernel.

Potentially, this approach does reduce the ability to oversubscribe the machines. However, the efficiency gains from minimized cross-pod interference, CPU throttling, a more deterministic execution profile and more may offset the need to oversubscribe. Currently, the k8s CPU Manager does allow for minor oversubscription in the form of allowing system level containers and the daemonset to be oversubscribed, but on a pod scheduling basis the cpus are reserved for that pod’s use.

Experiments by Brian Martin and others have shown significant performance benefits from CPU pinning that are almost as large as our oversubscription factor.

Longer term, oversubscription could be possible through a multitiered approach wherein a primary class of pods is scheduled using CPU pinning but a secondary class of pods that is not as latency sensitive is allowed to float across all cores, consuming slack resources from the primary pods. The work on the CPU Manager side would be extensive. However, recently Facebook has been doing some work on the kernel scheduler side to further enable this concept in a way that minimally impacts the primary pod class, which we can expand upon or evolve.

Oversubscription at the cluster scheduler level

Potential impact: can bring machine utilization up to an arbitrarily high level while still letting services overprovision "enough".
Advantages: oversubscription at the cluster scheduler level is independent of the problem described in this doc; doing it in a data-driven way can drive machine utilization up without having to try to fix the specific problems described here. This could simultaneously fix the problem in this doc (low CPU utilization due to overprovisioning to avoid throttling) while also fixing [reference to document describing another problem].
Disadvantages: we saw in [link to internal doc] that shards of services running on hosts with high load have degraded performance. Unless we change the mesos scheduler to schedule based on actual utilization (as opposed to reservation), some hosts would end up too highly loaded and services with shards that land on those hosts would have poor performance.

Disable CFS quotas

Potential impact: prevents throttling and allows services to use all available cores on a box by relying on the "shares" mechanism instead of quota.
Advantages: in some sense, can give us the highest possible utilization.
Disadvantages: badly behaved services could severely interfere with other services running on the same box. Also, service owners would have a much more difficult time predicting the performance of their own service since performance variability between the unloaded and loaded state would be much larger.

This solution is what was used before we enabled quotas. From a naive hardware utilization standpoint, relying on the shares mechanism seems optimal since this means that, if the box is underutilized, services can take unused cores, but if the box becomes highly utilized, services will fall back to taking their share of cores, proportional to their core reservation. However, when we used this system, most service owners found it too difficult to estimate performance under load for this to be practical. At least one company has tried this solution to fix their throttling problem and has had severe incidents under load because of it. If we switched back to this today, we'd be no better off than we were before we enabled quotas.

Given how we allocate capacity, two ingredients that would make this work better than it did before include having a more carefully controlled request rate to individual shards and a load testing setup that allowed service owners to understand what things would really look like during a load spike, as opposed to our system, which only allows injection of unrealistic load to individual shards, which both has the problem that the request mix isn't the same as it is under a real load spike and that the shard with injected load isn't seeing elevated load from other services running on the same box. Per [another internal document], we know that one of the largest factors impacting shard-level performance is overall load on the box and that the impact on latency is non-linear and difficult to predict, so there's not really a good way to predict performance under actual load from performance under load tests with the load testing framework we have today.

Although these missing ingredients are important, high impact, issues, addressing either of them is beyond the scope of this doc; [Team X] owns load testing and is working on it, so it might be worth revisiting this when the problem is solved.

An intermediate solution would be to set the scheduler quota to a larger value than the number of reserved cores in mesos, which would bound the impact of having "too much" CPU available causing unpredictable performance while potentially reducing throttling when under high load because the scheduler will effectively fall back to the shares mechanism if the box is highly loaded. For example, if the cgroup quota was twice the mesos quota, services that fall over at 50% of reserved mesos CPU usage would then instead fall over at 100% of reserved mesos CPU usage. For boxes at high load, the higher overall utilization would reduce throttling because the increased load from other services would mean that a service that has too many runnable threads wouldn't be able to have as many of those threads execute. This has a weaker version of the downside of disabling quotas, in that, from [internal doc], we know that load on a box from other services is one of the largest factors in shard-level performance variance and this would, if we don't change how many mesos cores are reserved on a box, increase load on boxes. And if we do proportionately decrease the number of mesos reserved cores on a box, that makes the change pointless in that it's equivalent to just doubling every service's CPU reservation, except that having it "secretly" doubled would probably reduce the number of people who ask the question, "Why can't I exceed X% CPU in load testing without the service falling over?"

Results

This section was not in the original document from April 2019; it was written in December 2021 and describes work that happened as a result of the original document.

The suggestion of changing default thread pool sizes was taken and resulted in minor improvements. More importantly, two major efforts came out of the document. Vladimir Kostyukov (from the CSL team) and Flavio Brasil (from the JVM team) created the Finagle Offload Filter and Xi Yang (my intern2 at the time and now a full-time employee for my team) created a kernel patch which eliminates container throttling (the patch is still internal, but will hopefully eventually be upstreamed).

Almost all applications that run on mesos at Twitter run on top of Finagle. The Finagle Offload Filter makes it trivial for service owners to put application work onto a different thread pool than IO (which was often not previously happening). In combination with sizing thread pools properly, this resulted in, ceteris paribus, applications having drastically reduced latency, enabling them to reduce their provisioned capacity and therefore their cost while meeting their SLO. Depending on the service, this resulted in a 15% to 60% cost reduction for the service.
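
The Offload Filter itself is Finagle/Scala; the sketch below isn't that API, just the shape of the idea in Python terms: IO stays on the event loop and application work is pushed onto a separate pool sized to the container's reservation (the 20 worker figure and the handler are made up for illustration):

    import asyncio
    from concurrent.futures import ThreadPoolExecutor

    QUOTA_CORES = 20  # stand-in for the container's actual reservation
    app_pool = ThreadPoolExecutor(max_workers=QUOTA_CORES)

    def application_work(payload: bytes) -> bytes:
        # Stand-in for the CPU-heavy part of handling a request.
        return str(sum(payload)).encode()

    async def handle(reader, writer):
        payload = await reader.read(4096)  # IO stays on the event loop
        loop = asyncio.get_running_loop()
        # Application work is offloaded so it can't starve the IO side.
        result = await loop.run_in_executor(app_pool, application_work, payload)
        writer.write(result)
        await writer.drain()
        writer.close()

    async def main():
        server = await asyncio.start_server(handle, "127.0.0.1", 8080)
        async with server:
            await server.serve_forever()

    if __name__ == "__main__":
        asyncio.run(main())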

The kernel patch implements the obvious idea of preventing containers from using more cores than a container's quota at every moment instead of allowing a container to use as many cores as are available on the machine and then putting the container to sleep if it uses too many cores to bring its amortized core usage down.

In experiments on hosts running major services at Twitter, this has the expected impact of eliminating issues related to throttling, giving a roughly 50% cost reduction for a typical service with untuned thread pool sizes. And it turns out the net impact is larger than we realized when we wrote this document due to the reduction in interference caused by preventing services from using "too many" cores and then throttling3. Also, although this was realized at the time, we didn't note in the document that the throttling issue causes shards to go from "basically totally fine" to a "throttling death spiral" that's analogous to a "GC death spiral" with only a small amount of additional load, which increases the difficulty of operating our systems reliably. What happens is that, when a service is under high load, it will throttle. Throttling doesn't prevent requests from coming into the shard that's throttled, so when the shard wakes up from being throttled, it has even more work to do than it had before it throttled, causing it to use even more CPU and throttle more quickly, which causes even more work to pile up. Finagle has a mechanism that can shed load for shards that are in very bad shape (clients that talk to the dead server will mark the server as dead and stop sending requests for a while), but shards tend to get into this bad state when overall load to the service is high, so marking a node as dead just means that more load goes to other shards, which will then "want to" enter a throttling death spiral. Operating in a regime where throttling can cause a death spiral is an inherently metastable state. Removing both of these issues is arguably as large an impact as the cost reduction we see from eliminating throttling.

Xi Yang has experimented with variations on the naive kernel scheduler change mentioned above, but even the naive change seems to be quite effective compared to no change, even though the naive change does mean that services will often not be able to hit their full CPU allocation when they ask for it, e.g., if a service requests no CPU for the first half of a period and then requests infinite CPU for the second half of the period, under the old system, it would get its allocated amount of CPU for the period, but under the new system, it would only get half. Some of Xi's variant patches address this issue in one way or another, but that has a relatively small impact compared to preventing throttling in the first place.

An independent change Pratik Tandel drove that reduced the impact of throttling on services by reducing the impact of variance between shards was to move to fewer larger shards. The main goal for that change was to reduce overhead due to duplicate work/memory that happens across all shards, but it also happens to have an impact due to larger per-shard quotas reducing the impact of random noise. Overall, this resulted in 0% to 20% reduced CPU usage and 10% to 40% reduced memory usage of large services at Twitter, depending on the service.

Thanks to Xi Yang, Ilya Pronin, Ian Downes, Rebecca Isaacs, Brian Martin, Vladimir Kotsyukov, Moses Nakamura, Flavio Brasil, Laurence Tratt, Akshay Shah, Julian Squires, Michael Greenberg @synrotek, and Miguel Angel Corral for comments/corrections/discussion


  1. if this box is highly loaded, because there aren't enough cores to go around, then a container may not get all of the cores it requests, but this doesn't change the fundamental problem. [return]
  2. I often joke that interns get all of the most interesting work, while us full-time employees are stuck with the stuff interns don't want to do. [return]
  3. In an independent effort, Matt Tejo found that, for a fixed average core utilization, services that throttle cause a much larger negative impact on other services on the same host than services that use a constant number of cores. That's because a service that's highly loaded and throttling toggles between attempting to use all of the cores on the box and then using none of the cores on the box, causing an extremely large amount of interference during the periods where it's attempting to use all of the cores on the box. [return]

Some thoughts on writing

2021-12-13 08:00:00

I see a lot of essays framed as writing advice which are actually thinly veiled descriptions of how someone writes that basically say "you should write how I write", e.g., people who write short posts say that you should write short posts. As with technical topics, I think a lot of different things can work and what's really important is that you find a style that's suitable to you and the context you operate in. Copying what's worked for someone else is unlikely to work for you, making "write how I write" bad advice.

We'll start by looking at how much variety there's been in what's worked1 for people, come back to what makes it so hard to copy someone else's style, and then discuss what I try to do in my writing.

If I look at the most read programming blogs in my extended social circles2 from 2000 to 20173, it's been Joel Spolsky, Paul Graham, Steve Yegge, and Julia Evans (if you're not familiar with these writers, see the appendix for excerpts that I think are representative of their styles). Everyone on this list has a different style in the following dimensions (as well as others):

To pick a simple one to quantify, length, Julia Evans and I both started blogging in 2013 (she has one post from 2012, but she's told me that she considers her blog to have started in earnest when she was at RC, in September 2013, the same month I started blogging). Over the years, we've compared notes a number of times and, until I paused blogging at the end of 2017, we had a similar word count on our blogs even though she was writing roughly one order of magnitude more posts than I was.

To look at a few aspects that are difficult to quantify, consider this passage from Paul Graham, which is typical of his style:

What nerds like is the kind of town where people walk around smiling. This excludes LA, where no one walks at all, and also New York, where people walk, but not smiling. When I was in grad school in Boston, a friend came to visit from New York. On the subway back from the airport she asked "Why is everyone smiling?" I looked and they weren't smiling. They just looked like they were compared to the facial expressions she was used to.

If you've lived in New York, you know where these facial expressions come from. It's the kind of place where your mind may be excited, but your body knows it's having a bad time. People don't so much enjoy living there as endure it for the sake of the excitement. And if you like certain kinds of excitement, New York is incomparable. It's a hub of glamour, a magnet for all the shorter half-life isotopes of style and fame.

Nerds don't care about glamour, so to them the appeal of New York is a mystery.

It uses multiple aspects of what's sometimes called classic style. In this post, when I say "classic style", I mean the term as it's used by Thomas & Turner, not a colloquial meaning. What that means is really too long to reasonably describe in this post, but I'll say that one part of it is that the prose is clean, straightforward, and simple; an editor whose slogan is "omit needless words" wouldn't have many comments. Another part is that the clean-ness of the style goes past the prose to what information is presented, so much so that supporting evidence isn't really presented. Thomas & Turner say "truth needs no argument but only accurate presentation". An example that exemplifies both of these is this passage from Rochefoucauld:

Madame de Chevreuse had sparkling intelligence, ambition, and beauty in plenty; she was flirtatious, lively, bold, enterprising; she used all her charms to push her projects to success, and she almost always brought disaster to those she encountered on her way.

Thomas & Turner said this about Rochefoucauld's passage:

This passage displays truth according to an order that has nothing to do with the process by which the writer came to know it. The writer takes the pose of full knowledge. This pose implies that the writer has wide and textured experience; otherwise he would not be able to make such an observation. But none of that personal history, personal experience, or personal psychology enters into the expression. Instead the sentence crystallizes the writer’s experience into a timeless and absolute sequence, as if it were a geometric proof.

Much of this applies to the passage by Paul Graham (though not all, since he tells us an anecdote about a time a friend visited Boston from New York and he explicitly says that you would know such and such "if you've lived in New York" instead of just stating what you would know).

My style is opposite in many ways. I often have long, meandering, sentences, not for any particular literary purpose, but just because it reflects how I think. Strunk & White would have a field day with my writing. To the extent feasible, I try to have a structured argument and, when possible, evidence, with caveats for cases where the evidence isn't applicable. Although not presenting evidence makes something read cleanly, that's not my choice because I don't like that the reader basically has to take or leave it with respect to bare assertions, such as "what nerds like is the kind of town where people walk around smiling" and would prefer if readers know why I think something so they can agree or disagree based on the underlying reasons.

With length, style, and the other dimensions mentioned, there isn't a right way and a wrong way. A wide variety of things can work decently well. Though, if popularity is the goal, then I've probably made a sub-optimal choice on length compared to Julia and on prose style when compared to Paul. If I look at what causes other people to gain a following, and what causes my RSS to get more traffic, for me to get more Twitter followers, etc., publishing short posts frequently looks more effective than publishing long posts less frequently.

I'm less certain about the impact of style on popularity, but my feeling is that, for the same reason that making a lot of confident statements at a job works (gets people promoted), writing confident, unqualified, statements works (gets people readers). People like confidence.

But, in both of these cases, one can still be plenty popular while making a sub-optimal choice and, for me, I view optimizing for other goals to be more important than optimizing for popularity. On length, I frequently cover topics that can't be covered in brief easily, or perhaps at all. One example of this is my post on branch prediction, which has two goals: give a programmer with no background in branch prediction or even computer architecture a historical survey and teach them enough to be able to read and understand a modern, state-of-the-art paper on branch prediction. That post comes in at 5.8k words. I don't see how to achieve the same goals with a post that comes in at the lengths that people recommend for blog posts, 500 words, 1000 words, 1500 words, etc. The post could probably be cut down a bit, but every predictor discussed, except the agree predictor, is either a necessary building block used to explain later predictors or of historical importance. But if the agree predictor wasn't discussed, it would still be important to discuss at least one interference-reducing scheme since why interference occurs and what can be done to reduce it is a fundamental concept in branch prediction.

There are other versions of the post that could work. One that explains that branch prediction exists at all could probably be written in 1000 words. That post, written well, would have a wider audience and be more popular, but that's not what I want to write.

I have an analogous opinion on style because I frequently want to discuss things in a level of detail and with a level of precision that precludes writing cleanly in the classic style. A specific, small, example is that, on a recent post, a draft reader asked me to remove a double negative and I declined because, in that case, the double negative had different connotations from the positive statement that might've replaced it and I had something precise I wanted to convey that isn't what would've been conveyed if I simplified the sentence.

A more general thing is that Paul writes about a lot of "big ideas" at a high level. That's something that's amenable to writing in a clean, simple style; what Paul calls an elegant style. But I'm not interested in writing about big ideas that are disconnected from low-level details and it's difficult to effectively discuss low-level details without writing in a style Paul would call inelegant.

A concrete example of this is my discussion of command line tools and the UNIX philosophy. Should we have tools that "do one thing and do it well" and "write programs to handle text streams, because that is a universal interface" or use commands that have many options and can handle structured data? People have been trading the same high-level rebuttals back and forth for decades. But the moment we look at the details, look at what happens when these ideas get exposed to the real world, we can immediately see that one of these sets of ideas couldn't possibly work as espoused.

Coming back to writing style, if you're trying to figure out what stylistic choices are right for you, you should start from your goals and what you're good at and go from there, not listen to somebody who's going to tell you to write like them. Besides being unlikely to work for you even if someone is able to describe what makes their writing tick, most advice is written by people who don't understand how their writing works. This may be difficult to see for writing if you haven't spent a lot of time analyzing writing, but it's easy to see this is true if you've taken a bunch of dance classes or had sports instruction that isn't from a very good coach. If you watch, for example, the median dance instructor and listen to their instructions, you'll see that their instructions are quite different from what they actually do. People who listen and follow instructions instead of attempting to copy what the instructor is doing will end up doing the thing completely wrong. Most writing advice similarly fails to capture what's important.

Unfortunately, copying someone else's style isn't easy either; most people copy entirely the wrong thing. For example, Natalie Wynn noted that people who copy her style often copy the superficial bits without understanding what's driving the superficial bits to be the way they are:

One thing I notice is when people aren’t saying anything. Like when someone’s trying to do a “left tube video essay” and they shove all this opulent shit onscreen because contrapoints, but it has nothing to do with the topic. What’s the reference? What are you saying??

I made a video about shame, and the look is Eve in Eden because Eve was the first person to experience shame. So the visual is connected to the concept and hopefully it resonates more because of that. So I guess that’s my advice, try to say something

If you look into what people who excel in their field have to say, you'll often see analogous remarks about other fields. For example, in Practical Shooting, Rob Leatham says:

What keeps me busy in my classes is trying to help my students learn how to think. They say, "Rob holds his hands like this...," and they don't know that the reason I hold my hands like this is not to make myself look that way. The end result is not to hold the gun that way; holding the gun that way is the end result of doing something else.

And Brian Enos says:

When I began ... shooting I had only basic ideas about technique. So I did what I felt was the logical thing. I found the best local shooter (who was also competitive nationally) and asked him how I should shoot. He told me without hesitation: left index finger on the trigger guard, left elbow bent and pulling back, classic boxer stance, etcetera, etcetera. I adopted the system blindly for a year or two before wondering whether there might be a system that better suited my structure and attitude, and one that better suited the shooting. This first style that I adopted didn't seem to fit me because it felt as though I was having to struggle to control the gun; I was never actually flowing with the gun as I feel I do now. My experimentation led me to pull ideas from all types of shooting styles: Isosceles, Modified Weaver, Bullseye, and from people such as Bill Blankenship, shotgunner John Satterwhite, and martial artist Bruce Lee.

But ideas coming from your environment only steer you in the right direction. These ideas can limit your thinking by their very nature ... great ideas will arise from a feeling within yourself. This intuitive awareness will allow you to accept anything that works for you and discard anything that doesn't

I'm citing those examples because they're written up in a book, but I've heard basically the same comment from instructors in a wide variety of activities, e.g., dance instructors I've talked to complain that people will ask about whether, during a certain motion, the left foot should cross in front or behind the right foot, which is missing the point since what matters is that the foot placement is reasonable given how the person's center of gravity is moving, which may mean that the foot should cross in front or behind, depending on the precise circumstance.

The more general issue is that a person who doesn't understand the thing they're trying to copy will end up copying unimportant superficial aspects of what somebody else is doing and miss the fundamentals that drive the superficial aspects. This even happens when there are very detailed instructions. Although watching what other people do can accelerate learning, especially for beginners who have no idea what to do, there isn't a shortcut to understanding something deeply enough to facilitate doing it well that can be summed up in simple rules, like "omit needless words"4.

As a result, I view style as something that should fall out of your goals, and goals are ultimately a personal preference. Personally, some goals that I sometimes have are:

When you combine one of those goals with the preference of discussing things in detail, you get a style that's different from any of the writers mentioned above, even if you want to use humor as effectively as Steve Yegge, write for as broad an audience as Julia Evans, or write as authoritatively as Paul Graham.

When I think about major components of my writing, the major thing that I view as driving how I write besides style & goals is process. As with style, I view this as something where a wide variety of things can work, where it's up to you to figure out what works for you.

For myself, I had the following process goals when I started my blog: keep the up-front investment low, improve my writing, only publish when I felt like publishing, and write something I'd want to subscribe to.

The low up-front investment goal is because, when I surveyed blogs I'd seen, one of the most common blog formats was a blog that contained a single post explaining that the person was starting a blog, perhaps with another post explaining how their blog was set up, with no further posts. Another common blog format was the blog that had regular posts for a while, followed by a long dormant period with a post at the end explaining that they were going to start posting again, followed by no more posts (in some cases, there are a few such posts, with more time between each). Given the low rate of people continuing to blog after starting a blog, I figured I shouldn't bother investing in blog infra until I knew I was going to write for a while so, even though I already owned this domain name, I didn't bother figuring out how to point this domain at github pages and just set up a default install of some popular blogging software and I didn't even bother doing that until I had already written a post. In retrospect, it was a big mistake to use Octopress (Jekyll); I picked it because I was hanging out with a bunch of folks who were doing trendy stuff at the time, but the fact that it was so annoying to set up that people organized little "Octopress setup days" was a bad sign. And it turns out that, not only was it annoying to set up, it had a fair amount of breakage, used a development model that made it impossible to take upstream updates, and it was extremely slow (it didn't take long before it took a whole minute to build my blog, a ridiculous amount of time to "compile" a handful of blog posts). I should've either just written pure HTML until I had a few posts and then turned that into a custom static site generator, or used WordPress, which can be spun up in minutes and trivially moved or migrated from. But, part of the low up-front investment involved not doing research into this and trusting that people around me were making reasonable decisions5. Overall, I stand behind the idea of keeping startup costs low, but had I just ignored all of the standard advice and either done something minimal or used the out-of-fashion but straightforward option, I would've saved myself a lot of work.

The "improve writing" goal is because I found my writing annoyingly awkward and wanted to fix that. I frequently wrote sentences or paragraphs that seemed clunky to me, like when you misspell a word and it looks wrong no matter how you try re-spelling it. Spellcheckers are now ubiquitous enough that you don't really run into the spelling problem anymore, but we don't yet have automated tools that will improve your writing (some attempts exist, but they tend to create bad writing). I didn't worry about any specific post since I figured I could easily spend years working on my writing and I didn't think that spending years re-editing a single post would be very satisfying.

As we've discussed before, getting feedback can greatly speed up skill acquisition, so I hired a professional editor whose writing I respect with the instruction "My writing is clunky and awkward and I'd like to fix it. I don't really care about spelling and grammar issues. Can you edit my writing with that in mind?". I got detailed feedback on a lot of my posts. I tried to fix the issues brought up in the feedback but, more importantly, tried to write my next post without it having the same or other previously mentioned issues. I can be a bit of a slow learner, so it sometimes took a few posts to iron out an issue but, over time, my writing improved a lot.

The goal of only publishing when I felt like publishing is because I generally prefer process goals to outcome goals, at least with respect to personal goals. I originally had a goal of spending a certain amount of time per month blogging, but I got rid of that when I realized that I'd tend to spend enough time writing regardless of whether or not I made it an obligation. I think that outcome goals with respect to blogging do work for some people (e.g., "publish one post per week"), but if your goal is to improve writing quality, having outcome goals can be counterproductive (e.g., to hit a "publish one post per week" goal on limited time, someone might focus on getting something out the door and then not think about how to improve quality since, from the standpoint of the outcome goal, improving quality is a waste of time).

Having a goal of writing something I'd want to subscribe to is, of course, highly arbitrary. There are a bunch of things I don't like in other blogs, so I try to avoid them. Some examples:

Writing on my own platform is the most minor of these. A major reason for that comes out of what's happened to platforms. At the time I started my blog, a number of platforms had already come and gone. Most recently, Twitter had acquired Posterous and shut it down. For a while, Posterous was the trendiest platform around and Twitter's decision to kill it entirely broke links to many of the all-time top voted HN posts, among others. Blogspot, a previously trendy place to write, had also been acquired by Google and severely degraded the reader experience on many sites afterwards. Avoiding trendy platforms has worked out well. The two trendy platforms people were hopping on when I started blogging were Svbtle and Medium. Svbtle was basically abandoned shortly after I started my blog, when it became clear that Medium was going to dominate Svbtle on audience size. And Medium never managed to find a good monetization strategy and severely degraded the user experience for readers in an attempt to generate enough revenue to justify its valuation after raising $160M. You can't trust someone else's platform to not disappear underneath you or radically change in the name of profit.

A related thing I wanted to do was write in something that's my own space (as opposed to in internet comments). I used to write a lot of HN comments6, but the half-life of an HN comment is short. With very few exceptions, basically all of the views a comment is going to get will be in the first few days. With a blog, it's the other way around. A post might get a burst of traffic initially but, as long as you keep writing, most traffic will come later (e.g., for my blog, I tend to get roughly twice as many hits as the baseline level when a post is on HN, and of course I don't have a post on HN most days). It isn't really much more work to write a "real blog post" instead of writing an HN comment, so I've tended to favor writing blog posts instead of HN comments. Also, when I write here, most of the value created is split between myself and readers. If I were to write on someone else's platform, most of the value would be split between the platform and readers. If I were doing video, I might not really have a choice outside of YouTube or Twitch but, for text, I have a real choice. Looking at how things worked out for people who made the other choice and decided to write comments for a platform, I think I made the right choice for the right reasons. I do see the appeal of the reduced friction commenting on an existing platform offers but, even so, I'd rather pay the cost of the extra friction and write something that's in my space instead of elsewhere.

All of that together is basically it. That's how I write.

Unlike other bloggers, I'm not going to try to tell you "how to write usefully" or "how to write well" or anything like that. I agree with Steve Yegge when he says that you should consider writing because it's potentially high value and the value may show up in ways you don't expect, but how you write should really come from your goals and aptitudes.

Appendix: changes in approach over time

When I started the blog, I used to worry that a post wouldn't be interesting enough because it only contained a simple idea, so I'd often wait until I could combine two or more ideas into a single post. In retrospect, I think many of my early posts would've been better off as separate posts. For example, this post on compensation from 2016 contains the idea that compensation might be turning bimodal and that programmers are unbelievably well paid given the barriers to entry compared to other fields that are similarly remunerative, such as finance, law, and medicine. I don't think there was much value-add to combining the two ideas into a single post and I think a lot more people would've read the bit about how unusually highly paid programmers are if it wasn't bundled into a post about compensation becoming bimodal.

Another thing I used to do is avoid writing things that seem too obvious. But, I've come around to the idea that there's a lot of value in writing down obvious things and a number of my most influential posts have been on things I would've previously considered too obvious to write down:

Excluding these recent posts, more people have told me that https://danluu.com/look-stupid/ has changed how they operate than all other posts combined (and the only reason it's even close is that a lot of people have told me that my discussions of compensation caused them to realize that they can find a job they enjoy more that also pays hundreds of thousands a year more than they were previously making, which is the set of posts that's drawn the most comments from people telling me that the post was pointless because everybody knows how much you can make in tech).

A major, and relatively recent, style change I'm trying out is using more examples. This was prompted by comments from Ben Kuhn, and I like it so far. Compared to most bloggers, I wasn't exactly light on examples in my early days, but one thing I've noticed is that adding more examples than I would naturally tend to can really clarify things for readers; having "a lot" of examples reduces the rate at which people take away wildly different ideas than the ones I meant. A specific example of this would be, in a post discussing what it takes to get to 95%-ile performance, I only provided a couple examples and many people filled in the blanks and thought that performance that's well above 99.9%-ile is 95%-ile, e.g., that being a chess GM is 95%-ile.

Another example of someone who's made this change is Jamie Brandon. If you read his early posts, such as this one, he often has a compelling idea with a nice turn of phrase, e.g., this bit about when he was working on Eve with Chris Granger:

People regularly tell me that imperative programming is the natural form of programming because 'people think imperatively'. I can see where they are coming from. Why, just the other day I found myself saying, "Hey Chris, I'm hungry. I need you to walk into the kitchen, open the cupboard, take out a bag of bread, open the bag, remove a slice of bread, place it on a plate..." Unfortunately, I hadn't specified where to find the plate so at this point Chris threw a null pointer exception and died.

But, despite having parts that are really compelling, his earlier writing was often somewhat disconnected from the real world in a way that Jamie doesn't love when looking back on his old posts. On adding more details, Jamie says

The point of focusing down on specific examples and keeping things as concrete as possible is a) makes me less likely to be wrong, because non-concrete ideas are very hard to falsify and I can trick myself easily b) makes it more likely that the reader absorbs the idea I'm trying to convey rather than some superficially similar idea that also fits the vague text.

Examples kind of pin ideas down so they can be examined properly.

Another big change, the only one I'm going to discuss here that really qualifies as prose style, is that I try much harder to write things where there's continuity of something that's sometimes called "narrative grammar". This post by Nicola Griffith has some examples of this at the sentence level, but I also try to think about this in the larger structure of my writing. I don't think I'm particularly good at this, but thinking about this more has made my writing easier to follow. This change, especially on larger scales, was really driven by working with a professional editor who's good at spotting structural issues that make writing more difficult to understand. But, at the same time, I don't worry too much if there's a reason that something is difficult to follow. A specific example of this is, if you read answers to questions on ask metafilter or reddit, any question that isn't structurally trivial will have a large fraction of answers from people who failed to read the question and answered the wrong question, e.g., if someone asks for something that has two parts connected with an and, many people will only read one half of the and and give an answer that's clearly disqualified by the and condition. If many people aren't going to read a short question closely enough to write up an answer that satisfies both halves of an and, many people aren't going to follow the simplest things anyone might want to write. I don't think it's a good use of a writer's time to try to walk someone who can't be bothered with reading both sides of an and through a structured post, but I do think there's value in trying to avoid "narrative grammar" issues that might make it harder for someone who does actually want to read.

Appendix: getting feedback

As we've previously discussed, feedback can greatly facilitate improvement. Unfortunately, the idea from that post, that 95%-ile performance is generally poor, also applies to feedback, making most feedback counterproductive.

I've spent a lot of time watching people get feedback in private channels and seeing how they change their writing in response to it and, at least in the channels that I've looked at (programmers and not professional writers or editors commenting), most feedback is ignored. And when feedback is taken, because almost all feedback is bad and people generally aren't perfect or even very good at picking out good feedback, the feedback that's taken is usually bad.

Fundamentally, most feedback has the issue mentioned in this post and is a form of "you should write it like I would've written it", which generally doesn't work unless the author of the feedback is very careful in how they give the feedback, which few people are. The feedback tends to be superficial advice that misses serious structural issues in writing. Furthermore, the feedback also tends to be "lowest common denominator" feedback that turns nice prose into Strunk-and-White-ified mediocre prose. I don't think that I have a particularly nice prose style, but I've seen a number of people who have a naturally beautiful style ask for feedback from programmers, which has turned their writing into boring prose that anyone could've written.

The other side of this is that when people get what I think is good, substantive, feedback, the most common response is "nah, it's fine". I think of this as the flip side of most feedback being "you should write it how I'd write it". Most people's response to feedback is "I want to write it how I want to write it".

Although this post has focused on how a wide variety of styles can work, it's also true that, given a style and a set of goals, writing can be better or worse. But, most people who are getting feedback don't know enough about writing to know what's better and what's worse, so they can't tell the difference between good feedback and bad feedback.

One way around this is to get feedback from someone whose judgement you trust. As mentioned in the post, the way I did this was by hiring a professional editor whose writing (and editing) I respected.

Another thing I do, one that's a core aspect of my personality and not really about writing, is that I take feedback relatively seriously and try to avoid having a "nah, it's fine" response to feedback. I wouldn't say that this is optimal since I've sometimes spent far too much time on bad feedback, but a core part of how I think is that I'm aware that most people are overconfident and frequently wrong because of their overconfidence, so I don't trust my own reasoning and spend a relatively large amount of time and effort thinking about feedback in an attempt to reduce my rate of overconfidence.

At times, I've spent a comically long amount of time mulling over what is, in retrospect, very bad and "obviously" incorrect feedback that I've been wary of dismissing as incorrect. One thing I've noticed is that, as people gain an audience, some people become more and more confident in themselves and eventually end up becoming highly overconfident. It's easy to see how this happens — as you gain prominence, you'll get more exposure and more "fans" who think you're always right and, on the flip side, you'll also get more "obviously" bad comments.

Back when basically no one read my blog, most of the comments I got were quite good. As I've gotten more and more readers, the percentage of good comments has dropped. From looking at how other people handle this, one common failure mode is that they'll see the massive number of obviously wrong comments that their posts draw and then incorrectly conclude that all of their critics are bozos and that they're basically never wrong. I don't really have an antidote to that other than "take criticism very seriously". Since the failure mode here involves blind spots in judgement, I don't see a simple way to take a particular piece of criticism seriously that doesn't have the potential to result in incorrectly dismissing the criticism due to a blind spot.

Fundamentally, my solution to this has been to avoid looking at most feedback while trying to take feedback from people I trust.

When it comes to issues with the prose, one thing that we discussed above, hiring a professional editor whose writing and editing I respect and deferring to them on issues with my prose, worked well.

When it comes to logical soundness or just general interestingness, those are more difficult to outsource to a single person and I have a set of people whose judgement I trust who look at most posts. If anyone whose judgement I trust thinks a post is interesting, I view that as a strong confirmation and I basically ignore comments that something is boring or uninteresting. For almost all of my posts that are among my top posts in terms of the number of people who told me the post was life changing for them, I got a number of comments from people whose judgement I otherwise think isn't terrible saying that the post seemed boring, pointless, too obvious to write, or just plain uninteresting. I used to take comments that something was uninteresting seriously but, in retrospect, that was a mistake that cost me a lot of time and didn't improve my writing. I think this isn't so different from people who say "write how I write"; instead, it's people who have a similar mental model, but with respect to interesting-ness instead, who can't imagine that other people would find something interesting that they don't. Of course, not everyone's mind works like that, but people who are good at modeling what other people find interesting generally don't leave feedback like "this is boring/pointless", so feedback of that form is almost guaranteed to be worthless.

When it comes to the soundness of an argument, I take the opposite approach that I do for interestingness, in that I take negative comments very seriously and I don't do much about positive comments. I have, sometimes, wasted a lot of time on particular posts because of that. My solution to that has been to try to ignore feedback from people who regularly give bad feedback. That's something I think of as dangerous to do since selectively choosing to ignore feedback is a good way to create an echo chamber, but really seriously taking the time to think through feedback when I don't see a logical flaw is time consuming enough that I don't think there's really another alternative given how I re-evaluate my own work when I get feedback.

One thing I've started doing recently that's made me feel a lot better about this is to look at what feedback people give to others. People who give me bad feedback generally also give other people feedback that's bad in pretty much exactly the same ways. Since I'm not really concerned that I have some cognitive bias that might mislead me into thinking I'm right and their feedback is wrong when it comes to their feedback on other people's writing, instead of spending hours trying to figure out if there's some hole in how I'm explaining something that I'm missing, I can spend minutes seeing that their feedback on someone else's writing is bogus feedback and then see that their feedback on my writing is bogus in exactly the same way.

Appendix: where I get ideas

I often get asked how I get ideas. I originally wasn't going to say anything about this because I don't have much to say, but Ben Kuhn strongly urged me to add this section "so that other people realize what an alien you are".

My feeling is that the world is so full of interesting stuff that ideas are everywhere. I have on the order of a hundred drafts lying around that I think are basically publishable that I haven't prioritized finishing up for one reason or another. If I think of ideas where I've sketched out a post in my head but haven't written it down, the number must be well into the thousands. If I were to quit my job and then sit down to write full-time until I died, I think I wouldn't run out of ideas even if I stuck to ones I've already had. The world is big and wondrous and fractally interesting.

For example, I recently took up surf skiing (a kind of kayaking) and I'd say that, after a few weeks, I had maybe twenty or so blog post ideas that I think could be written up for a general audience in the sense that this post on branch prediction is written for a general audience, in that it doesn't assume any hardware background. I could write two posts on different technical aspects of canoe paddle evolution and design as well as two posts on cultural factors and how they impacted the uptake of different canoe paddle designs. Kayak paddle design has been, in recent history, a lot richer, and that could easily be another five or six posts. The technical aspects of hull design are richer still and could be an endless source of posts, although I only have four particular posts in mind at the moment; the cultural and historical aspects also seem interesting to me and that's what rounds out the twenty things in my head with respect to that.

I don't have twenty posts on kayaking and canoeing in my head because I'm particularly interested in kayaking and canoeing. Everything seems interesting enough to write twenty posts about. A lot of my posts that exist are part of what might become a much longer series of posts if I ever get around to spending the time to write them up. For example, this post on decision making in baseball was, in my head, the first of a long-ish (10+) post series on decision making that I never got around to writing that I suspect I'll never write because there's too much other interesting stuff to write about and not enough time.

Appendix: other writing about writing

Appendix: things that increase popularity that I generally don't do

Here are some things that I think work based on observing what works for other people that I don't do, but if you want a broad audience, perhaps you can try some of them out:

Appendix: some snippets of writing

In case you're not familiar with the writers mentioned, here are some snippets that I think are representative of their writing styles:

Joel Spolsky:

Why I really care is that Microsoft is vacuuming up way too many programmers. Between Microsoft, with their shady recruiters making unethical exploding offers to unsuspecting college students, and Google (you're on my radar) paying untenable salaries to kids with more ultimate frisbee experience than Python, whose main job will be to play foosball in the googleplex and walk around trying to get someone...anyone...to come see the demo code they've just written with their "20% time," doing some kind of, let me guess, cloud-based synchronization... between Microsoft and Google the starting salary for a smart CS grad is inching dangerously close to six figures and these smart kids, the cream of our universities, are working on hopeless and useless architecture astronomy because these companies are like cancers, driven to grow at all cost, even though they can't think of a single useful thing to build for us, but they need another 3000-4000 comp sci grads next week. And dammit foosball doesn't play itself.

Paul Graham:

A couple years ago a venture capitalist friend told me about a new startup he was involved with. It sounded promising. But the next time I talked to him, he said they'd decided to build their software on Windows NT, and had just hired a very experienced NT developer to be their chief technical officer. When I heard this, I thought, these guys are doomed. One, the CTO couldn't be a first rate hacker, because to become an eminent NT developer he would have had to use NT voluntarily, multiple times, and I couldn't imagine a great hacker doing that; and two, even if he was good, he'd have a hard time hiring anyone good to work for him if the project had to be built on NT.

Steve Yegge:

When I read this book for the first time, in October 2003, I felt this horrid cold feeling, the way you might feel if you just realized you've been coming to work for 5 years with your pants down around your ankles. I asked around casually the next day: "Yeah, uh, you've read that, um, Refactoring book, of course, right? Ha, ha, I only ask because I read it a very long time ago, not just now, of course." Only 1 person of 20 I surveyed had read it. Thank goodness all of us had our pants down, not just me.

This is a wonderful book about how to write good code, and there aren't many books like it. None, maybe. They don't typically teach you how to write good code in school, and you may never learn on the job. It may take years, but you may still be missing some key ideas. I certainly was. ... If you're a relatively experienced engineer, you'll recognize 80% or more of the techniques in the book as things you've already figured out and started doing out of habit. But it gives them all names and discusses their pros and cons objectively, which I found very useful. And it debunked two or three practices that I had cherished since my earliest days as a programmer. Don't comment your code? Local variables are the root of all evil? Is this guy a madman? Read it and decide for yourself!

Julia Evans:

Right now I’m on a million-hour train ride from New York to Montreal. So I’m looking at the output of strace because, uh, strace is cool, and it is teaching me some things about how the command line tools I use all the time work.

What strace does is capture every single system call that gets called when executing a program. System calls are the interface between userspace programs and the kernel, so looking at the output from strace is a fun way to understand how Linux works, and what’s really involved in running a program.

For example! killall! I ran

strace killall ruby1.9.1 2> killall-log.

Thanks to Yossi Kreinin, Ben Kuhn, Laurence Tratt, Heath Borders, Jamie Brandon, Julia Evans, Vegard Nossum, Julien Kirch, Bram Delver, and Pam Wolf for comments/corrections/discussion.


  1. What's worked can mean very different things for different people, but for this section we're going to look at popular blogs because, when people I know have frustratedly stopped writing after writing a blog for a while, the most common reason has been that their blog had basically no readers.

    Of course, many people write without a goal of having readers and some people even try to avoid having more than a few readers (by "locking" posts in some way so that only "friends" have access) but, I don't think the idea that "what works" is very broad and that many different styles can work changes if the goal is to have just a few friends read a blog.

    [return]
  2. This is pretty arbitrary. In other social circles, Jeff Atwood, Raymond Chen, Scott Hanselman, etc., might be on the list, but this wouldn't change the point since all of these folks also have different styles from each other as well as the people on my list. [return]
  3. 2017 is the endpoint since I reduced how much I pay attention to programming internet culture around then and don't have a good idea on what people I know were reading after 2017. [return]
  4. In sports, elite coaches that have really figured out how to cue people to do the right thing can greatly accelerate learning but, outside of sports, although there's no shortage of people who are willing to supply coaching, it's rare to find one who's really figured out what cues students can be given that will help them get to the right thing much more quickly than they would've if they just naively measured what they were doing and applied a bit of introspection. [return]
  5. It turns out that blogging has been pretty great for me (e.g., my blog got me my current job, facilitated meeting a decent fraction of my friends, results in people sending me all sorts of interesting stories about goings-on in the industry, etc.), but I don't think that was a predictable outcome before starting the blog. My guess, based on base rates, was that the most likely outcome was failure. [return]
  6. Such as this comment on how cushy programming jobs are compared to other lucrative jobs (which turned into the back half of this post on programmer compensation), this comment on writing pay, and this comment on the evolution of board game design. [return]

Some latency measurement pitfalls

2021-12-06 08:00:00

This is a pseudo-transcript (actual words modified to be more readable than a 100% faithful transcription) of a short lightning talk I did at Twitter a year or two ago, on pitfalls of how we use latency metrics (with the actual service names anonymized per a comms request). Since this presentation, significant progress has been made on this on the infra side, so the situation is much improved over what was presented, but I think this is still relevant since, from talking to folks at peer companies, many folks are facing similar issues.

We frequently use tail latency metrics here at Twitter. Most frequently, service owners want to get cluster-wide or Twitter-wide latency numbers for their services. Unfortunately, the numbers that service owners tend to use differ from what we'd like to measure due to some historical quirks in our latency measurement setup:

Opaque, uninstrumented, latency

When we look at the dashboards for most services, the latency metrics that are displayed and are used for alerting are usually from the server the service itself is running on. Some services that have dashboards set up by senior SREs who've been burned by invisible latency before will also have the service's client-observed latency from callers of the service. I'd like to discuss three issues with this setup.

For the purposes of this talk, we can view a client request as passing through the following pipeline after client "user" code passes the request to our RPC layer, Finagle (https://twitter.github.io/finagle/), and before client user code receives the response (the way Finagle currently handles requests, we can't get timestamps for a particular request once the request is handed over to the network library we use, netty):

client netty -> client Linux -> network -> server Linux -> server netty -> server "user code" -> server netty -> server Linux -> network -> client Linux -> client netty

As we previously saw in [an internal document quantifying the impact of CFS bandwidth control throttling and how our use of excessively large thread pools causes throttling]1, we frequently get a lot of queuing in and below netty, which has the knock-on effect of causing services to get throttled by the kernel, which often results in a lot of opaque latency, especially under high load, when we most want dashboards to show correct latency numbers.

When we sample latency at the server, we basically get the latency of the server "user code" portion of the pipeline above, i.e., the time between server netty handing the request to Finagle and Finagle handing the response back.

When we sample latency at the client, we basically get everything after client "user" code hands the request to Finagle: client netty, client Linux, the network, server Linux, server netty, server "user code", and the corresponding return path.

There are two issues with this. The first is that, with metrics data, we don't have a nice way to tell whether latency in the opaque parts of the stack is coming from the client or the server. As a service owner, if you set alerts based on client latency, you'll get alerted when client latency rises because there's too much queuing in netty or Linux on the client even when your service is running smoothly.

The second is that the client latency metrics that are reasonable to look at, given what we expose, give you latency across all servers a client talks to, which is a really different view from server metrics, which give us per-server latency numbers. There isn't a good way to aggregate per-server client numbers across all clients, so it's difficult to tell, for example, if a particular instance of a server has high latency in netty.

Below are a handful of examples of cluster-wide measurements of latency measured at the client vs. the server. These were deliberately selected to show a cross-section of deltas between the client and the server.

Graph showing large difference between latency measured at the client vs. at the server

This is a CDF, presented with the standard orientation for a CDF, with the percentile on the y-axis and the value on the x-axis, which makes down and to the right higher latency and up and to the left lower latency, with a flatter line meaning latency is increasing quickly and a steeper line meaning that latency is increasing more slowly.

Because the chart is log scale on both axes, the difference between client and server latency is large even though the lines don't look all that far apart. For example, if we look at 99%-ile latency, we can see that it's ~16ms when measured at the server and ~240ms when measured at the client, a factor of 15 difference. Alternately, if we look at a fixed latency, like 240ms, and look up the percentile, we see that's 99%-ile latency on the client, but well above 99.9%-ile latency on the server.
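
If you want to do this kind of lookup on your own sampled latency data, both directions are a couple of lines of numpy. Here's a minimal sketch; the lognormal samples are made-up stand-ins for trace data, not our actual latency distributions:

    import numpy as np

    rng = np.random.default_rng(0)
    # Made-up stand-ins for per-request latencies (ms) sampled from traces.
    server_ms = rng.lognormal(mean=1.0, sigma=0.8, size=100_000)
    client_ms = server_ms + rng.lognormal(mean=1.5, sigma=1.5, size=100_000)

    # Percentile -> latency value ("what is p99?").
    for name, x in (("server", server_ms), ("client", client_ms)):
        print(f"{name} p99 = {np.percentile(x, 99):.1f} ms")

    # Latency value -> percentile ("what percentile is 240ms?").
    threshold_ms = 240
    for name, x in (("server", server_ms), ("client", client_ms)):
        print(f"{name}: {threshold_ms} ms is p{100 * np.mean(x <= threshold_ms):.2f}")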

The graphs below have similar properties, although the delta between client and server will vary.

Graph showing moderate difference between latency measured at the client vs. at the server until p99.5, with large difference above p99.5. Graph showing small difference between latency measured at the client vs. at the server until p74, with increasing divergence after that. Graph showing moderate difference between latency measured at the client vs. at the server until close to client timeout value, with large divergence near timeout value. Graph showing small difference between latency measured at the client vs. at the server until p999, with rapid increase after that.

We can see that latencies often differ significantly when measured at the client vs. when measured at the server and that, even in cases where the delta is small for lower percentiles, it sometimes gets large at higher percentiles, where more load can result in more queueing and therefore more latency in netty and the kernel.

One thing to note is that, for any particular measured server latency value, we see a very wide range of client latency values. For example, here's a zoomed in scatterplot of client vs. server latency for service-5. If we were to zoom out, we'd see that for a request with a server-measured latency of 10ms, we can see client-measured latencies as high as 500ms. More generally, we see many requests where the server-measured latency is very similar to the client-measured latency, with a smattering of requests where the server-measured latency is a very inaccurate representation of the client-measured latency. In almost all of those cases, the client-measured latency is higher due to queuing in a part of the stack that's opaque to us and, in a (very) few cases, the client-measured latency is lower due to some issues in our instrumentation. In the plot below, due to how we track latencies, we only have 1ms granularity on latencies. The points on the plots below have been randomly jittered by +/- 0.4ms to give a better idea of the distribution at points on the plot that are very dense.

Per-request scatterplot of client vs. server latency, showing that any particular server latency value can be associated with a very wide range of client latency values

While it's possible to plumb instrumentation through netty and the kernel to track request latencies after Finagle has handed them off (the kernel even has hooks that would make this somewhat straightforward), that's probably more work than is worth it in the near future. If you want to get an idea for how your service is impacted by opaque latency, it's fairly easy to get a rough idea with Zipkin if you leverage the work Rebecca Isaacs, Jonathan Simms, and Rahul Iyer have done, which is how I generated the plots above. The code for these is checked into [a path in our monorepo] and you can plug in your own service names if you just want to check out a different service.

Lack of cluster-wide aggregation capability

In the examples above, we were able to get cluster-wide latency percentiles because we used data from Zipkin, which attempts to sample requests uniformly at random. For a variety of reasons, service owners mostly rely on metrics data which, while more complete because it's unsampled, doesn't let us compute cluster-wide aggregates because we pre-compute fixed aggregations on a per-shard basis and there's no way to reconstruct the cluster-wide aggregate from the per-shard aggregates.

From looking at dashboards of our services, the most common latency target is an average of shard-level 99%-ile latency (with some services that are deep in the request tree, like cache, using numbers further in the tail). Unfortunately, taking the average of per-shard tail latency defeats the purpose of monitoring tail latency. The reason we want to look at tail latency in the first place is that, with high-fanout and high-depth request trees, a very small fraction of server responses slowing down can slow down many or most top-level requests. The average of shard-level tail latencies fails to capture exactly that property, while also missing out on the advantage of cluster-wide averages, which can be correctly reconstructed from per-shard averages.

For example, when we have a few bad nodes responding slowly, that has a small impact on the average per-shard tail latency even though cluster-wide tail latency will be highly elevated. As we saw in [a document quantifying the extent of machine-level issues across the fleet as well as the impact on data integrity and performance]2, we frequently have host-level issues that can drive tail latency on a node up by one or more orders of magnitude, which can sometimes drive median latency on the node up past the tail latency on other nodes. Since a few or even one such node can determine the tail latency for a cluster, taking the average across all nodes can be misleading, e.g., if we have a 100 node cluster where tail latency is up by 10x on one node, this might cause our average of per-shard tail latencies to increase by a factor of 0.99 * 1 + 0.01 * 10 = 1.09 when the actual increase in cluster-wide tail latency is much larger.

Some service owners try to get a better approximation of cluster-wide tail latency by taking a percentile of the 99%-ile, often the 90%-ile or the 99%-ile, but this doesn't work either and there is, in general, no per-shard percentile or other aggregation of per-shard tail latencies that can reconstruct the cluster-level tail latency.
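
To make the 100-node example concrete, here's a small simulation in Python; the shard count, request counts, and gamma latency distributions are made up for illustration, not taken from any real service:

    import numpy as np

    rng = np.random.default_rng(0)
    n_shards, reqs_per_shard = 100, 10_000

    # 99 healthy shards plus one shard whose latencies are 10x higher.
    shards = [rng.gamma(shape=2.0, scale=5.0, size=reqs_per_shard)
              for _ in range(n_shards - 1)]
    shards.append(rng.gamma(shape=2.0, scale=50.0, size=reqs_per_shard))

    per_shard_p99 = np.array([np.percentile(s, 99) for s in shards])
    cluster_p99 = np.percentile(np.concatenate(shards), 99)

    print("average of per-shard p99:", round(per_shard_p99.mean(), 1), "ms")
    print("p90 of per-shard p99:    ", round(np.percentile(per_shard_p99, 90), 1), "ms")
    print("p99 of per-shard p99:    ", round(np.percentile(per_shard_p99, 99), 1), "ms")
    print("actual cluster-wide p99: ", round(cluster_p99, 1), "ms")

With these made-up numbers, the average of per-shard p99s comes out around 1.09x a healthy shard's p99, matching the arithmetic above, the p90 of per-shard p99s ignores the bad shard entirely, and the actual cluster-wide p99 is elevated by quite a bit more than either suggests; the percentile-of-percentile variants are also unstable, swinging wildly if you change how many shards are bad or how bad they are.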

Below are plots of the various attempts people make on dashboards to approximate cluster-wide latency with instance-level metrics data vs. the actual (sampled) cluster-wide latency, on a large service, which makes the percentile-of-percentile attempts more accurate than they would be for smaller services. We can see that the correlation is very weak and has the problem we expect, where the average of the tail isn't influenced by outlier shards as much as it "should be" and the various commonly used percentiles either aren't influenced enough or are influenced too much on average, and are also weakly correlated with the actual latencies. Because we track metrics with minutely granularity, each point in the graphs below represents one minute, with the sampled cluster-wide p999 latency on the x-axis and the dashboard-aggregated metric value on the y-axis. Because we have 1ms granularity on individual latency measurements from our tracing pipeline, points are jittered horizontally +/- 0.3ms to give a better idea of the distribution (no such jitter is applied vertically since we don't have this limitation in our metrics pipeline, so that data is higher precision).

Per-minute scatterplot of average of per-shard p999 vs. actual p999, showing that average of per-shard p999 is a very poor approximation Per-minute scatterplot of p99 of per-shard p999 vs. actual p999, showing that p99 of per-shard p999 is a poor approximation Per-minute scatterplot of p999 of per-shard p999 vs. actual p999, showing that p999 of per-shard p999 is a very poor approximation

The correlation between cluster-wide latency and aggregations of per-shard latency is weak enough that even if you pick the aggregation that results in the correct average behavior, the value will still be quite wrong for almost all samples (minutes). Given our infra, the only solutions that can really work here are extending our tracing pipeline for use on dashboards and with alerts, or adding metric histograms to Finagle and plumbing that data up through everything and then into [dashboard software] so that we can get proper cluster-level aggregations3.
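
For what it's worth, the reason shard-level histograms fix this is that bucket counts, unlike pre-computed percentiles, can simply be summed across shards and the cluster-wide percentile read off the merged histogram (at bucket resolution). A rough sketch of the idea, with made-up bucket boundaries and latency distributions rather than Finagle's actual bucketing scheme:

    import numpy as np

    # Hypothetical latency bucket upper edges (ms); real histogram
    # implementations use their own bucketing schemes.
    buckets = np.array([1, 2, 5, 10, 20, 50, 100, 200, 500, 1000])

    def to_histogram(latencies_ms):
        counts, _ = np.histogram(latencies_ms, bins=np.concatenate(([0], buckets)))
        return counts

    rng = np.random.default_rng(1)
    shards = [rng.gamma(2.0, 5.0, size=10_000) for _ in range(99)]
    shards.append(rng.gamma(2.0, 50.0, size=10_000))  # one slow shard

    # Per-shard bucket counts sum element-wise into a cluster-wide histogram,
    # which is what lets us recover a (bucket-resolution) cluster-wide percentile.
    cluster_counts = sum(to_histogram(s) for s in shards)
    cum = np.cumsum(cluster_counts) / cluster_counts.sum()
    print("cluster-wide p99 falls in the bucket ending at",
          buckets[np.searchsorted(cum, 0.99)], "ms")
    print("exact cluster-wide p99:",
          round(np.percentile(np.concatenate(shards), 99), 1), "ms")

This is the same composability property that makes the sampled tracing data usable for cluster-wide numbers: raw or uniformly sampled observations merge across shards, while per-shard percentiles don't.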

While it's popular to take the average of tail latencies because it's easy and people are familiar with it (e.g., the TL of observability at [redacted peer company name] has said that they shouldn't bother with anything other than averages because everyone just wants averages), taking the average or another aggregation of shard-level tail latencies has neither the properties people want nor the properties people expect.

Minutely resolution

Another, independent, issue that's a gap in our ability to observe what's going on with our infrastructure is that we only collect metrics at a minutely granularity. Rezolus does metrics collection on a secondly (and in some cases, even sub-secondly) granularity, but for reasons that are beyond the scope of this talk, it's generally only used for system-level metrics (with a few exceptions).

We've all seen incidents where some bursty, sub-minutely event is the cause of a problem. Let's look at an example of one such incident. In this incident, a service had elevated latency and error rate. Looking at the standard metrics we export wasn't informative, but looking at sub-minutely metrics immediately reveals a clue:

Plot of per-request latency for sampled requests, showing large spike followed by severely reduced request rate

For this particular shard of a cache (and many others, not shown), there's a very large increase in latency at time 0, followed by 30 seconds of very low request rate. The 30 seconds is because shards of service-6 were configured to mark servers they talk to as dead for 30 seconds if service-6 clients encounter too many failed requests. This decision is distributed, which is why the request rate to the impacted shard of cache-1 isn't zero; some shards of service-6 didn't send requests to that particular shard of cache-1 during the period of elevated latency, so they didn't mark that shard of cache-1 as dead and continued to issue requests.

A sub-minutely view of request latency made it very obvious what mechanism caused elevated error rates and latency in service-6.

One thing to note is that the lack of sub-minutely visibility wasn't the only issue here. Much of the elevated latency was in places that are invisible to the latency metric, making monitoring of cache-1 latency metrics insufficient to detect the issue. Below, the reported latency metrics for a single instance of cache-1 are the blue points and the measured (sampled) latency the client observed is the black line4. Reported p99 latency is 0.37ms, but actual p99 latency is ~580ms, a more than three order of magnitude difference.

Plot of reported metric latency vs. latency from trace data, showing extremely large difference between metric latency and trace latency

Summary

Although our existing setup for reporting and alerting on latency works pretty decently, in that the site generally works and our reliability is actually quite good compared to peer companies in our size class, we do pay some significant costs as a result of our setup.

One is that we often have incidents where it's difficult to see what's going on without using tools that are considered specialized and that most people don't use, adding to the toil of being on call. Another is that, due to large margins of error in our estimates of cluster-wide latencies, we have to provision a very large amount of slack and keep latency SLOs that are much stricter than the actual latencies we want to achieve to avoid user-visible incidents. This increases operating costs as we've seen in [a document comparing per-user operating costs to companies that serve similar kinds of and levels of traffic].

If you enjoyed this post you might like to read about tracing on a single host vs. sampling profilers.

Appendix: open vs. closed loop latency measurements

Some of our synthetic benchmarking setups, such as setup-1, use "closed-loop" measurement, where they effectively send a single request, wait for it to come back, and then send another request. Some of these allow for a degree of parallelism, where N requests can be in flight at once, but that still has similar problems in terms of realism.

For a toy example of the problem, let's say that we have a service that, in production, receives exactly 1 request every second and that the service has a normal response time of 1/2 second. Under normal behavior, if we issue requests at 1 per second, we'll observe that the mean, median, and all percentile request times are 1/2 second. As an exercise for the reader, compute the mean and 90%-ile latency if the service has no parallelism and one request takes 10 seconds in the middle of a 1 minute benchmark run for a closed vs. open loop benchmark setup where the benchmarking setup issues requests at 1 per second for the open loop case, and 1 per second but waits for the previous request to finish in the closed loop case.
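
If you'd rather not work through the arithmetic by hand, here's a small simulation of the toy scenario (running it spoils the exercise). The modeling choices, like making the request issued at t=30 the slow one, are my own assumptions about the setup described above:

    import numpy as np

    def run_fifo(arrivals, services):
        # Single server, no parallelism: a request starts once it has arrived
        # and the server is free; latency = completion time - arrival time.
        free_at, latencies = 0.0, []
        for arrive, service in zip(arrivals, services):
            start = max(arrive, free_at)
            free_at = start + service
            latencies.append(free_at - arrive)
        return np.array(latencies)

    NORMAL, SLOW, DURATION = 0.5, 10.0, 60

    # Open loop: requests are issued once per second no matter what;
    # the request issued at t=30 is the slow one.
    arrivals = np.arange(DURATION, dtype=float)
    services = np.where(arrivals == 30, SLOW, NORMAL)
    open_lat = run_fifo(arrivals, services)

    # Closed loop: aim for one request per second, but never issue a request
    # until the previous one has completed, so there's never any queueing and
    # latency is just the service time.
    t, closed_lat = 0.0, []
    while t < DURATION:
        service = SLOW if t == 30.0 else NORMAL
        closed_lat.append(service)
        t = max(t + 1.0, t + service)  # next issue: >= 1s later and after completion
    closed_lat = np.array(closed_lat)

    for name, lat in (("open", open_lat), ("closed", closed_lat)):
        print(f"{name:6s} loop: mean = {lat.mean():.2f}s, p90 = {np.percentile(lat, 90):.2f}s")

With these numbers, the closed-loop run reports a p90 of 0.5s and a modestly elevated mean because the benchmark politely stops sending requests while the service is stuck, while the open-loop run, which keeps issuing requests on schedule the way independent users would, reports a mean of roughly 2s and a p90 of roughly 7s.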

For more info on this, see Nitsan Wakart's write-up on fixing this issue in the YCSB benchmark or Gil Tene's presentation on this issue.

Appendix: use of unweighted averages

A common issue with averages on dashboards that I've looked at, one that's independent of the issues that come up when we take the average of tail latencies, is that an unweighted average frequently underestimates the actual latency.

Two places I commonly see an unweighted average are when someone gets an overall latency by taking an unweighted average across datacenters and when someone gets a cluster-wide latency by taking an average across shards. Both of these have the same issue, that shards that have lower load tend to have lower latency. This is especially pronounced when we fail away from a datacenter. Services that incorrectly use an unweighted average across datacenters will often show decreased latency even though actually served requests have increased latency.
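
As a toy illustration of the failover case, with all numbers made up: suppose that before failing away from dc2, both datacenters serve similar traffic at around 50ms, and that afterwards dc1 picks up nearly all of the traffic and slows down while the nearly idle dc2 speeds up.

    # Hypothetical post-failover state: dc1 serves almost everything and is
    # slower under the extra load; dc2 serves a trickle and is faster.
    after = [
        {"name": "dc1", "requests": 9_900_000, "avg_latency_ms": 70.0},
        {"name": "dc2", "requests": 100_000, "avg_latency_ms": 10.0},
    ]

    unweighted = sum(d["avg_latency_ms"] for d in after) / len(after)
    weighted = (sum(d["requests"] * d["avg_latency_ms"] for d in after)
                / sum(d["requests"] for d in after))

    print(f"unweighted average across DCs: {unweighted:.1f} ms")  # 40.0, "better" than the ~50ms baseline
    print(f"request-weighted average:      {weighted:.1f} ms")  # 69.4, what served requests actually see

The unweighted number improves after the failover even though nearly every served request got slower, which is exactly the failure mode described above.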

Thanks to Ben Kuhn for comments/corrections/discussion.


  1. This is another item that's somewhat out of date, since this document motivated work from Flavio Brasil and Vladimir Kostyukov to do work on Finagle that reduces the impact of this problem and then, later, work from my then-intern, Xi Yang, on a patch to the kernel scheduler that basically eliminates the problem by preventing cgroups from exceeding their CPU allocation (as opposed to the standard mechanism, which allows cgroups to exceed their allocation and then effectively puts the cgroup to sleep until its amortized cpu allocation is no longer excessive, which is very bad for tail latency). [return]
  2. This is yet another item that's out of date since the kernel, HWENG, and the newly created fleet health team have expended significant effort to drive down the fraction of unhealthy machines. [return]
  3. This is also significantly out of date today. Finagle does now support exporting shard-level histogram data and this can be queried via one-off queries by hitting the exported metrics endpoint. [return]
  4. As we previously noted, opaque latency could come from either the server or the client, but in this case, we have strong evidence that the latency is coming from the cache-1 server and not the service-6 client because opaque latency from the service-6 client should be visible on all requests from service-6 but we only observe elevated opaque latency on requests from service-6 to cache-1 and not to the other servers it "talks to". [return]

Major errors on this blog (and their corrections)

2021-11-22 08:00:00

Here's a list of errors on this blog that I think were fairly serious. While what I think of as serious is, of course, subjective, I don't think there's any reasonable way to avoid that because, e.g., I make a huge number of typos, so many that the majority of acknowledgements on many posts are for people who e-mailed or DM'ed me typo fixes.

A list that included everything, including typos, would be both uninteresting for other people to read and high overhead for me, which is why I've drawn the line somewhere. An example of an error I don't think of as serious is that, in this post on how I learned to program, I originally had the dates wrong on when the competition programmers from my high school made money (it was a couple years after I thought it was). In that case, and many others, I don't think that the date being wrong changes anything significant about the post.

Although I'm publishing the original version of this in 2021, I expect this list to grow over time. I hope that I've become more careful and that the list will grow more slowly in the future than it has in the past, but that remains to be seen. I view it as a good sign that a large fraction of the list is from my first three months of blogging, in 2013, but that's no reason to get complacent!

I've added a classification below that's how I think of the errors, but that classification is also arbitrary and the categories aren't even mutually exclusive. If I ever collect enough of these that it's difficult to hold them all in my head at once, I might create a tag system and use that to classify them instead, but I hope to not accumulate so many major errors that I feel like I need a tag system for readers to easily peruse them.

Thanks to Anja Boskovic and Ville Sundberg for comments/corrections/discussion.

Individuals matter

2021-11-15 08:00:00

One of the most common mistakes I see people make when looking at data is incorrectly using an overly simplified model. A specific variant of this that has derailed the majority of work roadmaps I've looked at is treating people as interchangeable, as if it doesn't matter who is doing what, as if individuals don't matter.

Individuals matter.

A pattern I've repeatedly seen during the roadmap creation and review process is that people will plan out the next few quarters of work and then assign some number of people to it, one person for one quarter to a project, two people for three quarters to another, etc. Nominally, this process enables teams to understand what other teams are doing and plan appropriately. I've never worked in an organization where this actually worked, where this actually enabled teams to effectively execute with dependencies on other teams.

What I've seen happen instead is, when work starts on the projects, people will ask who's working on the project and then will make a guess at whether or not the project will be completed on time or in an effective way or even be completed at all based on who ends up working on the project. "Oh, Joe is taking feature X? He never ships anything reasonable. Looks like we can't depend on it because that's never going to work. Let's do Y instead of Z since that won't require X to actually work". The roadmap creation and review process maintains the polite fiction that people are interchangeable, but everyone knows this isn't true and teams that are effective and want to ship on time can't play along when the rubber hits the road even if they play along with the managers, directors, and VPs, who create roadmaps as if people can be generically abstracted over.

Another place the non-fungibility of people causes predictable problems is with how managers operate teams. Managers who want to create effective teams1 end up fighting the system in order to do so. Non-engineering orgs mostly treat people as fungible, and the finance org at a number of companies I've worked for forces the engineering org to treat people as fungible by requiring the org to budget in terms of headcount. The company, of course, spends money and not "heads", but internal bookkeeping is done in terms of "heads", so $X of budget will be, for some team, translated into something like "three staff-level heads". There's no way to convert that into "two more effective and better-paid staff level heads"2. If you hire two staff engineers and not a third, the "head" and the associated budget will eventually get moved somewhere else.

One thing I've repeatedly seen is that a hiring manager will want to hire someone who they think will be highly effective, or even just someone who has specialized skills, and then not be able to hire because the company has translated budget into "heads" at a rate that doesn't allow for hiring that kind of head. There will be a "comp team" or other group in HR that will object because the comp team has no concept of "an effective engineer" or "a specialty that's hard to hire for"; to the comp team, a person is defined by their role, level, and location, and someone who's paid too much for their role and level is therefore a bad hire. If anyone reasonable had power over the process that they were willing to use, this wouldn't happen but, by design, the bureaucracy is set up so that few people have power3.

A similar thing happens with retention. A great engineer I know who was regularly creating $x0M/yr4 of additional profit for the company wanted to move home to Portugal, so the company cut the person's cash comp by a factor of four. The company also offered to only cut his cash comp by a factor of two if he moved to Spain instead of Portugal. This was escalated up to the director level, but that wasn't sufficient to override HR, so he left for a company that doesn't have location-based pay. HR didn't care that the person made the company more money than HR saves by doing location adjustments for all international employees combined because HR at the company had no notion of the value of an employee, only the cost, title, level, and location5.

Relatedly, a "move" I've seen twice, once from a distance and once from up close, is when HR decides attrition is too low. In one case, the head of HR thought that the company's ~5% attrition was "unhealthy" because it was too low and in another, HR thought that the company's attrition sitting at a bit under 10% was too low. In both cases, the company made some moves that resulted in attrition moving up to what HR thought was a "healthy" level. In the case I saw from a distance, folks I know at the company agree that the majority of the company's best engineers left over the next year, many after only a few months. In the case I saw up close, I made a list of the most effective engineers I was aware of (like the person mentioned above who increased the company's revenue by 0.7% on his paternity leave) and, when the company successfully pushed attrition to over 10% overall, the most effective engineers left at over double that rate (which understates the impact of this because they tended to be long-tenured and senior engineers, where the normal expected attrition would be less than half the average company attrition).

Some people seem to view companies like a game of SimCity, where if you want more money, you can turn a knob, increase taxes, and get more money, uniformly impacting the city. But companies are not a game of SimCity. If you want more attrition and turn a knob that cranks that up, you don't get additional attrition that's sampled uniformly at random. People, as a whole, cannot be treated as an abstraction where the actions company leadership takes impact everyone in the same way. The people who are most effective will be disproportionately likely to leave if you turn a knob that leads to increased attrition.

So far, we've talked about how treating individual people as fungible doesn't work for corporations but, of course, it also doesn't work in general. For example, a complaint from a friend of mine who's done a fair amount of "on the ground" development work in Africa is that a lot of people who are looking to donate want clear, simple criteria to guide their donations (e.g., an RCT showed that the intervention was highly effective). But many effective interventions cannot have their impact demonstrated ex ante in any simple way because, among other reasons, the composition of the team implementing the intervention is important, so a randomized trial or other experiment doesn't generalize beyond the teams from the trial in the context they were operating in during the trial.

An example of this would be an intervention they worked on that, among other things, helped wipe out guinea worm in a country. Ex post, we can say that was a highly effective intervention since it was a team of three people operating on a budget of $12/(person-day)6 for a relatively short time period, making it a high ROI intervention, but there was no way to make a quantitative case for the intervention ex ante, nor does it seem plausible that there could've been a set of randomized trials or experiments that would've justified the intervention.

Their intervention wasn't wiping out guinea worm, that was just a side effect. The intervention was, basically, travelling around the country and embedding in regional government offices in order to understand their problems and then advise/facilitate better decision making. In the course of talking to people and suggesting improvements/changes, they realized that guinea worm could be wiped out with better distribution of clean water (guinea worm can come from drinking unfiltered water; giving people clean water can solve that problem) and that the aid money flowing into the country specifically for water-related projects, like building wells, was already sufficient if it was distributed to places in the country that had high rates of guinea worm due to contaminated water instead of to the places aid money was flowing to (which were locations that had a lot of aid money flowing to them for a variety of reasons, such as being near a local "office" that was doing a lot of charity work). The specific thing this team did to help wipe out guinea worm was to give powerpoint presentations to government officials on how the government could advise organizations receiving aid money on how those organizations could more efficiently place wells. At the margin, wiping out guinea worm in a country would probably be sufficient for the intervention to be high ROI, but that's a very small fraction of the "return" from this three person team. I only mention it because it's a self-contained, easily-quantifiable change. Most of the value of "leveling up" decision making in regional government offices is very difficult to quantify (and, to the extent that it can be quantified, will still have very large error bars).

Many interventions that seem the same ex ante, probably even most, produce little to no impact. My friend has a lot of comments on organizations that send a lot of people around to do similar sounding work but that produce little value, such as the Peace Corps.

A major difference between my friend's team and most teams is that my friend's team was composed of people who had a track record of being highly effective across a variety of contexts. Earlier in her career, my friend started a job at a large-ish ($5B/yr revenue) government-run utility company and was immediately assigned a problem that, unbeknownst to her, had been an open problem for years that was considered to be unsolvable. No one was willing to touch the problem, so they hired her because they wanted a scapegoat to blame and fire when the problem blew up. Instead, she solved the problem she was assigned as well as a number of other problems that were considered unsolvable. A team of three such people will be able to get a lot of mileage out of potentially high ROI interventions that most teams would not succeed at, such as going to a foreign country and improving governmental decision making in regional offices across the country enough that the government is able to solve serious open problems that had been plaguing the country for decades.

Many of the highest ROI interventions are similarly skill intensive and not amenable to simple back-of-the-envelope calculations, but most discussions I see on the topic, both in person and online, rely heavily on simplistic but irrelevant back-of-the-envelope calculations. This problem is not limited to cocktail-party conversations. My friend's intervention was almost killed by the organization she worked for because the organization was infested with what she thinks of as "overly simplistic EA thinking", which caused leadership in the organization to try to redirect resources to projects where the computation of expected return was simpler because those projects were thought to be higher impact even though they were, ex post, lower impact. Of course, we shouldn't judge interventions on how they performed ex post since that will overly favor high variance interventions, but I think that someone thinking it through, who was willing to exercise their judgement instead of outsourcing their judgement to a simple metric, could and should say that the intervention in question was a good choice ex ante.

This issue of more legible projects getting more funding exists across organizations as well as within them. For example, my friend says that, back when GiveWell was mainly or only recommending charities that had easily quantifiable returns, she basically couldn't get her friends who worked in other fields to put resources towards efforts that weren't endorsed by GiveWell. People who didn't know about her aid background would say things like "haven't you heard of GiveWell?" when she suggested putting resources towards any particular cause, project, or organization.

I talked to a friend of mine who worked at GiveWell during that time period about this and, according to him, the reason GiveWell initially focused on charities that had easily quantifiable value wasn't that they thought those were the highest impact charities. Instead, it was because, as a young organization, they needed to be credible and it's easier to make a credible case for charities whose value is easily quantifiable. He would not, and he thinks GiveWell would not, endorse donors funnelling all resources into charities endorsed by GiveWell and neglecting other ways to improve the world. But many people want the world to be simple and apply the algorithm "charity on GiveWell list = good; not on GiveWell list = bad" because it makes the world simple for them.

Unfortunately for those people, as well as for the world, the world is not simple.

Coming back to the tech company examples, Laurence Tratt notes something that I've also observed:

One thing I've found very interesting in large organisations is when they realise that they need to do something different (i.e. they're slowly failing and want to turn the ship around). The obvious thing is to let a small team take risks on the basis that they might win big. Instead they tend to form endless committees which just perpetuate the drift that caused the committees to be formed in the first place! I think this is because they really struggle to see people as anything other than fungible, even if they really want to: it's almost beyond their ability to break out of their organisational mould, even when it spells long-term doom.

One lens we can use to look at what's going on is legibility. When you have a complex system, whether that's a company with thousands of engineers or a world with many billions of dollars going to aid work, the system is too complex for any decision maker to really understand, whether that's an exec at a company or a potential donor trying to understand where their money should go. One way to address this problem is by reducing the perceived complexity of the problem via imagining that individuals are fungible, making the system more legible. That produces relatively inefficient outcomes but, unlike trying to understand the issues at hand, it's highly scalable, and if there's one thing that tech companies like, it's doing things that scale, and treating a complex system like it's SimCity or Civilization is highly scalable. When returns are relatively evenly distributed, losing out on potential outlier returns in the name of legibility is a good trade-off. But when ROI is a heavy-tailed distribution, when the right person can, on their paternity leave, increase a giant tech company's revenue by 0.7% and then much more when they work on that full-time, then severely tamping down on the right side of the curve to improve legibility is very expensive and can cost you the majority of your potential returns.

Thanks to Laurence Tratt, Pam Wolf, Ben Kuhn, Peter Bhat Harkins, John Hergenroeder, Andrey Mishchenko, Joseph Kaptur, and Sophia Wisdom for comments/corrections/discussion.

Appendix: re-orgs

A friend of mine recently told me a story about a trendy tech company where they tried to move six people to another project, one that the people didn't want to work on and thought didn't really make sense. The result was that two senior devs quit, the EM retired, one PM was fired (long story), and three people left the team. The teams for both the old project and the new project had to be re-created from scratch.

It could be much worse. In that case, at least there were some people who didn't leave the company. I once asked someone why feature X, which had been publicly promised, hadn't been implemented yet and why the entire sub-product was broken. The answer was that, after about a year of work, when shipping the feature was thought to be weeks away, leadership decided that the feature, which was previously considered a top priority, was no longer a priority and should be abandoned. The team argued that the feature was very close to being done and they just wanted enough runway to finish the feature. When that was denied, the entire team quit and the sub-product has slowly decayed since then. After many years, there was one attempted reboot of the team but, for reasons beyond the scope of this story, it was done with a new manager managing new grads and didn't really re-create what the old team was capable of.

As we've previously seen, an effective team is difficult to create, due to the institutional knowledge that exists on a team, as well as the team's culture, but destroying a team is very easy.

I find it interesting that so many people in senior management roles persist in thinking that they can re-direct people as easily as opening up the city view in Civilization and assigning workers to switch from one task to another when the senior ICs I talk to have high accuracy in predicting when these kinds of moves won't work out.


  1. On the flip side, there are managers who want to maximize the return to their career. At every company I've worked at that wasn't a startup, doing that involves moving up the ladder, which is easiest to do by collecting as many people as possible. At one company I've worked for, the explicitly stated promo criteria are basically "how many people report up to this person".

    Tying promotions and compensation to the number of people managed could make sense if you think of people as mostly fungible, but is otherwise an obviously silly idea.

    [return]
  2. This isn't quite this simple when you take into account retention budgets (money set aside from a pool that doesn't come out of the org's normal budget, often used to match offers from people who are leaving), etc., but adding this nuance doesn't really change the fundamental point. [return]
  3. There are advantages to a system where people don't have power, such as mitigating abuses of power, various biases, nepotism, etc. One can argue that reducing variance in outcomes by making people powerless is the preferred result, but in winner-take-most markets, which many tech markets are, forcing everyone down to lowest-common-denominator effectiveness is a recipe for being an also-ran.

    A specific, small-scale, example of this is the massive advantage companies that don't have a bureaucratic comms/PR approval process for technical blog posts have. The theory behind having the onerous process that most companies have is that the company is protected from downside risk of a bad blog post, but examples of bad engineering blog posts that would've been mitigated by having an onerous process are few and far between, whereas the companies that have good processes for writing publicly get a lot of value that's easy to see.

    A larger scale example of this is that the large, now >= $500B companies, all made aggressive moves that wouldn't have been possible at their bureaucracy laden competitors, which allowed them to wipe the floor with their competitors. Of course, many other companies that made serious bets instead of playing it safe failed more quickly than companies trying to play it safe, but those companies at least had a chance, unlike the companies that played it safe.

    [return]
  4. I'm generally skeptical of claims like this. At multiple companies that I've worked for, if you tally up the claimed revenue or user growth wins and compare them to actual revenue or user growth, you can see that there's some funny business going on since the total claimed wins are much larger than the observed total.

    Just because I'm generally curious about measurements, I sometimes did my own analysis of people's claimed wins and I almost always came up with an estimate that was much lower than the original estimate. Of course, I generally didn't publish these results internally since that would, in general, be a good way to make a lot of enemies without causing any change. In one extreme case, I found that the experimental methodology one entire org used was broken, causing them to get spurious wins in their A/B tests. I quietly informed them and they did nothing about it, which was the only reasonable move for them since having experiments that systematically showed improvement when none existed was a cheap and effective way for the org to gain more power by having its people get promoted and having more headcount allocated to it. And if anyone with power over the bureaucracy cared about accuracy of results, such a large discrepancy between claimed wins and actual results couldn't exist in the first place.

    Anyway, despite my general skepticism of claimed wins, I found this person's claimed wins highly credible after checking them myself. A project of theirs, done on their paternity leave (done while on leave because their manager and, really, the organization as well as the company, didn't support the kind of work they were doing), increased the company's revenue by 0.7%, a result that was robust and actually increased in value through a long-term holdback, and they were able to produce wins of that magnitude after leadership was embarrassed into allowing them to do valuable work.

    P.S. If you'd like to play along at home, another fun game you can play is figuring out which teams and orgs hit their roadmap goals. For bonus points, plot the percentage of roadmap goals a team hits vs. their headcount growth, as well as how predictive hitting last quarter's goals is of hitting next quarter's goals across teams.

    [return]
  5. I've seen quite a few people leave their employers due to location adjustments during the pandemic. In one case, HR insisted the person was actually very well compensated because, even though it might appear as if the person isn't highly paid because they were paid significantly less than many people who were one level below them, according to HR's formula, which included a location-based pay adjustment, the person was one of the highest paid people for their level at the entire company in terms of normalized pay. Putting aside abstract considerations about fairness, for an employee, HR telling them that they're highly paid given their location is like HR having a formula that pays based on height telling an employee that they're well paid for their height. That may be true according to whatever formula HR has but, practically speaking, that means nothing to the employee, who can go work somewhere that has a smaller height-based pay adjustment.

    Companies were able to get away with severe location-based pay adjustments with no cost to themselves before the pandemic. But, since the pandemic, a lot of companies have ramped up remote hiring and some of those companies have relatively small location-based pay adjustments, which has allowed them to disproportionately hire away who they choose from companies that still maintain severe location-based pay adjustments.

    [return]
  6. Technically, their budget ended up being higher than this because one team member contracted typhoid and paid for some medical expenses from their personal budget and not from the organization's budget, but $12/(person-day), the organizational funding, is a pretty good approximation. [return]

Culture matters

2021-11-08 08:00:00

Three major tools that companies have to influence behavior are incentives, process, and culture. People often mean different things when talking about these, so I'll provide an example of each so we're on the same page (if you think that I should be using a different word for the concept, feel free to mentally substitute that word).

If you read "old school" thought leaders, many of them advocate for a culture-only approach, e.g., Ken Thompson saying, to reduce bug rate, that tools (which, for the purposes of this post, we'll call process) aren't the answer, having people care to and therefore decide to avoid writing bugs is the answer or Bob Martin saying "The solution to the software apocalypse is not more tools. The solution is better programming discipline."

The emotional reaction those kinds of over-the-top statements evoke, combined with the ease of rebutting them, has led to a backlash against cultural solutions, leading people to say things like "you should never say that people need more discipline and you should instead look at the incentives of the underlying system", in the same way that the 10x programmer meme and the associated comments have caused a backlash that's led people to say things like velocity doesn't matter at all or there's absolutely no difference in velocity between programmers (as Jamie Brandon has noted, a lot of velocity comes down to caring about and working on velocity, so this is also part of the backlash against culture).

But if we look at quantifiable output, we can see that, even if processes and incentives are the first-line tools a company should reach for, culture also has a large impact. For example, if we look at manufacturing defect rate, some countries persistently have lower defect rates than others on a timescale of decades1, generally robust across companies, even when companies are operating factories in multiple countries and importing the same process and incentives to each factory to the extent that's possible, due to cultural differences that impact how people work.

Coming back to programming, Jamie's post on "moving faster" notes:

The main thing that helped is actually wanting to be faster.

Early on I definitely cared more about writing 'elegant' code or using fashionable tools than I did about actually solving problems. Maybe not as an explicit belief, but those priorities were clear from my actions.

I probably also wasn't aware how much faster it was possible to be. I spent my early career working with people who were as slow and inexperienced as I was.

Over time I started to notice that some people are producing projects that are far beyond what I could do in a single lifetime. I wanted to figure out how to do that, which meant giving up my existing beliefs and trying to discover what actually works.

I was lucky to have the opposite experience starting out since my first full-time job was at Centaur, a company that, at the time, had very high velocity/productivity. I'd say that I've only ever worked on one team with a similar level of productivity, and that's my current team, but my current team is fairly unusual for a team at a tech company (e.g., the median level on my team is "senior staff")2. A side effect of having started my career at such a high velocity company is that I generally find the pace of development slow at big companies and I see no reason to move slowly just because that's considered normal. I often hear similar comments from people I talk to at big companies who've previously worked at non-dysfunctional but not even particularly fast startups. A regular survey at one of the trendiest companies around asks "Do you feel like your dev speed is faster or slower than your previous job?" and the responses are bimodal, depending on whether the respondent came from a small company or a big one (with dev speed at TrendCo being slower than at startups and faster than at larger companies).

There's a story that, IIRC, was told by Brian Enos, where he was practicing timed drills with the goal of practicing until he could complete a specific task at or under his usual time. He was having a hard time hitting his normal time and was annoyed at himself because he was slower than usual and kept at it until he hit his target, at which point he realized he misremembered the target and was accidentally targeting a new personal best time that was better than he thought was possible. While it's too simple to say that we can achieve anything if we put our minds to it, almost none of us are operating at anywhere near our capacity and what we think we can achieve is often a major limiting factor. Of course, at the limit, there's a tradeoff between velocity and quality and you can't get velocity "for free", but, when it comes to programming, we're so far from the Pareto frontier that there are free wins if you just realize that they're available.

One way in which culture influences this is that people often absorb their ideas of what's possible from the culture they're in. For a non-velocity example, one thing I noticed after attending RC was that a lot of speakers at the well-respected non-academic non-enterprise tech conferences, like Deconstruct and Strange Loop, also attended RC. Most people hadn't given talks before attending RC and, when I asked people, a lot of people had wanted to give talks but didn't realize how straightforward the process for becoming a speaker at "big" conferences is (have an idea, write it down, and then submit what you wrote down as a proposal). It turns out that giving talks at conferences is easy to do and a major blocker for many folks is just knowing that it's possible. In an environment where lots of people give talks and, where people who hesitantly ask how they can get started are told that it's straightforward, a lot of people will end up giving talks. The same thing is true of blogging, which is why a disproportionately large fraction of widely read programming bloggers started blogging seriously after attending RC. For many people, the barrier to starting a blog is some combination of realizing it's feasible to start a blog and that, from a technical standpoint, it's very easy to start a blog if you just pick any semi-reasonable toolchain and go through the setup process. And then, because people give talks and write blog posts, they get better at giving talks and writing blog posts so, on average, RC alums are probably better speakers and writers than random programmers even though there's little to no skill transfer or instruction at RC.

Another area where culture can really drive skills is skills that are highly attitude dependent. An example of this is debugging. As Julia Evans has noted, having a good attitude is a major component of debugging effectiveness. This is something Centaur was very good at instilling in people, to the point that nearly everyone in my org at Centaur would be considered a very strong debugger by tech company standards.

At big tech companies, it's common to see people give up on bugs after trying a few random things that didn't work. In one extreme example, someone I know at a mid-10-figure tech company said that it never makes sense to debug a bug that takes more than a couple hours to debug because engineer time is too valuable to waste on bugs that take longer than that to debug, an attitude this person picked up from the first team they worked on. Someone who picks up that kind of attitude about debugging is unlikely to become a good debugger until they change their attitude, and many people, including this person, carry the attitudes and habits they pick up at their first job around for quite a long time3.

By tech standards, Centaur is an extreme example in the other direction. If you're designing a CPU, it's not considered ok to walk away from a bug that you don't understand. Even if the symptom of the bug isn't serious, it's possible that the underlying cause is actually serious and you won't observe the more serious symptom until you've shipped a chip, so you have to go after even seemingly trivial bugs. Also, it's pretty common for there to be no good or even deterministic reproduction of a bug. The repro is often something like "run these programs with these settings on the system and then the system will hang and/or corrupt data after some number of hours or days". When debugging a bug like that, there will be numerous wrong turns and dead ends, some of which can eat up weeks or months. As a new employee watching people work on those kinds of bugs, what I observed was that people would come in day after day and track down bugs like that, not getting frustrated and not giving up. When that's the culture and everyone around you has that attitude, it's natural to pick up the same attitude. Also, a lot of practical debugging skill is applying tactical skills picked up from having debugged a lot of problems, which naturally falls out of spending a decent amount of time debugging problems with a positive attitude, especially with exposure to hard debugging problems.

Of course, most bugs at tech companies don't warrant months of work, but there's a big difference between intentionally leaving some bugs undebugged because some bugs aren't worth fixing and having poor debugging skills from never having ever debugged a serious bug and then not being able to debug any bug that isn't completely trivial.

Cultural attitudes can drive a lot more than individual skills like debugging. Centaur had, per capita, by far the lowest serious production bug rate of any company I've worked for, at well under one per year with ~100 engineers. By comparison, I've never worked on a team 1/10th that size that didn't have at least 10x the rate of serious production issues. Like most startups, Centaur was very light on process and it was also much lighter on incentives than the big tech companies I've worked for.

One component of this was that there was a culture of owning problems, regardless of what team you were on. If you saw a problem, you'd fix it, or, if there was a very obvious owner, you'd tell them about the problem and they'd fix it. There weren't roadmaps, standups, kanban, or anything else to get people to work on important problems. People did it without needing to be reminded or prompted.

That's the opposite of what I've seen at two of the three big tech companies I've worked for, where the median person avoids touching problems outside of their team's mandate like the plague, and someone who isn't politically savvy who brings up a problem to another team will get a default answer of "sorry, this isn't on our roadmap for the quarter, perhaps we can put this on the roadmap in [two quarters from now]", with the same response repeated to anyone naive enough to bring up the same issue two quarters later. At every tech company I've worked for, huge, extremely costly, problems slip through the cracks all the time because no one wants to pick them up. I never observed that happening at Centaur.

A side effect of big company tech culture is that someone who wants to actually do the right thing can easily do very high (positive) impact work by just going around and fixing problems that any intern could solve, if they're willing to ignore organizational processes and incentives. You can't shake a stick without hitting a problem that's worth more to the company than my expected lifetime earnings and it's easy to knock off multiple such problems per year. Of course, the same forces that cause so many trivial problems to not get solved mean that people who solve those problems don't get rewarded for their work4.

Conversely, in eight years at Centaur, I only found one trivial problem whose fix was worth more than I'll earn in my life because, in general, problems would get solved before they got to that point. I've seen various big company attempts to fix this problem using incentives (e.g., monetary rewards for solving important problems) and process (e.g., making a giant list of all projects/problems, on the order of 1000 projects, and having a single person order them, along with a bureaucratic system where everyone has to constantly provide updates on their progress via JIRA so that PMs can keep sending progress updates to the person who's providing a total order over the work of thousands of engineers5), but none of those attempts have worked even half as well as having a culture of ownership (to be fair to incentives, I've heard that FB uses monetary rewards to good effect, but I've failed FB's interview three times, so I haven't been able to observe how that works myself).

Another component that resulted in a relatively low severe bug rate was that, across the company at Centaur, people cared about quality in a way that I've never seen at a team level let alone at an org level at a big tech company. When you have a collection of people who care about quality and feel that no issue is off limits, you'll get quality. And when you onboard people, as long as you don't do it so quickly that the culture is overwhelmed by the new hires, they'll also tend to pick up the same habits and values, especially when you hire new grads. While it's not exactly common, there are plenty of small firms out there with a culture of excellence that generally persists without heavyweight processes or big incentives, but this doesn't work at big tech companies since they've all gone through a hypergrowth period where it's impossible to maintain such extreme (by mainstream standards) cultural values.

So far, we've mainly discussed companies transmitting culture to people, but something that I think is no less important is how people then carry that culture with them when they leave. I've been reasonably successful since changing careers from hardware to software and I think that, among the factors that are under my control, one of the biggest ones is that I picked up effective cultural values from the first place I worked full-time and continue to operate in the same way, which is highly effective. I've also seen this in other people who, career-wise, "grew up" in a culture of excellence and then changed to a different field where there's even less direct skill transfer, e.g., from skiing to civil engineering. Relatedly, if you read books from people who discuss the reasons why they were very effective in their field, e.g., Practical Shooting by Brian Enos, Playing to Win by David Sirlin, etc., the books tend to contain the same core ideas (serious observation and improvement of skills, the importance of avoiding emotional self-sabotage, the importance of intuition, etc.).

Anyway, I think that cultural transmission of values and skills is an underrated part of choosing a job (some things I would consider overrated are prestige and general reputation) and that people should be thoughtful about what cultures they spend time in because not many people are able to avoid at least somewhat absorbing the cultural values around them6.

Although this post is oriented around tech, there's nothing specific to tech about this. A classic example is how idealistic students will go to law school with the intention of doing "save the world" type work and then absorb the prestige-transmitted cultural values of the students around them and then go into the most prestigious job they can get which, when it's not a clerkship, will be a "BIGLAW" job that's the opposite of "save the world" work. To first approximation, everyone thinks "that will never happen to me", but from having watched many people join organizations where they initially find the values and culture very wrong, almost no one is able to stay without, to some extent, absorbing the values around them; very few people are ok with everyone around them looking at them like they're an idiot for having the wrong values.

Appendix: Bay area culture

One thing I admire about the bay area is how infectious people's attitudes are with respect to trying to change the world. Everywhere I've lived, people gripe about problems (the mortgage industry sucks, selling a house is high friction, etc.). Outside of the bay area, it's just griping, but in the bay, when I talk to someone who was griping about something a year ago, there's a decent chance they've started a startup to try to address one of the problems they're complaining about. I don't think that people in the bay area are fundamentally different from people elsewhere, it's more that when you're surrounded by people who are willing to walk away from their jobs to try to disrupt an entrenched industry, it seems pretty reasonable to do the same thing (which also leads to network effects that make it easier from a "technical" standpoint, e.g., easier fundraising). There's a kind of earnestness in these sorts of complaints and attempts to fix them that's easy to mock, but that earnestness is something I really admire.

Of course, not all of bay area culture is positive. The bay has, among other things, a famously flaky culture to an extent I found shocking when I moved there. Relatively early on in my time there, I met some old friends for dinner and texted them telling them I was going to be about 15 minutes late. They were shocked when I showed up because they thought that saying that I was going to be late actually meant that I wasn't going to show up (another norm that surprised me that's an even more extreme version was that, for many people, not confirming plans shortly before their commencement means that the person has cancelled, i.e., plans are cancelled by default).

A related norm that I've heard people complain about is how management and leadership will say yes to everything in a "people pleasing" move to avoid conflict, which actually increases conflict as people who heard "yes" as a "yes" and not as "I'm saying yes to avoid saying no but don't actually mean yes" are later surprised that "yes" meant "no".

Appendix: Centaur's hiring process

One comment people sometimes have when I talk about Centaur is that they must've had some kind of incredibly rigorous hiring process that resulted in hiring elite engineers, but the hiring process was much less selective than any "brand name" big tech company I've worked for (Google, MS, and Twitter) and not obviously more selective than boring, old school, companies I've worked for (IBM and Micron). The "one weird trick" was onboarding, not hiring.

For new grad hiring (and, proportionally, we hired a lot of new grads), recruiting was more difficult than at any other company I'd worked for. Senior hiring wasn't difficult because Centaur had a good reputation locally, in Austin, but among new grads, no one had heard of us and no one wanted to work for us. When I recruited at career fairs, I had to stand out in front of our booth and flag down people who were walking by to get anyone to talk to us. This meant that we couldn't be picky about who we interviewed. We really ramped up hiring of new grads around the time that Jeff Atwood, in his very influential post Why Can't Programmers.. Program?, popularized the idea that there are a bunch of fake programmers out there applying for jobs and that you'll end up with programmers who can't program if you don't screen people out with basic coding questions (the bolding below is his):

I am disturbed and appalled that any so-called programmer would apply for a job without being able to write the simplest of programs. That's a slap in the face to anyone who writes software for a living. ... It's a shame you have to do so much pre-screening to have the luxury of interviewing programmers who can actually program. It'd be funny if it wasn't so damn depressing

Since we were a relatively coding oriented hardware shop (verification engineers primarily wrote software and design engineers wrote a lot of tooling), we tried asking a simple coding question where people were required to code up a function to output Fibonacci numbers given a description of how to compute them (the naive solution was fine; a linear time or faster solution wasn't necessary). We dropped that question because no one got it without being walked through the entire thing in detail, which meant that the question had zero discriminatory power for us.
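
For concreteness, the bar was roughly the following (a sketch of my own for illustration, not Centaur's actual interview materials): a direct translation of the definition was fine, and something like the iterative version wasn't required.

```python
def fib_naive(n: int) -> int:
    # Direct translation of the definition; exponential time, which was fine.
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

def fib_linear(n: int) -> int:
    # Linear-time iterative version, which candidates weren't required to produce.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

assert [fib_naive(i) for i in range(10)] == [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
assert [fib_linear(i) for i in range(10)] == [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```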

Despite not really asking a coding question, people did things like write hairy concurrent code (internal processor microcode, which often used barriers as the concurrency control mechanism) and create tools at a higher velocity and lower bug rate than I've seen anywhere else I've worked.

We were much better off avoiding hiring the way everyone else did because that meant we tried to and did hire people that other companies weren't competing over. That wouldn't make sense if other companies were using techniques that were highly effective, but other companies were doing things like asking people to code FizzBuzz and then whiteboard some algorithms. One might expect that doing algorithms interviews would result in hiring people who can solve the exact problems people ask about in interviews, but this turns out not to be the case. The other thing we did was have much less of a prestige filter than most companies, which also let us hire great engineers that other companies wouldn't even consider.

We did have some people who didn't work out, but it was never because they were "so-called programmers" who couldn't "write the simplest of programs". I do know of two cases of "fake programmers" being hired who literally couldn't program, but both were at prestigious companies that have among the most rigorous coding interviews done at tech companies. In one case, it was discovered pretty quickly that the person couldn't code and people went back to review security footage from the interview and realized that the person who interviewed wasn't the person who showed up to do the job. In the other, the person was able to sneak under the radar at Google for multiple years before someone realized that the person never actually wrote any code and tasks only got completed when they got someone else to do the task. The person who realized this eventually scheduled a pair programming session, where they discovered that the person wasn't able to write a loop, didn't know the difference between = and ==, etc., despite being a "senior SWE" (L5/T5) at Google for years.

I'm not going to say that having coding questions will never save you from hiring a fake programmer, but the rate of fake programmers appears to be low enough that a small company can go a decade without hiring a fake programmer while not asking a coding question, and larger companies that are targeted by scammers still can't really avoid them even after asking coding questions.

Appendix: importing culture

Although this post is about how company culture impacts employees, of course employees impact company culture as well. Something that seems underrated in hiring, especially of senior leadership and senior ICs, is how they'll impact culture. Something I've repeatedly seen, both up close and from a distance, is the hiring of a new senior person who manages to import their culture, which isn't compatible with the existing company's culture, causing serious problems and, frequently, high attrition as things settle down.

Now that I've been around for a while, I've been in the room for discussions on a number of very senior hires and I've never seen anyone else bring up whether or not someone will import incompatible cultural values other than really blatant issues, like the person being a jerk or making racist or sexist comments in the interview.

Thanks to Peter Bhat Harkins, Laurence Tratt, Julian Squires, Anja Boskovic, Tao L., Justin Blank, Ben Kuhn, V. Buckenham, Mark Papadakis, and Jamie Brandon for comments/corrections/discussion.


  1. What countries actually have low defect rate manufacturing is often quite different from the general public reputation. To see this, you really need to look at the data, which is often NDA'd and generally only spread in "bar room" discussions. [return]
  2. Centaur had what I sometimes called "the world's stupidest business model", competing with Intel on x86 chips starting in 1995, so it needed an extremely high level of productivity to survive. Through the bad years, AMD survived by selling off pieces of itself to fund continued x86 development and every other competitor (Rise, Cyrix, TI, IBM, UMC, NEC, and Transmeta) got wiped out. If you compare Centaur to the longest surviving competitor that went under, Transmeta, Centaur just plain shipped more quickly, which is a major reason that Centaur was able to survive until 2021 (when it was pseudo-acqui-hired by Intel) and Transmeta went under in 2009 after burning through ~$1B of funding (including payouts from lawsuits). Transmeta was founded in 1995 and shipped its first chip in 2000, which was considered a normal tempo for the creation of a new CPU/microarchitecture at the time; Centaur shipped its first chip in 1997 and continued shipping at a high cadence until 2010 or so (how things got slower and slower until the company stalled out and got acqui-hired is a topic for another post). [return]
  3. This person initially thought the processes and values on their first team were absurd before the cognitive dissonance got to them and they became a staunch advocate of the company's culture, which is typical for folks joining a company that has obviously terrible practices. [return]
  4. This illustrates one way in which incentives and culture are non-independent. What I've seen in places where this kind of work isn't rewarded is that, due to the culture, making these sorts of high-impact changes frequently requires burnout inducing slogs, at the end of which there is no reward, which causes higher attrition among people who have a tendency to own problems and do high-impact work. What I've observed in environments like this is that the environment differentially retains people who don't want to own problems, which then makes it more difficult and more burnout inducing for new people who join and attempt to fix serious problems. [return]
  5. I'm adding this note because, when I've described this to people, many people thought that this must be satire. It is not satire. [return]
  6. As with many other qualities, there can be high variance within a company as well as across companies. For example, there's a team I sometimes encountered at a company I've worked for that has a very different idea of customer service than most of the company and people who join that team and don't quickly bounce usually absorb their values.

    Much of the company has a pleasant attitude towards internal customers, but this team has a "the customer is always wrong" attitude. A funny side effect of this is that, when I dealt with the team, I got the best support when a junior engineer who hadn't absorbed the team's culture was on call, and sometimes a senior engineer would say something was impossible or infeasible only to have a junior engineer follow up and trivially solve the problem.

    [return]

Willingness to look stupid

2021-10-21 08:00:00

People frequently1 think that I'm very stupid. I don't find this surprising, since I don't mind if other people think I'm stupid, which means that I don't adjust my behavior to avoid seeming stupid, which results in people thinking that I'm stupid. Although there are some downsides to people thinking that I'm stupid, e.g., failing interviews where the interviewer very clearly thought I was stupid, I think that, overall, the upsides of being willing to look stupid have greatly outweighed the downsides.

I don't know why this one example sticks in my head but, for me, the most memorable example of other people thinking that I'm stupid was from college. I've had numerous instances where more people thought I was stupid and also where people thought the depth of my stupidity was greater, but this one was really memorable for me.

Back in college, there was one group of folks that, for whatever reason, stood out to me as people who really didn't understand the class material. When they talked, they said things that didn't make any sense, they were struggling in the classes and barely passing, etc. I don't remember any direct interactions but, one day, a friend of mine who also knew them remarked to me, "did you know [that group] thinks you're really dumb?". I found that interesting and asked why. It turned out the reason was that I asked really stupid sounding questions.

In particular, it's often the case that there's a seemingly obvious but actually incorrect reason something is true, a slightly less obvious reason the thing seems untrue, and then a subtle and complex reason that the thing is actually true2. I would regularly figure out that the seemingly obvious reason was wrong and then ask a question to try to understand the subtler reason, which sounded stupid to someone who thought the seemingly obvious reason was correct or thought that the refutation to the obvious but incorrect reason meant that the thing was untrue.

The benefit from asking a stupid sounding question is small in most particular instances, but the compounding benefit over time is quite large and I've observed that people who are willing to ask dumb questions and think "stupid thoughts" end up understanding things much more deeply over time. Conversely, when I look at people who have a very deep understanding of topics, many of them frequently ask naive sounding questions and continue to apply one of the techniques that got them a deep understanding in the first place.

I think I first became sure of something that I think of as a symptom of the underlying phenomenon via playing competitive video games when I was in high school. There were few enough people playing video games online back then that you'd basically recognize everyone who played the same game and could see how much everyone improved over time. Just like I saw when I tried out video games again a couple years ago, most people would blame external factors (lag, luck, a glitch, teammates, unfairness, etc.) when they "died" in the game. The most striking thing about that was that people who did that almost never became good and never became great. I got pretty good at the game3 and my "one weird trick" was to think about what went wrong every time something went wrong and then try to improve. But most people seemed more interested in making an excuse to avoid looking stupid (or maybe feeling stupid) in the moment than actually improving, which, of course, resulted in them having many more moments where they looked stupid in the game.

In general, I've found willingness to look stupid to be very effective. Here are some more examples:

Although most of the examples above are "real life" examples, being willing to look stupid is also highly effective at work. Besides the obvious reason that it allows you to learn faster and become more effective, it also makes it much easier to find high ROI ideas. If you go after trendy or reasonable sounding ideas, to do something really extraordinary, you have to have better ideas/execution than everyone else working on the same problem. But if you're thinking about ideas that most people consider too stupid to consider, you'll often run into ideas that are both very high ROI and simple and easy, ones anyone could've done had they not dismissed the idea out of hand. It may still technically be true that you need to have better execution than anyone else who's trying the same thing, but if no one else is trying the same thing, that's easy to do!

I don't actually have to be nearly as smart or work nearly as hard as most people to get good results. If I try to solve a problem by doing what everyone else is doing and go looking for problems where everyone else is looking, then if I want to do something valuable, I'll have to do better than a lot of people, maybe even better than everybody else if the problem is really hard. If the problem is considered trendy, a lot of very smart and hardworking people will be treading the same ground and doing better than that is very difficult. But if I have a dumb thought, one that's too stupid sounding for anyone else to try, I don't necessarily have to be particularly smart or talented or hardworking to come up with valuable solutions. Often, the dumb solution is something any idiot could've come up with and the reason the problem hasn't been solved is that no one was willing to think the dumb thought until an idiot like me looked at the problem.

Overall, I view the upsides of being willing to look stupid as much larger than the downsides. When it comes to things that aren't socially judged, like winning a game, understanding something, or being able to build things due to having a good understanding, it's all upside. There can be downside for things that are "about" social judgement, like interviews and dates but, even there, I think a lot of things that might seem like downsides are actually upsides.

For example, if a date thinks I'm stupid because I ask them what a word means, so much so that they show it in their facial expression and/or tone of voice, I think it's pretty unlikely that we're compatible, so I view finding that out sooner rather than later as upside and not downside.

Interviews are the case where I think there's the most downside since, at large companies, the interviewer likely has no connection to the job or your co-workers, so them having a pattern of interaction that I would view as a downside has no direct bearing on the work environment I'd have if I were offered the job and took it. There's probably some correlation but I can probably get much more signal on that elsewhere. But I think that being willing to say things that I know have a good chance of causing people to think I'm stupid is a deeply ingrained enough habit that it's not worth changing just for interviews and I can't think of another context where the cost is nearly as high as it is in interviews. In principle, I could probably change how I filter what I say only in interviews, but I think that would be a very large amount of work and not really worth the cost. An easier thing to do would be to change how I think so that I reflexively avoid thinking and saying "stupid" thoughts, which a lot of folks seem to do, but that seems even more costly.

Appendix: do you try to avoid looking stupid?

On reading a draft of this, Ben Kuhn remarked,

[this post] caused me to realize that I'm actually very bad at this, at least compared to you but perhaps also just bad in general.

I asked myself "why can't Dan just avoid saying things that make him look stupid specifically in interviews," then I started thinking about what the mental processes involved must look like in order for that to be impossible, and realized they must be extremely different from mine. Then I tried to think about the last time I did something that made someone think I was stupid and realized I didn't have a readily available example.

One problem I expect this post to have is that most people will read this and decide that they're very willing to look stupid. This reminds me of how most people, when asked, think that they're creative, innovative, and take big risks. I think that feels true since people often operate at the edge of their comfort zone, but there's a difference between feeling like you're taking big risks and actually taking big risks, e.g., when asked, someone who is among the most conservative people I know thinks that they take a lot of big risks and names things like sometimes jaywalking as risks that they take.

This might sound ridiculous, as ridiculous as saying that I run into hundreds to thousands of software bugs per week, but I think I run into someone who thinks that I'm an idiot in a way that's obvious to me around once a week. The car insurance example is from a few days ago, and if I wanted to think of other recent examples, there's a long string of them.

If you don't regularly have people thinking that you're stupid, I think it's likely that at least one of the following is true

I think the last one of those is unlikely because, while I sometimes have interactions like the school one described, where the people were too nice to tell me that they think I'm stupid and I only found out via a third party, just as often, the person very clearly wants me to know that they think I'm stupid. The way it happens reminds me of being a pedestrian in NYC, where, when a car tries to cut you off when you have right of way and fails (e.g., when you're crossing a crosswalk and have the walk signal and the driver guns it to try to get in front of you to turn right), the driver will often scream at you and gesture angrily until you acknowledge them and, if you ignore them, will try very hard to get your attention. In the same way that it seems very important to some people who are angry that you know they're angry, many people seem to think it's very important that you know that they think that you're stupid and will keep increasing the intensity of their responses until you acknowledge that they think you're stupid.

One thing that might be worth noting is that I don't go out of my way to sound stupid or otherwise be non-conformist. If anything, it's the opposite. I generally try to conform in areas that aren't important to me when it's easy to conform, e.g., I dressed more casually in the office on the west coast than on the east coast since it's not important to me to convey some particular image based on how I dress and I'd rather spend my "weirdness points" on pushing radical ideas than on dressing unusually. After I changed how I dressed, one of the few people in the office who dressed really sharply in a way that would've been normal in the east coast office jokingly said to me, "so, the west coast got to you, huh?" and a few other people remarked that I looked a lot less stuffy/formal.

Another thing to note is that "avoiding looking stupid" seems to usually go beyond just filtering out comments or actions that might come off as stupid. Most people I talk to (and Ben is an exception here) have a real aversion to evaluating stupid thoughts and (I'm guessing) also to having stupid thoughts. When I have an idea that sounds stupid, it's generally (and again, Ben is an exception here) extremely difficult to get someone to really consider the idea. Instead, most people reflexively reject the idea without really engaging with it at all and (I'm guessing) the same thing happens inside their heads when a potentially stupid sounding thought might occur to them. I think the danger here isn't having a conscious process that lets you decide whether or not to broadcast stupid sounding thoughts (that seems great if it's low overhead); it's having some non-conscious process automatically reject thinking about stupid sounding things.

Of course, stupid-sounding thoughts are frequently wrong, so, if you're not going to rely on social proof to filter out bad ideas, you'll have to hone your intuition or find trusted friends/colleagues who are able to catch your stupid-sounding ideas that are actually stupid. That's beyond the scope of this post, but I'll note that, because almost no one attempts to hone their intuition for this kind of thing, it's very easy to get relatively good at it by just trying to do it at all.

Appendix: stories from other people

A disproportionate fraction of people whose work I really respect operate in a similar way to me with respect to looking stupid and also have a lot of stories about looking stupid.

One example from Laurence Tratt is from when he was job searching:

I remember being rejected from a job at my current employer because a senior person who knew me told other people that I was "too stupid". For a long time, I found this bemusing (I thought I must be missing out on some deep insights), but eventually I found it highly amusing, to the point I enjoy playing with it.

Another example: the other day, when I was talking to Gary Bernhardt, he told me a story about a time when he was chatting with someone who specialized in microservices on Kubernetes for startups and Gary said that he thought that most small (by transaction volume) startups could get away with being on a managed platform like Heroku or Google App Engine. The more Gary explained about his opinion, the more sure the person was that Gary was stupid.

Appendix: context

There are a lot of contexts that I'm not exposed to where it may be much more effective to train yourself to avoid looking stupid or incompetent, e.g., see this story by Ali Partovi about how his honesty led to Paul Graham's company being acquired by Yahoo instead of his own, which eventually led to Paul Graham founding YC and becoming one of the most well-known and influential people in the valley. If you're in a context where it's more important to look competent than to be competent then this post doesn't apply to you. Personally, I've tried to avoid such contexts, although they're probably more lucrative than the contexts I operate in.

Appendix: how to not care about looking stupid

This post has discussed what to do but not how to do it. Unfortunately, "how" is idiosyncratic and will vary greatly by person, so general advice here won't be effective. For better or for worse, this one came easily to me, as I genuinely felt that I was fairly stupid during my formative years, so the idea that some random person thinks I'm stupid is like water off a duck's back.

It's hard to say why anyone feels a certain way about anything, but I'm going to guess that, for me, it was a combination of two things. First, my childhood friends were all a lot smarter than me. In the abstract, I knew that there were other kids out there who weren't obviously smarter than me but, weighted by interactions, most of my interactions were with my friends, which influenced how I felt more than reasoning about the distribution of people that were out there. Second, I grew up in a fairly abusive household and one of the minor things that went along with the abuse was regularly being yelled at, sometimes for hours on end, for being so shamefully, embarrassingly, stupid (I was in the same class as this kid and my father was deeply ashamed that I didn't measure up).

I wouldn't exactly recommend this path, but it seems to have worked out ok.

Thanks to Ben Kuhn, Laurence Tratt, Jeshua Smith, Niels Olson, Justin Blank, Tao L., Colby Russell, Anja Boskovic, David Coletta, @conservatif, and Ahmad Jarara for comments/corrections/discussion.


  1. This happens in a way that I notice something like once a week and it seems like it must happen much more frequently in ways that I don't notice. [return]
  2. A semi-recent example of this from my life is when I wanted to understand why wider tires have better grip. A naive reason one might think this is true is that wider tire = larger contact patch = more friction, and a lot of people seem to believe the naive reason. A reason the naive reason is wrong is because, as long as the tire is inflated semi-reasonably, given a fixed vehicle weight and tire pressure, the total size of the tire's contact patch won't change when tire width is changed. Another naive reason that the original naive reason is wrong is that, at a "spherical cow" level of detail, the level of grip is unrelated to the contact patch size.

    Most people I talked to who don't race cars (e.g., autocross, drag racing, etc.), as well as the top search results online, used the refutation of the naive reason plus an incorrect application of high school physics to incorrectly conclude that varying tire width has no effect on grip.

    But there is an effect and the reason is subtler than more width = larger contact patch.
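
    As a minimal sketch of the high-school-physics version of the argument (the specific numbers below are made up for illustration, not measurements of any real tire): with a fixed corner load and inflation pressure, the contact patch area is roughly load divided by pressure, and the simple F = μN friction model has no area term at all, which is why this level of analysis predicts that width changes nothing.

        # Minimal sketch of the "spherical cow" reasoning above; the numbers are
        # made up for illustration.
        def contact_patch_area_m2(corner_load_n, tire_pressure_pa):
            # For a reasonably inflated tire, patch area ~= load / pressure,
            # independent of tire width.
            return corner_load_n / tire_pressure_pa

        def naive_friction_limit_n(corner_load_n, mu=1.0):
            # High school friction model: F = mu * N, with no area term at all.
            return mu * corner_load_n

        load = 4000.0          # N, roughly one corner of a small car (illustrative)
        pressure = 220_000.0   # Pa, ~32 psi (illustrative)
        print(contact_patch_area_m2(load, pressure))  # ~0.018 m^2, whatever the width
        print(naive_friction_limit_n(load))           # unchanged by width in this model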

    [return]
  3. I was arguably #1 in the world one season, when I put up a statistically dominant performance and my team won every game I played even though I disproportionately played in games against other top teams (and we weren't undefeated and other top players on the team played in games we lost). [return]

What to learn

2021-10-18 08:00:00

It's common to see people advocate for learning skills that they have or using processes that they use. For example, Steve Yegge has a set of blog posts where he recommends reading compiler books and learning about compilers. His reasoning is basically that, if you understand compilers, you'll see compiler problems everywhere and will recognize all of the cases where people are solving a compiler problem without using compiler knowledge. Instead of hacking together some half-baked solution that will never work, you can apply a bit of computer science knowledge to solve the problem in a better way with less effort. That's not untrue, but it's also not a reason to study compilers in particular because you can say that about many different areas of computer science and math. Queuing theory, computer architecture, mathematical optimization, operations research, etc.

One response to that kind of objection is to say that one should study everything. While being an extremely broad generalist can work, it's gotten much harder to "know a bit of everything" and be effective because there's more of everything over time (in terms of both breadth and depth). And even if that weren't the case, I think saying “should” is too strong; whether or not someone enjoys having that kind of breadth is a matter of taste. Another approach that can also work, one that's more to my taste, is to, as Gian Carlo Rota put it, learn a few tricks:

A long time ago an older and well known number theorist made some disparaging remarks about Paul Erdos' work. You admire contributions to mathematics as much as I do, and I felt annoyed when the older mathematician flatly and definitively stated that all of Erdos' work could be reduced to a few tricks which Erdos repeatedly relied on in his proofs. What the number theorist did not realize is that other mathematicians, even the very best, also rely on a few tricks which they use over and over. Take Hilbert. The second volume of Hilbert's collected papers contains Hilbert's papers in invariant theory. I have made a point of reading some of these papers with care. It is sad to note that some of Hilbert's beautiful results have been completely forgotten. But on reading the proofs of Hilbert's striking and deep theorems in invariant theory, it was surprising to verify that Hilbert's proofs relied on the same few tricks. Even Hilbert had only a few tricks!

If you look at how people succeed in various fields, you'll see that this is a common approach. For example, this analysis of world-class judo players found that most rely on a small handful of throws, concluding1

Judo is a game of specialization. You have to use the skills that work best for you. You have to stick to what works and practice your skills until they become automatic responses.

If you watch an anime or a TV series "about" fighting, people often improve by increasing the number of techniques they know because that's an easy thing to depict but, in real life, getting better at techniques you already know is often more effective than having a portfolio of hundreds of "moves".

Relatedly, Joy Ebertz says:

One piece of advice I got at some point was to amplify my strengths. All of us have strengths and weaknesses and we spend a lot of time talking about ‘areas of improvement.’ It can be easy to feel like the best way to advance is to eliminate all of those. However, it can require a lot of work and energy to barely move the needle if it’s truly an area we’re weak in. Obviously, you still want to make sure you don’t have any truly bad areas, but assuming you’ve gotten that, instead focus on amplifying your strengths. How can you turn something you’re good at into your superpower?

I've personally found this to be true in a variety of disciplines. While it's really difficult to measure programmer effectiveness in anything resembling an objective manner, this isn't true of some things I've done, like competitive video games (a very long time ago at this point, back before there was "real" money in competitive gaming). There, the thing that took me from being a pretty decent player to a very good player was abandoning practicing things I wasn't particularly good at and focusing on increasing the edge I had over everybody else at the few things I was unusually good at.

This can work for games and sports because you can get better at maneuvering yourself into positions that take advantage of your strengths as well as at avoiding situations that expose your weaknesses. I think this is actually more effective at work than it is in sports or gaming since, unlike in competitive endeavors, you don't have an opponent who will try to expose your weaknesses and force you into positions where your strengths are irrelevant. If I study queuing theory instead of compilers, a rival co-worker isn't going to stop me from working on projects where queuing theory knowledge is helpful and leave me facing a field full of projects that require compiler knowledge.

One thing that's worth noting is that skills don't have to be things people would consider fields of study or discrete techniques. For the past three years, the main skill I've been applying and improving is something you might call "looking at data"; the term is in quotes because I don't know of a good term for it. I don't think it's what most people would think of as "statistics", in that I don't often need to do anything as sophisticated as logistic regression, let alone actually sophisticated. Perhaps one could argue that this is something data scientists do, but if I look at what I do vs. what data scientists we hire do as well as what we screen for in data scientist interviews, we don't appear to want to hire data scientists with the skill I've been working on nor do they do what I'm doing (this is a long enough topic that I might turn it into its own post at some point).

Unlike Matt Might or Steve Yegge, I'm not going to say that you should take a particular approach, but I'll say that working on a few things and not being particularly well rounded has worked for me in multiple disparate fields and it appears to work for a lot of other folks as well.

If you want to take this approach, this still leaves the question of what skills to learn. This is one of the most common questions I get asked and I think my answer is probably not really what people are looking for and not very satisfying since it's both obvious and difficult to put into practice.

For me, two ingredients for figuring out what to spend time learning are having a relative aptitude for something (relative to other things I might do, not relative to other people) and also having a good environment in which to learn. To say that someone should look for those things is so vague that it's nearly useless, but it's still better than the usual advice, which boils down to "learn what I learned", which results in advice like "Career pro tip: if you want to get good, REALLY good, at designing complex and stateful distributed systems at scale in real-world environments, learn functional programming. It is an almost perfectly identical skillset." or the even more extreme claims from some language communities, like Chuck Moore's claim that Forth is at least 100x as productive as boring languages.

I took generic internet advice early in my career, including language advice (this was when much of this kind of advice was relatively young and it was not yet possible to easily observe that, despite many people taking advice like this, people who took this kind of advice were not particularly effective and people who were particularly effective were not likely to have taken this kind of advice). I learned Haskell, Lisp, Forth, etc. At one point in my career, I was on a two person team that implemented what might still be, a decade later, the highest performance Forth processor in existence (it was a 2GHz IPC-oriented processor) and I programmed it as well (there were good reasons for this to be a stack processor, so Forth seemed like as good a choice as any). Like Yossi Kreinin, I think I can say that I spent more effort than most people have becoming proficient in Forth and, like him, not only did I not find it to be a 100x productivity tool, it wasn't clear that it would, in general, even be 1x on productivity. To be fair, a number of other tools did better than 1x on productivity but, overall, I think following internet advice was very low ROI and the things that I learned that were high ROI weren't things people were recommending.

In retrospect, when people said things like "Forth is very productive", what I suspect they really meant was "Forth makes me very productive and I have not considered how well this generalizes to people with different aptitudes or who are operating in different contexts". I find it totally plausible that Forth (or Lisp or Haskell or any other tool or technique) does work very well for some particular people, but I think that people tend to overestimate how much something working for them means that it works for other people, making advice generally useless because it doesn't distinguish between advice that's aptitude or circumstance specific and generalizable advice, which is in stark contrast to fields where people actually discuss the pros and cons of particular techniques2.

While a coach can give you advice that's tailored to you 1 on 1 or in small groups, that's difficult to do on the internet, which is why the best I can do here is the uselessly vague "pick up skills that are suitable for you". Just for example, two skills that clicked for me are "having an adversarial mindset" and "looking at data". A perhaps less useless piece of advice is that, if you're having a hard time identifying what those might be, you can ask people who know you very well, e.g., my manager and Ben Kuhn independently named coming up with solutions that span many levels of abstraction as a skill of mine that I frequently apply (and I didn't realize I was doing that until they pointed it out).

Another way to find these is to look for things you can't help but do that most other people don't seem to do, which is true for me of both "looking at data" and "having an adversarial mindset". Just for example, on having an adversarial mindset, when a company I was working for was beta testing a new custom bug tracker, I filed some of the first bugs on it and put unusual things into the fields to see if it would break. Some people really didn't understand why anyone would do such a thing and were baffled, disgusted, or horrified, but a few people (including the authors, who I knew wouldn't mind) really got it and were happy to see the system pushed past its limits. Poking at the limits of a system to see where it falls apart doesn't feel like work to me; it's something that I'd have to stop myself from doing if I wanted to not do it, which made spending a decade getting better at testing and verification techniques feel like something that was hard not to do rather than like work. Looking deeply into data is one I've spent more than a decade on at this point and it's another one that, to me, emotionally feels almost wrong to not improve at.

That these things are suited to me is basically due to my personality, and not something inherent about human beings. Other people are going to have different things that really feel easy/right for them, which is great, since if everyone was into looking at data and no one was into building things, that would be very problematic (although, IMO, looking at data is, on average, underrated).

The other major ingredient in what I've tried to learn is finding environments that are conducive to learning things that line up with my skills and make sense for me. Although suggesting that other people do the same sounds like advice that's so obvious that it's useless, based on how I've seen people select which team and company to work for, I think that almost nobody does this and, as a result, discussing this may not be completely useless.

An example of not doing this, which typifies what I usually see, is a case I happened to find out about when I chatted with a manager about why their team had lost their new full-time intern conversion employee. I asked them about it because it was unusual for that manager to lose anyone; they're very good at retaining people and have low turnover on their teams. It turned out that their intern had wanted to work on infra, but had joined this manager's product team because they didn't know that they could ask to be on a team that matched their preferences. After the manager found out, the manager wanted the intern to be happy and facilitated a transfer to an infra team. In this case, it was a double whammy, since the new hire failed in two ways to consider working in an environment conducive to learning the skills they wanted: they made no attempt to work in the area they were interested in, and they joined a company that has a dysfunctional infra org with generally poor design and operational practices, making the company a relatively difficult place to learn about infra on top of not even trying to land on an infra team. While that's an unusually bad example, in the median case that I've seen, people don't make decisions that result in particularly good outcomes with respect to learning even though good opportunities to learn are one of the top things people say that they want.

For example, Steve Yegge has noted:

The most frequently-asked question from college candidates is: "what kind of training and/or mentoring do you offer?" ... One UW interviewee just told me about Ford Motor Company's mentoring program, which Ford had apparently used as part of the sales pitch they do for interviewees. [I've elided the details, as they weren't really relevant. -stevey 3/1/2006] The student had absorbed it all in amazing detail. That doesn't really surprise me, because it's one of the things candidates care about most.

For myself, I was lucky that my first job, Centaur, was a great place to develop an adversarial mindset with respect to testing and verification. What the verification team there accomplished is comparable to peer projects at other companies that employed much larger teams to do very similar things with similar or worse effectiveness, implying that the team was highly productive, which made it a really good place to learn.

Moreover, I don't think I could've learned as quickly on my own or by trying to follow advice from books or the internet. I think that people who are really good at something have too many bits of information in their head about how to do it for that information to really be compressible into a book, let alone a blog post. In sports, good coaches are able to convey that kind of information over time, but I don't know of anything similar for programming, so I think the best thing available for learning rate is to find an environment that's full of experts3.

For "looking at data", while I got a lot better at it from working on that skill in environments where people weren't really taking data seriously, the rate of improvement during the past few years, where I'm in an environment where I can toss ideas back and forth with people who are very good at understanding the limitations of what data can tell you as well as good at informing data analysis with deep domain knowledge, has been much higher. I'd say that I improved more at this in each individual year at my current job than I did in the decade prior to my current job.

One thing to perhaps note is that the environment, how you spend your day-to-day, is inherently local. My current employer is probably the least data driven of the three large tech companies I've worked for, but my vicinity is a great place to get better at looking at data because I spend a relatively large fraction of my time working with people who are great with data, like Rebecca Isaacs, and a relatively small fraction of the time working with people who don't take data seriously.

This post has discussed some strategies with an eye towards why they can be valuable, but I have to admit that my motivation for learning from experts wasn’t to create value. It's more that I find learning to be fun and there are some areas where I'm motivated enough to apply the skills regardless of the environment, and learning from experts is such a great opportunity to have fun that it's hard to resist. Doing this for a couple of decades has turned out to be useful, but that's not something I knew would happen for quite a while (and I had no idea that this would effectively transfer to a new industry until I changed from hardware to software).

A lot of career advice I see is oriented towards career or success or growth. That kind of advice often tells people to have a long-term goal or strategy in mind. It will often have some argument that's along the lines of "a random walk will only move you sqrt(n) in some direction whereas a directed walk will move you n in some direction". I don't think that's wrong, but I think that, for many people, that advice implicitly underestimates the difficulty of finding an area that's suited to you4, which I've basically done by trial and error.
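
As a minimal sketch of the sqrt(n) claim in that argument (my own illustration; the simulation and its parameters aren't from the advice being paraphrased), a quick simulation shows the gap between a directed walk and a random one:

    import random

    # After n unit steps, a directed walk is distance n from the start, while a
    # 1-D random walk is only ~sqrt(n) away on average (root-mean-square).
    def rms_random_walk_distance(n_steps, trials=10_000):
        total_sq = 0
        for _ in range(trials):
            position = sum(random.choice((-1, 1)) for _ in range(n_steps))
            total_sq += position * position
        return (total_sq / trials) ** 0.5

    n = 100
    print("directed walk distance:", n)                                        # 100
    print("random walk RMS distance:", round(rms_random_walk_distance(n), 1))  # ~10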

Appendix: parts of the problem this post doesn't discuss in detail

One major topic not discussed is how to balance what "level" of skill to work on, which could range from something high level, like "looking at data", to something lower level, like "Bayesian multilevel models", to something even lower level, like "typing speed". That's a large enough topic that it deserves its own post that I'd expect to be longer than this one but, for now, here's a comment from Gary Bernhardt about something related that I believe also applies to this topic.

Another major topic that's not discussed here is picking skills that are relatively likely to be applicable. It's a little too naive to just say that someone should think about learning skills they have an aptitude for without thinking about applicability.

But while it's pretty easy to pick out skills where it's very difficult to either have an impact on the world or make a decent amount of money or achieve whatever goal you might want to achieve, like "basketball" or "boxing", it's harder to pick between plausible skills, like computer architecture vs. PL.

But I think semi-reasonable sounding skills are likely enough to be high return, if they're a good fit for someone, that trial and error among semi-reasonable sounding skills is fine, although it probably helps to be able to try things out quickly.

Thanks to Ben Kuhn, Alexey Guzey, Marek Majkowski, Nick Bergson-Shilcock, @bekindtopeople2, Aaron Levin, Milosz Danczak, Anja Boskovic, John Doty, Justin Blank, Mark Hansen, "wl", and Jamie Brandon for comments/corrections/discussion.


  1. This is an old analysis. If you were to do one today, you'd see a different mix of throws, but it's still the case that you see specialists having a lot of success, e.g., Riner with osoto gari [return]
  2. To be fair to blanket, context-free advice to learn a particular topic, functional programming really clicked for me and I could imagine that, if that style of thinking wasn't already natural for me (as a result of coming from a hardware background), the advice that one should learn functional programming because it will change how you think about problems might've been useful for me; but, on the other hand, that means that the advice could've just as easily been to learn hardware engineering. [return]
  3. I don't have a large enough sample nor have I polled enough people to have high confidence that this works as a general algorithm but, for finding groups of world-class experts, what's worked for me is finding excellent managers. The two teams I worked on with the highest density of world-class experts have been teams under really great management. I have a higher bar for excellent management than most people and, from having talked to many people about this, almost no one I've talked to has worked for or even knows a manager as good as one I would consider to be excellent (and, in general, the person I'm talking to agrees with me on this, indicating that it's not the case that they have a manager who's excellent in dimensions I don't care about and vice versa); from discussions about this, I would guess that a manager I think of as excellent is at least 99.9%-ile. How to find such a manager is a long discussion that I might turn into another post.

    Anyway, despite having a pretty small sample on this, I think the mechanism for this is plausible, in that the excellent managers I know have very high retention as well as a huge queue of people who want to work for them, making it relatively easy for them to hire and retain people with world-class expertise since the rest of the landscape is so bleak.

    A more typical strategy, one that I don't think generally works and also didn't work great for me when I tried it, is to work on the most interesting sounding and/or hardest problems around. While I did work with some really great people while trying to work on interesting / hard problems, including one of the best engineers I've ever worked with, I don't think that worked nearly as well as looking for good management w.r.t. working with people I really want to learn from. I believe the general problem with this algorithm is the same problem as going to work in video games because video games are cool and/or interesting. The fact that so many people want to work on exciting sounding problems leads to dysfunctional environments that can persist indefinitely.

    In one case, I was on a team that had 100% turnover in nine months and it would've been six if it hadn't taken so long for one person to find a team to transfer to. In the median case, my cohort (people who joined around when I joined, ish) had about 50% YoY turnover and I think that people had pretty good reasons for leaving. Not only is this kind of turnover a sign that the environment is often a pretty unhappy one, these kinds of environments often differentially cause people who I'd want to work with and/or learn from to leave. For example, on the team I was on where the TL didn't believe in using version control, automated testing, or pipelined designs, I worked with Ikhwan Lee, who was great. Of course, Ikhwan left pretty quickly while the TL stayed and is still there six years later.

    [return]
  4. Something I've seen many times among my acquaintances is that people will pick a direction before they have any idea whether or not it's suitable for them. Often, after quite some time (more than a decade in some cases), they'll realize that they're actually deeply unhappy with the direction they've gone, sometimes because it doesn't match their temperament, and sometimes because it's something they're actually bad at. In any case, wandering around randomly and finding yourself sqrt(n) down a path you're happy with doesn't seem so bad compared to having made it n down a path you're unhappy with. [return]

Some reasons to work on productivity and velocity

2021-10-15 08:00:00

A common topic of discussion among my close friends is where the bottlenecks are in our productivity and how we can execute more quickly. This is very different from what I see in my extended social circles, where people commonly say that velocity doesn't matter. In online discussions about this, I frequently see people go a step further and assign moral valence to this, saying that it is actually bad to try to increase velocity or be more productive or work hard (see appendix for more examples).

The top reasons I see people say that productivity doesn't matter (or is actually bad) fall into one of three buckets: that working on the right thing matters much more than execution speed, that speed at any particular task doesn't matter because time spent on that task is limited, and that working on productivity at all is morally suspect and the time would be better spent on other things, like leisure.

I certainly agree that working on the right thing is important, but increasing velocity doesn't stop you from working on the right thing. If anything, each of these is a force multiplier for the other. Having strong execution skills becomes more impactful if you're good at picking the right problem and vice versa.

It's true that the gains from picking the right problem can be greater than the gains from having better tactical execution because the gains from picking the right problem can be unbounded, but it's also much easier to improve tactical execution and doing so also helps with picking the right problem because having faster execution lets you experiment more quickly, which helps you find the right problem.

A concrete example of this is a project I worked on to quantify the machine health of the fleet. The project discovered a number of serious issues (a decent fraction of hosts were actively corrupting data or had a performance problem that would increase tail latency by > 2 orders of magnitude, or both). This was considered serious enough that a new team was created to deal with the problem.

In retrospect, my first attempts at quantifying the problem were doomed and couldn't have really worked (or not in a reasonable amount of time, anyway). I spent a few weeks cranking through ideas that couldn't work and a critical part of getting to the idea that did work after "only" a few weeks was being able to quickly try out and discard ideas that didn't work. In part of a previous post, I described how long a tiny part of that process took and multiple people objected to that being impossibly fast in internet comments.

I find this a bit funny since I'm not a naturally quick programmer. Learning to program was a real struggle for me and I was pretty slow at it for a long time (and I still am in aspects that I haven't practiced). My "one weird trick" is that I've explicitly worked on speeding up things that I do frequently and most people have not. I view the situation as somewhat analogous to sports before people really trained. For a long time, many athletes didn't seriously train, and then once people started trying to train, the training was often misguided by modern standards. For example, if you read commentary on baseball from the 70s, you'll see people saying that baseball players shouldn't weight train because it will make them "muscle bound" (many people thought that weight lifting would lead to "too much" bulk, causing people to be slower, have less explosive power, and be less agile). But today, players get a huge advantage from using performance-enhancing drugs that increase their muscle-bound-ness, which implies that players could not get too "muscle bound" from weight training alone. An analogous comment to one discussed above would be saying that athletes shouldn't worry about power/strength and should increase their skill, but power increases returns to skill and vice versa.

Coming back to programming, if you explicitly practice and train and almost no one else does, you'll be able to do things relatively quickly compared to most people even if, like me, you don't have much talent for programming and getting started at all was a real struggle. Of course, there's always going to be someone more talented out there who's executing faster after having spent less time improving. But, luckily for me, relatively few people seriously attempt to improve, so I'm able to do ok.

Anyway, despite operating at a rate that some internet commenters thought was impossible, it took me weeks of dead ends to find something that worked. If I was doing things at a speed that people thought was normal, I suspect it would've taken long enough to find a feasible solution that I would've dropped the problem after spending maybe one or two quarters on it. The number of plausible-ish seeming dead ends was probably not unrelated to why the problem was still an open problem despite being a critical issue for years. Of course, someone who's better at having ideas than me could've solved the problem without the dead ends, but as we discussed earlier, it's fairly easy to find low hanging fruit on "execution speed" and not so easy to find low hanging fruit on "having better ideas". However, it's possible to, to a limited extent, simulate someone who has better ideas than me by being able to quickly try out and discard ideas (I also work on having better ideas, but I think it makes sense to go after the easier high ROI wins that are available as well). Being able to try out ideas quickly also improves the rate at which I can improve at having better ideas since a key part of that is building intuition by getting feedback on what works.

The next major objection is that speed at a particular task doesn't matter because time spent on that task is limited. At a high level, I don't agree with this objection because, while this may hold true for any particular kind of task, the solution to that is to try to improve each kind of task and not to reject the idea of improvement outright. A sub-objection people have is something like "but I spend 20 hours in unproductive meetings every week, so it doesn't matter what I do with my other time". I think this is doubly wrong, in that if you then only have 20 hours of potentially productive time, whatever productivity multiplier you have on that time still holds for your general productivity. Also, it's generally possible to drop out of meetings that are a lost cause and increase the productivity of meetings that aren't a lost cause1.

More generally, when people say that optimizing X doesn't help because they don't spend time on X and are not bottlenecked on X, that doesn't match my experience as I find I spend plenty of time bottlenecked on X for commonly dismissed Xs. I think that part of this is because getting faster at X can actually increase time spent on X due to a sort of virtuous cycle feedback loop of where it makes sense to spend time. Another part of this is illustrated in this comment by Fabian Giesen:

It is commonly accepted, verging on a cliche, that you have no idea where your program spends time until you actually profile it, but the corollary that you also don't know where you spend your time until you've measured it is not nearly as accepted.

When I've looked at how people actually spend time vs. how people think they spend time, the estimates are wildly inaccurate and I think there's a fundamental reason that, unless they measure, people's estimates of how they spend their time tend to be way off, which is nicely summed up by another Fabian Giesen quote, which happens to be about solving Rubik's cubes but applies to other cognitive tasks:

Paraphrasing a well-known cuber, "your own pauses never seem bad while you're solving, because your brain is busy and you know what you're thinking about, but once you have a video it tends to become blindingly obvious what you need to improve". Which is pretty much the usual "don't assume, profile" advice for programs, but applied to a situation where you're concentrated and busy for the entire time, whereas the default assumption in programming circles seems to be that as long as you're actually doing work and not distracted or slacking off, you can't possibly be losing a lot of time

Unlike most people who discuss this topic online, I've actually looked at where my time goes and a lot of it goes to things that are canonical examples of things that you shouldn't waste time improving because people don't spend much time doing them.

An example of one of these, the most commonly cited bad-thing-to-optimize example that I've seen, is typing speed (when discussing this, people usually say that typing speed doesn't matter because more time is spent thinking than typing). But, when I look at where my time goes, a lot of it is spent typing.

A specific example is that I've written a number of influential docs at my current job and when people ask how long some doc took to write, they're generally surprised that the doc only took a day to write. As with the machine health example, a thing that velocity helps with is figuring out which docs will be influential. If I look at the docs I've written, I'd say that maybe 15% were really high impact (caused a new team to be created, changed the direction of existing teams, resulted in significant changes to the company's bottom line, etc.). Part of it is that I don't always know which ideas will resonate with other people, but part of it is also that I often propose ideas that are long shots because the ideas sound too stupid to be taken seriously (e.g., one of my proposed solutions to a capacity crunch was to, for each rack, turn off 10% of it, thereby increasing effective provisioned capacity, which is about as stupid sounding an idea as one could come up with). If I was much slower at writing docs, it wouldn't make sense to propose real long shot ideas. As things are today, if I think an idea has a 5% chance of success, in expectation, I need to spend ~20 days writing docs to have one of those land.

I spend roughly half of my writing time typing. If I typed at what some people say median typing speed is (40 WPM) instead of the rate some random typing test clocked me at (110 WPM), this would be a 0.5 + 0.5 * 110/40 = 1.875x slowdown, putting me at nearly 40 days of writing before a longshot doc lands, which would make that a sketchier proposition. If I hadn't optimized the non-typing part of my writing workflow as well, I think I would be, on net, maybe 10x slower2, which would put me at more like ~200 days per high impact longshot doc, which is enough that I think that I probably wouldn't write longshot docs3.
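
Here's that arithmetic as a tiny sketch (the WPM figures, the 50/50 typing split, and the ~20-day figure are the ones from the paragraphs above; the helper function is just my illustration):

    # If a fraction of total writing time is typing, slowing down typing only
    # scales that fraction of the total.
    def writing_slowdown(fast_wpm, slow_wpm, typing_fraction=0.5):
        return (1 - typing_fraction) + typing_fraction * (fast_wpm / slow_wpm)

    factor = writing_slowdown(fast_wpm=110, slow_wpm=40)  # 0.5 + 0.5 * 2.75 = 1.875
    days_per_landed_longshot = 20 * factor                # ~37.5, i.e. nearly 40 days
    print(factor, days_per_landed_longshot)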

More generally, Fabian Giesen has noted that this kind of non-linear impact of velocity is common:

There are "phase changes" as you cross certain thresholds (details depend on the problem to some extent) where your entire way of working changes. ... There's a lot of things I could in theory do at any speed but in practice cannot, because as iteration time increases it first becomes so frustrating that I can't do it for long and eventually it takes so long that it literally drops out of my short-term memory, so I need to keep notes or otherwise organize it or I can't do it at all.

Certainly if I can do an experiment in an interactive UI by dragging on a slider and see the result in a fraction of a second, at that point it's very "no filter", if you want to try something you just do it.

Once you're at iteration times in the low seconds (say a compile-link cycle with a statically compiled lang) you don't just try stuff anymore, you also spend time thinking about whether it's gonna tell you anything because it takes long enough that you'd rather not waste a run.

Once you get into several-minute or multi-hour iteration times there's a lot of planning to not waste runs, and context switching because you do other stuff while you wait, and note-taking/bookkeeping; also at this level mistakes are both more expensive (because a wasted run wastes more time) and more common (because your attention is so divided).

As you scale that up even more you might now take significant resources for a noticeable amount of time and need to get that approved and budgeted, which takes its own meetings etc.

A specific example of something moving from one class of item to another in my work was this project on metrics analytics. There were a number of proposals on how to solve this problem. There was broad agreement that the problem was important with no dissenters, but the proposals were all the kinds of things you'd allocate a team to work on through multiple roadmap cycles. Getting a project that expensive off the ground requires a large amount of organizational buy-in, enough that many important problems don't get solved, including this one. But it turned out, if scoped properly and executed reasonably, the project was actually something a programmer could create an MVP of in a day, which takes no organizational buy-in to get off the ground. Instead of needing to get multiple directors and a VP to agree that the problem is among the org's most important problems, you just need a person who thinks the problem is worth solving.

Going back to Xs where people say velocity doesn't matter because they don't spend a lot of time on X, another one I see frequently is coding, and it is also not my personal experience that coding speed doesn't matter. For the machine health example discussed above, after I figured out something that would work, I spent one month working on basically nothing but that, coding, testing, and debugging. I think I had about 6 hours of meetings during that month, but other than that plus time spent eating, etc., I would go in to work, code all day, and then go home. I think it's much more difficult to compare coding speed across people because it's rare to see people do the same or very similar non-trivial tasks, so I won't try to compare to anyone else, but if I look at my productivity before I worked on improving it as compared to where I'm at now, the project probably would have been infeasible without the speedups I've found by looking at my velocity.

Amdahl's law based arguments can make sense when looking for speedups in a fixed benchmark, like a sub-task of SPECint, but when you have a system where getting better at a task increases returns to doing that task and can increase time spent on the task, it doesn't make sense to say that you shouldn't work on improving something just because you don't currently spend a lot of time doing it. I spend time on things that are high ROI, but those things are generally only high ROI because I've spent time improving my velocity, which reduces the "I" in ROI.

The last major argument I see against working on velocity assigns negative moral weight to the idea of thinking about productivity and working on velocity at all. This kind of comment often assigns positive moral weight to various kinds of leisure, such as spending time with friends and family. I find this argument to be backwards. If someone thinks it's important to spend time with friends and family, an easy way to do that is to be more productive at work and spend less time working.

Personally, I deliberately avoid working long hours and I suspect I don't work more than the median person at my company, which is a company where I think work-life balance is pretty good overall. A lot of my productivity gains have gone to leisure and not work. Furthermore, deliberately working on velocity has allowed me to get promoted relatively quickly4, which means that I make more money than I would've made if I didn't get promoted, which gives me more freedom to spend time on things that I value.

For people that aren't arguing that you shouldn't think about productivity because it's better to focus on leisure and instead argue that you simply shouldn't think about productivity at all because it's unnatural and one should live a natural life, that ultimately comes down to personal preference, but for me, I value the things I do outside of work too much to not explicitly work on productivity at work.

As with this post on reasons to measure, while this post is about practical reasons to improve productivity, the main reason I'm personally motivated to work on my own productivity isn't practical. The main reason is that I enjoy the process of getting better at things, whether that's some nerdy board game, a sport I have zero talent at that will never have any practical value to me, or work. For me, a secondary reason is that, given that my lifespan is finite, I want to allocate my time to things that I value, and increasing productivity allows me to do more of that, but that's not a thought I had until I was about 20, at which point I'd already been trying to improve at most things I spent significant time on for many years.

Another common reason for working on productivity is that mastery and/or generally being good at something seems satisfying for a lot of people. That's not one that resonates with me personally, but when I've asked other people about why they work on improving their skills, that seems to be a common motivation.

A related idea, one that Holden Karnofsky has been talking about for a while, is that if you ever want to make a difference in the world in some way, it's useful to work on your skills even in jobs where it's not obvious that being better at the job is useful, because the developed skills will give you more leverage on the world when you switch to something that's more aligned with what you want to achieve.

Appendix: one way to think about what to improve

Here's a framing I like from Gary Bernhardt (not set off in a quote block since this entire section, other than this sentence, is his).

People tend to fixate on a single granularity of analysis when talking about efficiency. E.g., "thinking is the most important part so don't worry about typing speed". If we step back, the response to that is "efficiency exists at every point on the continuum from year-by-year strategy all the way down to millisecond-by-millisecond keystrokes". I think it's safe to assume that gains at the larger scale will have the biggest impact. But as we go to finer granularity, it's not obvious where the ROI drops off. Some examples, moving from coarse to fine:

  1. The macro point that you started with is: programming isn't just thinking; it's thinking plus tactical activities like editing code. Editing faster means more time for thinking.
  2. But editing code costs more than just the time spent typing! Programming is highly dependent on short-term memory. Every pause to edit is a distraction where you can forget the details that you're juggling. Slower editing effectively weakens your short-term memory, which reduces effectiveness.
  3. But editing code isn't just hitting keys! It's hitting keys plus the editor commands that those keys invoke. A more efficient editor can dramatically increase effective code editing speed, even if you type at the same WPM as before.
  4. But each editor command doesn't exist in a vacuum! There are often many ways to make the same edit. A Vim beginner might type "hhhhxxxxxxxx" when "bdw" is more efficient. An advanced Vim user might use "bdw", not realizing that it's slower than "diw" despite having the same number of keystrokes. (In QWERTY keyboard layout, the former is all on the left hand, whereas the latter alternates left-right-left hands. At 140 WPM, you're typing around 14 keystrokes per second, so each finger only has 70 ms to get into position and press the key. Alternating hands leaves more time for the next finger to get into position while the previous finger is mid-keypress.)

We have to choose how deep to go when thinking about this. I think that there's clear ROI in thinking about 1-3, and in letting those inform both tool choice and practice. I don't think that (4) is worth a lot of thought. It seems like we naturally find "good enough" points there. But that also makes it a nice fence post to frame the others.

Appendix: more examples

etc.

Some positive examples of people who have used their productivity to "fund" things that they value include Andy Kelley (Zig), Jamie Brandon (various), Andy Matuschak (mnemonic medium, various), Saul Pwanson (VisiData), Andy Chu (Oil Shell). I'm drawing from programming examples, but you can find plenty of others, e.g., Nick Adnitt (Darkside Canoes) and, of course, numerous people who've retired to pursue interests that aren't work-like at all.

Appendix: another reason to avoid being productive

An idea that's become increasingly popular in my extended social circles at major tech companies is that one should avoid doing work and waste as much time as possible, often called "antiwork", which seems like a natural extension of "tryhard" becoming an insult. The reason given is often something like, work mainly enriches upper management at your employer and/or shareholders, who are generally richer than you.

I'm sympathetic to the argument and agree that upper management and shareholders capture most of the value from work. But as much as I sympathize with the idea of deliberately being unproductive to "stick it to the man", I value spending my time on things that I want enough that I'd rather get my work done quickly so I can do things I enjoy more than work. Additionally, having been productive in the past has given me good options for jobs, so I have work that I enjoy a lot more than my acquaintances in tech who have embraced the "antiwork" movement.

The less control you have over your environment, the more it makes sense to embrace "antiwork". Programmers at major tech companies have, relatively speaking, a lot of control over their environment, which is why I'm not "antiwork" even though I'm sympathetic to the cause.

Although it's about a different topic, there's a related comment from Prachee Avasthi about how avoiding controversial work and avoiding pushing for necessary changes when pre-tenure ingrains habits that are hard to break post-tenure. If one wants to be "antiwork" forever, that's not a problem, but if one wants to move the needle on something at some point, building "antiwork" habits while working for a major tech company will instill counterproductive habits.

Thanks to Fabian Giesen, Gary Bernhardt, Ben Kuhn, David Turner, Marek Majkowski, Anja Boskovic, Aaron Levin, Lifan Zeng, Justin Blank, Heath Borders, Tao L., Nehal Patel, @[email protected], and Jamie Brandon for comments/corrections/discussion


  1. When I look at the productiveness of meetings, there are some people who are very good at keeping meetings on track and useful. For example, one person who I've been in meetings with who is extraordinarily good at ensuring meetings are productive is Bonnie Eisenman. Early on in my current job, I asked her how she was so effective at keeping meetings productive and have been using that advice since then (I'm not nearly as good at it as she is, but even so, improving at this was a significant win for me). [return]
  2. 10x might sound like an implausibly large speedup on writing, but in a discussion on writing speed on a private slack, a well-known newsletter author mentioned that their net writing speed for a 5k word newsletter was a little under 2 words per minute (WPM). My net rate (including time spent editing, etc.) is over 20 WPM per doc.

    With a measured typing speed of 110 WPM, that might sound like I spend only a small fraction of my time typing, but it turns out to be roughly half. My actual writing speed is much slower than my typing-test speed, perhaps half the rate, and if I look at where the time actually goes, roughly half goes to typing and half goes to thinking, semi-serially, which creates long pauses in my typing.

    If I look at where the biggest win here could come, it would be from thinking and typing in parallel, which is something I'd try to achieve by practicing typing more, not less. But even without being able to do that, and with above average typing speed, I still spend half of my time typing!

    The reason my net speed is well under the speed that I write is that I do multiple passes and re-write. Some time is spent reading as I re-write, but I read much more quickly than I write, so that's a pretty small fraction of time. In principle, I could adopt an approach that involves less re-writing, but I've tried a number of things that one might expect would lead to that goal and haven't found one that works for me (yet?).

    Although the example here is about work, this also holds for my personal blog, where my velocity is similar. If I wrote ten times slower than I do, I don't think I'd have much of a blog. My guess is that I would've written a few posts, or maybe only a few drafts that never got to the point where I'd post them, and then stopped.

    I enjoy writing and get a lot of value out of it in a variety of ways, but I value the other things in my life enough that I don't think writing would have a place in my life if my net writing speed were 2 WPM.

    [return]
  3. Another strategy would be to write shorter docs. There's a style of doc where that works well, but I frequently write docs where I leverage my writing speed to discuss a problem that would be difficult to convincingly discuss without a long document.

    One example of a reason my docs end up long is that I frequently work on problems that span multiple levels of the stack, which means that I end up presenting data from multiple levels of the stack as well as providing enough context about why a problem at one level drives a problem up or down the stack for people who aren't deeply familiar with that level, which is necessary since few readers will have strong familiarity with every level needed to understand the problem.

    In most cases, there have been previous attempts to motivate/fund work on the problem that didn't get traction because there wasn't a case linking an issue at one level of the stack to important issues at other levels of the stack. I could avoid problems that span many levels of the stack, but there's a lot of low hanging fruit among those sorts of problems for technical and organizational reasons, so I don't think it makes sense to ignore them just because it takes a day to write a document explaining the problem (although it might make sense if it took ten days, at least in cases where people might be skeptical of the solution).

    [return]
  4. Of course, promotions are highly unfair and being more productive doesn't guarantee promotion. If I just look at what things are correlated with level, it's not even clear to me that productivity is more strongly correlated with level than height, but among factors that are under my control, productivity is one of the easiest to change. [return]

The value of in-house expertise

2021-09-29 08:00:00

An alternate title for this post might be, "Twitter has a kernel team!?". At this point, I've heard that surprised exclamation enough that I've lost count of the number of times it's been said to me (I'd guess it's more than ten but less than a hundred). If we look at trendy companies that are within a couple factors of two of Twitter's size (in terms of either market cap or number of engineers), they mostly don't have similar expertise, often as a result of path dependence: because they "grew up" in the cloud, they didn't need kernel expertise to keep the lights on the way an on-prem company does. While that makes it socially understandable that people who've spent their careers at younger, trendier companies are surprised by Twitter having a kernel team, I don't think there's a technical reason for the surprise.

Whether or not it has kernel expertise, a company Twitter's size is going to regularly run into kernel issues, from major production incidents to papercuts. Without a kernel team or the equivalent expertise, the company will muddle through the issues, running into unnecessary problems as well as taking an unnecessarily long time to mitigate incidents. As an example of a critical production incident, just because it's already been written up publicly, I'll cite this post, which dryly notes:

Earlier last year, we identified a firewall misconfiguration which accidentally dropped most network traffic. We expected resetting the firewall configuration to fix the issue, but resetting the firewall configuration exposed a kernel bug

What this implies but doesn't explicitly say is that this firewall misconfiguration was the most severe incident that's occurred during my time at Twitter and I believe it's actually the most severe outage that Twitter has had since 2013 or so. As a company, we would've still been able to mitigate the issue without a kernel team or another team with deep Linux expertise, but it would've taken longer to understand why the initial fix didn't work, which is the last thing you want when you're debugging a serious outage. Folks on the kernel team were already familiar with the various diagnostic tools and debugging techniques necessary to quickly understand why the initial fix didn't work, which is not common knowledge at some peer companies (I polled folks at a number of similar-scale peer companies to see if they thought they had at least one person with the knowledge necessary to quickly debug the bug, and the answer was no at many companies).

Another reason to have in-house expertise in various areas is that these teams easily pay for themselves, which is a special case of the generic argument that large companies should be larger than most people expect because tiny percentage gains are worth a large amount in absolute dollars. If, in the lifetime of a specialist team like the kernel team, a single person found something that persistently reduced TCO by 0.5%, that would pay for the team in perpetuity, and Twitter's kernel team has found many such changes. In addition to kernel patches that sometimes have that kind of impact, people will also find configuration issues, etc., that have that kind of impact.
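
To make the "pays for itself" arithmetic concrete, here's a sketch with invented numbers; the infrastructure spend and team cost below are hypothetical placeholders, not Twitter's actual figures.

```python
# Hypothetical illustration of "a persistent 0.5% TCO reduction pays for the team".
# Every number here is invented for the sketch; none are Twitter's actual figures.
annual_infra_spend = 300_000_000              # assumed $300M/year infrastructure TCO
annual_savings = annual_infra_spend * 0.005   # one persistent 0.5% win, recurring
team_cost = 5 * 250_000                       # assumed 5-person team at $250k fully loaded each
print(f"${annual_savings:,.0f}/year saved vs. ${team_cost:,.0f}/year of team cost")
```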

So far, I've only talked about the kernel team because that's the one that most frequently elicits surprise from folks for merely existing, but I get similar reactions when people find out that Twitter has a bunch of ex-Sun JVM folks who worked on HotSpot, like Ramki Ramakrishna, Tony Printezis, and John Coomes. People wonder why a social media company would need such deep JVM expertise. As with the kernel team, companies our size that use the JVM run into weird issues and JVM bugs and it's helpful to have people with deep expertise to debug those kinds of issues. And, as with the kernel team, individual optimizations to the JVM can pay for the team in perpetuity. A concrete example is this patch by Flavio Brasil, which virtualizes compare and swap calls.

The context for this is that Twitter uses a lot of Scala. Despite a lot of claims otherwise, Scala uses more memory and is significantly slower than Java, which has a significant cost if you use Scala at scale, enough that it makes sense to do optimization work to reduce the performance gap between idiomatic Scala and idiomatic Java.

Before the patch, if you profiled our Scala code, you would've seen an unreasonably large amount of time spent in Future/Promise, including in cases where you might naively expect that the compiler would optimize the work away. One reason for this is that Futures use a compare-and-swap (CAS) operation that's opaque to JVM optimization. The patch linked above avoids CAS operations when the Future doesn't escape the scope of the method. This companion patch removes CAS operations in some places that are less amenable to compiler optimization. The two patches combined reduced the cost of typical major Twitter services using idiomatic Scala by 5% to 15%, paying for the JVM team in perpetuity many times over and that wasn't even the biggest win Flavio found that year.
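
The general principle behind that kind of optimization is that synchronization costs real time even when it's uncontended, so eliding it for values that never escape a method or thread is pure win. As a very rough illustration of that principle (in Python rather than Scala/JVM, and not the actual patch), you can time an uncontended lock in a hot loop:

```python
# Rough illustration (not the actual JVM/Scala patch): even uncontended
# synchronization has a per-operation cost, which is why eliding it for
# values that never escape a method/thread is a win.
import threading
import timeit

lock = threading.Lock()
counter = 0

def bump_with_lock():
    global counter
    with lock:              # uncontended, but still pays acquire/release cost
        counter += 1

def bump_without_lock():
    global counter
    counter += 1            # what's possible when the value never escapes

n = 1_000_000
print("with lock:   ", timeit.timeit(bump_with_lock, number=n))
print("without lock:", timeit.timeit(bump_without_lock, number=n))
```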

I'm not going to do a team-by-team breakdown of teams that pay for themselves many times over because there are so many of them, even if I limit the scope to "teams that people are surprised that Twitter has".

A related topic is how people talk about "buy vs. build" discussions. I've seen a number of discussions where someone has argued for "buy" because that would obviate the need for expertise in the area. This can be true, but I've seen this argued for much more often than it is true. An example where I think this tends to be untrue is with distributed tracing. We've previously looked at some ways Twitter gets value out of tracing, which came out of the vision Rebecca Isaacs put into place. On the flip side, when I talk to people at peer companies with similar scale, most of them have not (yet?) succeeded at getting significant value from distributed tracing. This is so common that I see a viral Twitter thread about how useless distributed tracing is more than once a year. Even though we went with the more expensive "build" option, just off the top of my head, I can think of multiple uses of tracing that have returned between 10x and 100x the cost of building out tracing, whereas people at a number of companies that have chosen the cheaper "buy" option commonly complain that tracing isn't worth it.

Coincidentally, I was just talking about this exact topic to Pam Wolf, a civil engineering professor with experience in (civil engineering) industry on multiple continents, who had a related opinion. For large scale systems (projects), you need an in-house expert (owner's-side engineer) for each area that you don't handle in your own firm. While it's technically possible to hire yet another firm to be the expert, that's more expensive than developing or hiring in-house expertise and, in the long run, also more risky. That's pretty analogous to my experience working as an electrical engineer as well, where orgs that outsource functions to other companies without retaining an in-house expert pay a very high cost, and not just monetarily. They often ship sub-par designs with long delays on top of having high costs. "Buying" can and often does reduce the amount of expertise necessary, but it often doesn't remove the need for expertise.

This is related to another abstract argument that's commonly made: that companies should concentrate on "their area of comparative advantage" or "most important problems" or "core business need" and outsource everything else. We've already seen a couple of examples where this isn't true because, at a large enough scale, it's more profitable to have in-house expertise than not, regardless of whether or not something is core to the business (one could argue that all of the things that are moved in-house are core to the business, but that would make the concept of coreness useless). Another reason this abstract advice is too simplistic is that businesses can somewhat arbitrarily choose what their comparative advantage is. A large[1] example of this would be Apple bringing CPU design in-house. Since acquiring PA Semi (formerly the team from SiByte and, before that, a team from DEC) for $278M, Apple has been producing the best chips in the phone and laptop power envelope by a pretty large margin. But, before the purchase, there was nothing about Apple that made the purchase inevitable, that made CPU design an inherent comparative advantage of Apple. But if a firm can pick an area and make it an area of comparative advantage, saying that the firm should choose to concentrate on its comparative advantage(s) isn't very helpful advice.

$278M is a lot of money in absolute terms, but as a fraction of Apple's resources, that was tiny and much smaller companies also have the capability to do cutting edge work by devoting a small fraction of their resources to it, e.g., Twitter, for a cost that any $100M company could afford, created novel cache algorithms and data structures and is doing other cutting edge cache work. Having great cache infra isn't any more core to Twitter's business than creating a great CPU is to Apple's, but it is a lever that Twitter can use to make more money than it could otherwise.

For small companies, it doesn't make sense to have in-house experts for everything the company touches, but companies don't have to get all that large before it starts making sense to have in-house expertise in their operating system, language runtime, and other components that people often think of as being fairly specialized. Looking back at Twitter's history, Yao Yue has noted that when she was working on cache in Twitter's early days (when we had ~100 engineers), she would regularly go to the kernel team for help debugging production incidents and that, in some cases, debugging could've easily taken 10x longer without help from the kernel team. Social media companies tend to have relatively high scale on a per-user and per-dollar basis, so not every company is going to need the same kind of expertise when they have 100 engineers, but there are going to be other areas that aren't obviously core business needs where expertise will pay off even for a startup that has 100 engineers.

Thanks to Ben Kuhn, Yao Yue, Pam Wolf, John Hergenroeder, Julien Kirch, Tom Brearley, and Kevin Burke for comments/corrections/discussion.


  1. Some other large examples of this are Korean chaebols, like Hyundai. Looking at how Hyundai Group's companies are connected to Hyundai Motor Company isn't really the right lens with which to examine Hyundai, but I'm going to use that lens anyway since most readers of this blog are probably already familiar with Hyundai Motor and will not be familiar with how Korean chaebols operate.

    Speaking very roughly, with many exceptions, American companies have tended to take the advice to specialize and concentrate on their competencies, at least since the 80s. This is the opposite of the direction that Korean chaebols have gone. Hyundai not only makes cars, they make the steel their cars use, the robots they use to automate production, the cement used for their factories, the construction equipment used to build their factories, the containers and ships used to ship cars (which they also operate), the transmissions for their cars, etc.

    If we look at a particular component, say, their 8-speed transmission vs. the widely used and lauded ZF 8HP transmission, reviewers typically slightly prefer the ZF transmission. But even so, having good-enough in-house transmissions, as well as many other in-house components that companies would typically buy, doesn't exactly seem to be a disadvantage for Hyundai.

    [return]

Measurement, benchmarking, and data analysis are underrated

2021-08-27 08:00:00

A question I get asked with some frequency is: why bother measuring X, why not build something instead? More bluntly, in a recent conversation with a newsletter author, his comment on some future measurement projects I wanted to do (in the same vein as other projects like keyboard vs. mouse, keyboard, terminal and end-to-end latency measurements) was, "so you just want to get to the top of Hacker News?" The implication for the former is that measuring is less valuable than building and for the latter that measuring isn't valuable at all (perhaps other than for fame), but I don't see measuring as lesser let alone worthless. If anything, because measurement is, like writing, not generally valued, it's much easier to find high ROI measurement projects than high ROI building projects.

Let's start by looking at a few examples of high impact measurement projects. My go-to example for this is Kyle Kingsbury's work with Jepsen. Before Jepsen, a handful of huge companies (the now $1T+ companies that people are calling "hyperscalers") had decently tested distributed systems. They mostly didn't talk about testing methods in a way that really caused the knowledge to spread to the broader industry. Outside of those companies, most distributed systems were, by my standards, not particularly well tested.

At the time, a common pattern in online discussions of distributed correctness was:

Person A: Database X corrupted my data.
Person B: It works for me. It's never corrupted my data.
A: How do you know? Do you ever check for data corruption?
B: What do you mean? I'd know if we had data corruption (alternate answer: sure, we sometimes have data corruption, but it's probably a hardware problem and therefore not our fault)

Kyle's early work found critical flaws in nearly everything he tested, despite Jepsen being much less sophisticated then than it is now:

Many of these problems had existed for quite a while:

What’s really surprising about this problem is that it’s gone unaddressed for so long. The original issue was reported in July 2012; almost two full years ago. There’s no discussion on the website, nothing in the documentation, and users going through Elasticsearch training have told me these problems weren’t mentioned in their classes.

Kyle then quotes a number of users who ran into issues in production and then dryly notes

Some people actually advocate using Elasticsearch as a primary data store; I think this is somewhat less than advisable at present

Although we don't have an A/B test of universes where Kyle exists vs. not and can't say how long it would've taken for distributed systems to get serious about correctness in a universe where Kyle didn't exist, from having spent many years looking at how developers treat correctness bugs, I would bet on distributed systems having rampant correctness problems until someone like Kyle came along. The typical response that I've seen when a catastrophic bug is reported is that the project maintainers will assume that the bug report is incorrect (and you can see many examples of this if you look at responses from the first few years of Kyle's work). When the reporter doesn't have a repro for the bug, which is quite common when it comes to distributed systems, the bug will be written off as non-existent.

When the reporter does have a repro, the next line of defense is to argue that the behavior is fine (you can also see many examples of these from looking at responses to Kyle's work). Once the bug is acknowledged as real, the next defense is to argue that the bug doesn't need to be fixed because it's so uncommon (e.g., "It can be tempting to stand on an ivory tower and proclaim theory, but what is the real world cost/benefit? Are you building a NASA Shuttle Crawler-transporter to get groceries?"). And then, after it's acknowledged that the bug should be fixed, the final line of defense is to argue that the project takes correctness very seriously and there's really nothing more that could have been done; development and test methodology doesn't need to change because it was just a fluke that the bug occurred, and analogous bugs won't occur in the future without changes in methodology.

Kyle's work blew through these defenses and, without something like it, my opinion is that we'd still see these as the main defense used against distributed systems bugs (as opposed to test methodologies that can actually produce pretty reliable systems).

That's one particular example, but I find that it's generally true that, in areas where no one is publishing measurements/benchmarks of products, the products are generally sub-optimal, often in ways that are relatively straightforward to fix once measured. Here are a few examples:

This post has made some justifications for why measuring things is valuable but, to be honest, the impetus for my measurements is curiosity. I just want to know the answer to a question; most of the time, I don't write up my results. But even if you have no curiosity about what's actually happening when you interact with the world and you're "just" looking for something useful to do, the lack of measurements of almost everything means that it's easy to find high ROI measurement projects (at least in terms of impact on the world; if you want to make money, building something is probably easier to monetize).

Appendix: the motivation for my measurement posts

There's a sense in which it doesn't really matter why I decided to write these posts, but if I were reading someone else's post on this topic, I'd still be curious what got them writing, so here's what prompted me to write my measurement posts (which, for the purposes of this list, include posts where I collate data and don't do any direct measurement).

BTW, writing up this list made me realize that a narrative I had in my head about how and when I started really looking at data seriously must be wrong. I thought that this was something that came out of my current job, but that clearly cannot be the case since a decent fraction of my posts from before my current job are about looking at data and/or measuring things (and I didn't even list some of the data-driven posts where I just read some papers and look at what data they present). Blogger, measure thyself.

Appendix: why you can't trust some reviews

One thing that both increases and decreases the impact of doing good measurements is that most measurements that are published aren't very good. This increases the personal value of understanding how to do good measurements and of doing good measurements, but it blunts the impact on other people, since people generally don't understand what makes measurements invalid and don't have a good algorithm for deciding which measurements to trust.

There are a variety of reasons that published measurements/reviews are often problematic. A major issue with reviews is that, in some industries, reviewers are highly dependent on manufacturers for review copies.

Car reviews are one of the most extreme examples of this. Consumer Reports is the only major reviewer that independently sources their cars, which often causes them to disagree with other reviewers since they'll try to buy the trim level of the car that most people buy, which is often quite different from the trim level reviewers are given by manufacturers. Consumer Reports also generally manages to avoid reviewing cars that have been unrepresentatively picked or tuned. There have been a couple of cases where Consumer Reports reviewers (who buy the cars themselves) suspected that someone at the dealership realized they worked for Consumer Reports, because the seller then said they needed to keep the car overnight before handing over the car the reviewer had just bought; when that's happened, the reviewer has walked away from the purchase.

There's pretty significant copy-to-copy variation between cars and the cars reviewers get tend to be ones that were picked to avoid cosmetic issues (paint problems, panel gaps, etc.) as well as checked for more serious issues. Additionally, cars can have their software and firmware tweaked (e.g., it's common knowledge that review copies of BMWs have an engine "tune" that would void your warranty if you modified your car similarly).

Also, because Consumer Reports isn't getting review copies from manufacturers, they don't have to pull their punches and can write reviews that are highly negative, something you rarely see from car magazines and don't often see from car youtubers, where you generally have to read between the lines to get an honest review since a review that explicitly mentions negative things about a car can mean losing access (the youtuber who goes by "savagegeese" has mentioned having trouble getting access to cars from some companies after giving honest reviews).

Camera lenses are another area where it's been documented that reviewers get unusually good copies of the item. There's tremendous copy-to-copy variation between lenses so vendors pick out good copies and let reviewers borrow those. In many cases (e.g., any of the FE mount ZA Zeiss lenses or the Zeiss lens on the RX-1), based on how many copies of a lens people need to try and return to get a good copy, it appears that the median copy of the lens has noticeable manufacturing defects and that, in expectation, perhaps one in ten lenses has no obvious defect (this could also occur if only a few copies were bad and those were serially returned, but very few photographers really check to see if their lens has issues due to manufacturing variation). Because it's so expensive to obtain a large number of lenses, the amount of copy-to-copy variation was unquantified until lensrentals started measuring it; they've found that different manufacturers can have very different levels of copy-to-copy variation, which I hope will apply pressure to lens makers that are currently selling a lot of bad lenses while selecting good ones to hand to reviewers.

Hard drives are yet another area where it's been documented that reviewers get copies of the item that aren't representative. Extreme Tech has reported, multiple times, that Adata, Crucial, and Western Digital have handed out review copies of SSDs that are not what you get as a consumer. One thing I find interesting about that case is that Extreme Tech says

Agreeing to review a manufacturer’s product is an extension of trust on all sides. The manufacturer providing the sample is trusting that the review will be of good quality, thorough, and objective. The reviewer is trusting the manufacturer to provide a sample that accurately reflects the performance, power consumption, and overall design of the final product. When readers arrive to read a review, they are trusting that the reviewer in question has actually tested the hardware and that any benchmarks published were fairly run.

This makes it sound like the reviewer's job is to take, on trust, whatever sample is handed to them by the vendor and then run good benchmarks, absolving the reviewer of the responsibility of obtaining representative devices and ensuring that they're representative. I'm reminded of the SRE motto, "hope is not a strategy". Trusting vendors is not a strategy. We know that vendors will lie and cheat to look better at benchmarks. Saying that it's a vendor's fault for lying or cheating can shift the blame, but it won't result in reviews being accurate or useful to consumers.

We've only discussed a few specific areas where there's published evidence that reviews cannot be trusted because they're compromised by companies, but this isn't anything specific to those industries. As consumers, we should expect that any review that isn't performed by a trusted, independent agency that purchases its own review copies has been compromised and is not representative of the median consumer experience.

Another issue with reviews is that most online reviews that are highly ranked in search are really just SEO affiliate farms.

A more general issue is that reviews are affected by the exact same problem as items that are not reviewed: people generally can't tell which reviews are actually good and which are not, so review sites are selected on things other than the quality of the review. A prime example of this is Wirecutter, which is so popular among tech folks that noting how many tech apartments in SF have identical Wirecutter-recommended items is a tired joke. For people who haven't lived in SF, you can get a peek into the mindset by reading the comments on this post about how it's "impossible" not to buy the Wirecutter recommendation for anything, which are full of comments from people reassuring the poster that, due to the high value of the poster's time, it would be irresponsible to do anything else.

The thing I find funny about this is that if you take benchmarking seriously (in any field) and just read the methodology for the median Wirecutter review, without even trying out the items reviewed you can see that the methodology is poor and that they'll generally select items that are mediocre and sometimes even worst in class. A thorough exploration of this really deserves its own post, but I'll cite one example of poorly reviewed items here: in https://benkuhn.net/vc, Ben Kuhn looked into how to create a nice video call experience, which included trying out a variety of microphones and webcams. Naturally, Ben tried Wirecutter's recommended microphone and webcam. The webcam was quite poor, no better than using the camera from an ancient 2014 iMac or his 2020 Macbook (and, to my eye, actually much worse; more on this later). And the microphone was roughly comparable to using the built-in microphone on his laptop.

I have a lot of experience with Wirecutter's recommended webcam because so many people have it and it is shockingly bad in a distinctive way. Ben noted that, if you look at a still image, the white balance is terrible when the camera is used in the house he was in, and if you talk to other people who've used the camera, that is a common problem. But the issue I find worse is that, if you look at the video, under many conditions (and I think most, given how often I see this), the webcam will refocus regularly, making the entire video flash out of and then back into focus (another issue is that it often focuses on the wrong thing, but that's less common and I don't see it with everybody I talk to who uses Wirecutter's recommended webcam). I actually just had a call yesterday with a friend of mine who was using a different setup from the one I'd normally seen him with (the mediocre but perfectly acceptable MacBook webcam). His video was going in and out of focus every 10-30 seconds, so I asked him if he was using Wirecutter's recommended webcam and of course he was; what other webcam with that exact problem would someone in tech be using?

This level of review quality is pretty typical for Wirecutter reviews and they appear to generally be the most respected and widely used review site among people in tech.

Appendix: capitalism

When I was in high school, there was a clique of proto-edgelords who did things like read The Bell Curve and argue its talking points to anyone who would listen.

One of their favorite topics was how the free market would naturally cause companies that make good products to rise to the top and companies that make poor products to disappear, resulting in things generally being safe, a good value, and so on and so forth. I still commonly see this opinion espoused by people working in tech, including people who fill their condos with Wirecutter-recommended items. I find the juxtaposition of people arguing that the market will generally result in products being good while they themselves buy overpriced garbage to be deliciously ironic. To be fair, it's not all overpriced garbage. Some of it is overpriced mediocrity and some of it is actually good; it's just that it's not too different from what you'd get if you just naively bought random stuff off of Amazon without reading third-party reviews.

For a related discussion, see this post on people who argue that markets eliminate discrimination even as they discriminate.

Appendix: other examples of the impact of measurement (or lack thereof)

Thanks to Fabian Giesen, Ben Kuhn, Yuri Vishnevsky, @chordowl, Seth Newman, Justin Blank, Per Vognsen, John Hergenroeder, Pam Wolf, Ivan Echevarria, and Jamie Brandon for comments/corrections/discussion.

Against essential and accidental complexity

2020-12-29 08:00:00

In the classic 1986 essay, No Silver Bullet, Fred Brooks argued that there is, in some sense, not that much that can be done to improve programmer productivity. His line of reasoning is that programming tasks contain a core of essential/conceptual[1] complexity that's fundamentally not amenable to attack by any potential advances in technology (such as languages or tooling). He then uses an Amdahl's law argument, saying that because 1/X of complexity is essential, it's impossible to ever get more than a factor of X improvement via technological improvements.

Towards the end of the essay, Brooks claims that at least 1/2 (most) of complexity in programming is essential, bounding the potential improvement remaining for all technological programming innovations combined to, at most, a factor of 2[2]:

All of the technological attacks on the accidents of the software process are fundamentally limited by the productivity equation:

Time of task = Σ_i (Frequency_i × Time_i)

If, as I believe, the conceptual components of the task are now taking most of the time, then no amount of activity on the task components that are merely the expression of the concepts can give large productivity gains.
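
Restated in the usual Amdahl's-law form (the notation here is mine, not Brooks's): if a fraction e of the work is essential and the remaining 1 - e is accidental and can be sped up by a factor of k, then the overall speedup is bounded by

```latex
S_{\max} \;=\; \frac{1}{e + \frac{1-e}{k}} \;\le\; \frac{1}{e},
\qquad e \ge \tfrac{1}{2} \;\Rightarrow\; S_{\max} \le 2
```

The algebra itself is fine; the dispute in the rest of this post is with the premise that e is at least 1/2.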

Brooks states a bound on how much programmer productivity can improve. But, in practice, to state this bound correctly, one would have to be able to conceive of problems that no one would reasonably attempt to solve due to the amount of friction involved in solving the problem with current technologies.

Without being able to predict the future, this is impossible to estimate. If we knew the future, it might turn out that there's some practical limit on how much computational power or storage programmers can productively use, bounding the resources available to a programmer, but getting a bound on the amount of accidental complexity would still require one to correctly reason about how programmers are going to be able to use zillions of times more resources than are available today, which is so difficult we might as well call it impossible.

Moreover, for each class of tool that could exist, one would have to effectively anticipate all possible innovations. Brooks' strategy for this was to look at existing categories of tools and state, for each, that they would be ineffective or that they were effective but played out. This was wrong not only because it underestimated gains from classes of tools that didn't exist yet, weren't yet effective, or that he wasn't familiar with (e.g., he writes off formal methods, but it doesn't even occur to him to mention fuzzers, static analysis tools that don't fully formally verify code, tools like valgrind, etc.), but also because Brooks thought that every class of tool where there had been major improvement was played out, and it turns out that none of them were. For example, Brooks wrote off programming languages as basically done, just before the rise of "scripting languages" as well as just before GC languages took over the vast majority of programming[3]. Although you will occasionally hear statements like this, not many people will volunteer to write a webapp in C on the theory that modern languages can't offer more than a 2x gain over it.

Another one Brooks writes off is AI, saying "The techniques used for speech recognition seem to have little in common with those used for image recognition, and both are different from those used in expert systems". But, of course, this is no longer true: neural nets are highly effective for both image recognition and speech recognition. Whether or not they'll be highly effective as a programming tool is to be determined, but a linchpin of Brooks's argument against AI has been invalidated, and it's not a stretch to think that a greatly improved GPT-2 could give significant productivity gains to programmers. Of course, it's not reasonable to expect that Brooks could've foreseen neural nets becoming effective for both speech and image recognition, but that's exactly what makes it unreasonable for Brooks to write off all future advances in AI as well as every other field of computer science.

Brooks also underestimates gains from practices and tooling that enables those practices. Just for example, looking at what old-school programming gurus advocated, we have Ken Thompson arguing that language safety is useless and that bugs happen because people write fragile code, which they should not do if they don't want to have bugs, and Jamie Zawinski arguing that, when on a tight deadline, automated testing is a waste of time and "there's a lot to be said for just getting it right the first time" without testing. Brooks acknowledges the importance of testing, but the only possible improvement to testing that he mentions is expert systems that could make testing easier for beginners. If you look at the complexity of moderately large-scale modern software projects, they're well beyond any software project that had been seen in the 80s. If you really think about what it would mean to approach these projects using old-school correctness practices, I think the speedup from those sorts of practices to modern practices is infinite for a typical team, since most teams using those practices would fail to produce a working product at all if presented with a problem that many big companies have independently solved, e.g., produce a distributed database with some stated SLO. Someone could dispute the infinite speedup claim, but anyone who's worked on a complex project that's serious about correctness will have used tools and techniques that result in massive development speedups, easily more than 2x compared to 80s practices, a possibility that didn't seem to occur to Brooks, as it appears that Brooks thought that serious testing improvements were not possible due to the essential complexity involved in testing.

Another basic tooling/practice example would be version control. A version control system that supports multi-file commits, branches, automatic merging that generally works as long as devs don't touch the same lines, etc., is a fairly modern invention. During the 90s, Microsoft was at the cutting edge of software development and they didn't manage to get a version control system that supported the repo size they needed (30M LOC for Win2k development) and supported branches until after Win2k. Branches were simulated by simply copying the entire source tree and then manually attempting to merge copies of the source tree. Special approval was required to change the source tree and, due to the pain of manual merging, the entire Win2k team (5000 people, including 1400 devs and 1700 testers) could only merge 100 changes per day on a good day (0 on a bad day when the build team got stalled due to time spent fixing build breaks). This was a decade after Brooks was writing and there was still easily an order of magnitude speedup available from better version control tooling, test tooling and practices, machine speedups allowing faster testing, etc. Note that, in addition to not realizing that version control and test tooling would later result in massive productivity gains, Brooks claimed that hardware speedups wouldn't make developers significantly more productive even though hardware speed was noted to be a major limiting factor in Win2k development velocity. Brooks couldn't conceive of anyone building a project as complex as Win2k, which could really utilize faster hardware. Of course, using the tools and practices of Brooks's time, it was practically impossible to build a project as complex as Win2k, but tools and practices advanced so quickly that it was possible only a decade later, even if development velocity moved in slow motion compared to what we're used to today due to "stone age" tools and practices.

To pick another sub-part of the above, Brooks didn't list CI/CD as a potential productivity improvement because Brooks couldn't even imagine ever having tools that could possibly enable modern build practices. Writing in 1995, Brooks mentions that someone from Microsoft told him that they build nightly. To that, Brooks says that it may be too much work to enable building (at least) once a day, noting that Bell Northern Research, quite reasonably, builds weekly. Shortly after Brooks wrote that, Google was founded and engineers at Google couldn't even imagine settling for a setup like Microsoft had, let alone building once a week. They had to build a lot of custom software to get a monorepo of Google's scale onto what would be considered modern practices today, but they were able to do it. A startup that I worked for that was founded in 1995 also built out its own CI infra that allowed for constant merging and building from HEAD, because that's what anyone who was looking at what could be done, instead of thinking that everything that could be done has been done, would do. For large projects, just having CI/CD and maintaining a clean build, compared to building weekly, should easily be a 2x productivity improvement, larger than Brooks's claim that half of complexity is essential would allow. It's good that engineers at Google, the startup I worked for, and many other places didn't believe that a 2x improvement was impossible and actually built tools that enabled massive productivity improvements.

In some sense, looking at No Silver Bullet is quite similar to when we looked at Unix and found the Unix mavens saying that we should write software like they did in the 70s and that the languages they invented are as safe as any language can be. Since long before computers were invented, elders have been telling the next generation that they've done everything that there is to be done and that the next generation won't be able to achieve more. In the computer age, we've seen countless similar predictions outside of programming as well, such as Cliff Stoll's now-infamous prediction that the internet wouldn't change anything:

Visionaries see a future of telecommuting workers, interactive libraries and multimedia classrooms. They speak of electronic town meetings and virtual communities. Commerce and business will shift from offices and malls to networks and modems. And the freedom of digital networks will make government more democratic.

Baloney. Do our computer pundits lack all common sense? The truth is no online database will replace your daily newspaper ... How about electronic publishing? Try reading a book on disc. At best, it's an unpleasant chore: the myopic glow of a clunky computer replaces the friendly pages of a book. And you can't tote that laptop to the beach. Yet Nicholas Negroponte, director of the MIT Media Lab, predicts that we'll soon buy books and newspapers straight over the Internet. Uh, sure. ... Then there's cyberbusiness. We're promised instant catalog shopping—just point and click for great deals. We'll order airline tickets over the network, make restaurant reservations and negotiate sales contracts. Stores will become obsolete. So how come my local mall does more business in an afternoon than the entire Internet handles in a month?

If you do a little search and replace, Stoll is saying the same thing Brooks did. Sure, technologies changed things in the past, but I can't imagine how new technologies would change things, so they simply won't.

Even without knowing any specifics about programming, we would be able to see that these kinds of arguments have not historically held up and have decent confidence that the elders are not, in fact, correct this time.

Brooks kept writing about software for quite a while after he was a practitioner, but didn't bother to keep up with what was happening in industry after moving into academia in 1964, which is already obvious from the 1986 essay we looked at, but even more obvious if you look at his 2010 book, Design of Design, where he relies on the same examples he relied on in earlier essays and books, and where the bulk of his new material comes from a house that he built. We've seen that programmers who try to generalize their knowledge to civil engineering generally make silly statements that any 2nd-year civil engineering student can observe are false, and it turns out that trying to glean deep insights about software engineering design techniques from house building techniques doesn't work any better, but since Brooks didn't keep up with the industry, that's what he had to offer. While there are timeless insights that transcend era and industry, Brooks has very specific suggestions, e.g., running software teams like cocktail party surgical teams, which come from thinking about how one could improve on the development practices Brooks saw at IBM in the 50s. But it turns out the industry has moved well beyond IBM's 1950s software practices, and ideas that are improvements over what IBM did in the 1950s aren't particularly useful 70 years later.

Going back to the main topic of this post and looking at the specifics of what he talks about with respect to accidental complexity with the benefit of hindsight, we can see that Brooks' 1986 claim that we've basically captured all the productivity gains high-level languages can provide isn't too different from an assembly language programmer saying the same thing in 1955, thinking that assembly is as good as any language can be[4], and that his claims about other categories are similar. The main thing these claims demonstrate is a lack of imagination. When Brooks referred to conceptual complexity, he was referring to the complexity of using the conceptual building blocks that Brooks was familiar with in 1986 (on problems that Brooks would've thought of as programming problems). There's no reason anyone should think that Brooks' 1986 conception of programming is fundamental any more than they should think that how an assembly programmer from 1955 thought was fundamental. People often make fun of the apocryphal "640k should be enough for anybody" quote, but Brooks saying that, across all categories of potential productivity improvement, we've done most of what's possible to do, is analogous and not apocryphal!

If we look at the future, the fraction of complexity that might be accidental is effectively unbounded. One might argue that, if we look at the present, these terms wouldn't be meaningless. But, while this will vary by domain, I've personally never worked on a non-trivial problem that isn't completely dominated by accidental complexity, making the concept of essential complexity meaningless on any problem I've worked on that's worth discussing.

Appendix: concrete problems

Let's see how this essential complexity claim holds for a couple of things I did recently at work:

Logs

If we break this task down, we have

In 1986, perhaps I would have used telnet or ftp instead of scp. Modern scripting languages didn't exist yet (perl was created in 1987 and perl5, the first version that some argue is modern, was released in 1994), so writing code that would do this with parallelism and "good enough" error handling would have taken more than an order of magnitude more time than it takes today. In fact, I think just getting semi-decent error handling while managing a connection pool could have easily taken an order of magnitude longer than this entire task took me (not including time spent downloading logs in the background).
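
For concreteness, here's a minimal sketch of what the modern version of the download step can look like. The hostnames, paths, and worker count are hypothetical placeholders, and the error handling is deliberately of the "good enough" variety described above.

```python
# Minimal sketch of a parallel log fetcher with "good enough" error handling.
# Hostnames, paths, and the worker count are hypothetical placeholders.
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

hosts = [f"host{i}.example.com" for i in range(20)]
os.makedirs("logs", exist_ok=True)

def fetch(host):
    cmd = ["scp", f"{host}:/var/log/service/service.log", f"logs/{host}.log"]
    result = subprocess.run(cmd, capture_output=True, timeout=600)
    return host, result.returncode

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(fetch, h) for h in hosts]
    for fut in as_completed(futures):
        try:
            host, rc = fut.result()
            if rc != 0:
                print(f"{host}: scp exited {rc}, will retry or skip")
        except Exception as e:   # timeouts, etc.
            print(f"fetch failed: {e}")
```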

Next up would be parsing the logs. It's not fair to compare an absolute number like "1 TB", so let's just call this "enough that we care about performance" (we'll talk about scale in more detail in the metrics example). Today, we have our choice of high-performance languages where it's easy to write fast, safe code and harness the power of libraries (e.g., a regexp library[5]) that make it easy to write a quick and dirty script to parse and classify logs, farming out the work to all of the cores on my computer (I think Zig would've also made this easy, but I used Rust because my team has a critical mass of Rust programmers).

In 1986, there would have been no comparable language, but more importantly, I wouldn't have been able to trivially find, download, and compile the appropriate libraries and would've had to write all of the parsing code by hand, turning a task that took a few minutes into a task that I'd be lucky to get done in an hour. Also, if I didn't know how to use the library or that I could use a library, I could easily find out how I should solve the problem on StackOverflow, which would massively reduce accidental complexity. Needless to say, there was no real equivalent to Googling for StackOverflow solutions in 1986.
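
For flavor, here's a minimal Python sketch of the same general shape of quick-and-dirty parallel log classifier (the actual script described above was in Rust; the regex and the notion of a "component" to classify by are hypothetical placeholders).

```python
# Sketch of a quick-and-dirty parallel log classifier (the script described
# above was in Rust; this just shows the shape in Python). The regex and the
# notion of a "component" to classify by are hypothetical placeholders.
import glob
import re
from collections import Counter
from multiprocessing import Pool

PATTERN = re.compile(r"ERROR\s+\[(?P<component>[\w.-]+)\]")

def classify(path):
    counts = Counter()
    with open(path, errors="replace") as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                counts[m.group("component")] += 1
    return counts

if __name__ == "__main__":
    total = Counter()
    with Pool() as pool:   # one worker per core by default
        for counts in pool.imap_unordered(classify, glob.glob("logs/*.log")):
            total.update(counts)
    for component, n in total.most_common(20):
        print(n, component)
```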

Moreover, even today, this task, a pretty standard programmer devops/SRE task, after at least an order of magnitude speedup over the analogous task in 1986, is still nearly entirely accidental complexity.

If the data were exported into our metrics stack or if our centralized logging worked a bit differently, the entire task would be trivial. And if neither of those were true, but the log format were more uniform, I wouldn't have had to write any code after getting the logs; rg or ag would have been sufficient. If I look for how much time I spent on the essential conceptual core of the task, it's so small that it's hard to estimate.

Query metrics

We really only need one counter-example, but I think it's illustrative to look at a more complex task to see how Brooks' argument scales for a more involved task. If you'd like to skip this lengthy example, click here to skip to the next section.

We can view my metrics querying task as being made up of the following sub-tasks:

The first of these tasks is so many orders of magnitude quicker to accomplish today that I'm not even able to hazard a guess as to how much quicker it is, even to within one or two orders of magnitude, but let's break down the first task into component parts to get some idea about the ways in which the task has gotten easier.

It's not fair to port absolute numbers like 100 PB into 1986, but just the idea of having a pipeline that collects and persists comprehensive data analogous to the data I was looking at for a consumer software company (various data on the resource usage and efficiency of our software) would have been considered absurd in 1986. Here we see one fatal flaw in the idea that the ratio of essential to accidental complexity provides an upper bound on productivity improvements: tasks with too much accidental complexity wouldn't have even been considered possible. The limit on how much accidental complexity Brooks sees is really a limit of his imagination, not something fundamental.

Brooks explicitly dismisses increased computational power as something that will not improve productivity ("Well, how many MIPS can one use fruitfully?", more on this later), but both storage and CPU power (not to mention network speed and RAM) were sources of accidental complexity so large that they bounded the space of problems Brooks was able to conceive of.

In this example, let's say that we somehow had enough storage to keep the data we want to query in 1986. The next part would be to marshal on the order of 1 CPU-year's worth of resources and have the query complete in minutes. As with the storage problem, this would have also been absurd in 1986[6], so we've run into a second piece of non-essential complexity so large that it would stop a person from 1986 from thinking of this problem at all.
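
To put "a CPU-year of work answered in minutes" in perspective, the arithmetic (ignoring all overheads, and picking five minutes as an illustrative target) looks like this:

```python
# How much parallelism does "1 CPU-year of work, answered in minutes" imply?
# Five minutes is an illustrative target; overheads are ignored.
cpu_seconds_of_work = 365 * 24 * 3600        # one CPU-year
target_seconds = 5 * 60
cores_needed = cpu_seconds_of_work / target_seconds
print(f"~{cores_needed:,.0f} cores running flat out")   # ~105,000 cores
```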

Next up would be writing the query. If I were writing for the Cray-2 and wanted to be productive, I probably would have written the queries in Cray's dialect of Fortran 77. Could I do that in less than 300 seconds per query? Not a chance; I couldn't even come close with Scala/Scalding and I think it would be a near thing even with Python/PySpark. This is the aspect where I think we see the smallest gain and we're still well above one order of magnitude here.
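
For flavor, the kind of aggregation being described is a few lines in a modern framework. Here's a generic PySpark-style sketch; the dataset path and column names are hypothetical placeholders, and this isn't one of the actual queries.

```python
# Generic sketch of the kind of aggregation query being discussed.
# The dataset path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("resource-usage").getOrCreate()
df = spark.read.parquet("hdfs:///metrics/resource_usage/")   # hypothetical path

(df.where(F.col("date") >= "2020-01-01")
   .groupBy("service", "cluster")
   .agg(F.avg("cpu_utilization").alias("avg_cpu"),
        F.approx_count_distinct("host").alias("hosts"))
   .orderBy(F.desc("avg_cpu"))
   .show(50))
```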

After we have the data processed, we have to generate the plots. Even with today's technology, I think not using ggplot would cost me at least 2x in terms of productivity. I've tried every major plotting library that's supposedly equivalent (in any language) and every library I've tried either has multiple show-stopping bugs rendering plots that I consider to be basic in ggplot or is so low-level that I lose more than 2x productivity by being forced to do stuff manually that would be trivial in ggplot. In 2020, the existence of a single library already saves me 2x on this one step. If we go back to 1986, before the concept of the grammar of graphics and any reasonable implementation, there's no way that I wouldn't lose at least two orders of magnitude of time on plotting even assuming some magical workstation hardware that was capable of doing the plotting operations I do in a reasonable amount of time (my machine is painfully slow at rendering the plots; a Cray-2 would not be able to do the rendering in anything resembling a reasonable timeframe).

The number of orders of magnitude of accidental complexity reduction for this problem from 1986 to today is so large I can't even estimate it, and yet this problem still contains such a large fraction of accidental complexity that it's once again difficult to even guess at what fraction of the complexity is essential. To write down all of the accidental complexity I can think of would require at least 20k words, but just to provide a bit of the flavor of the complexity, let me write down a few things.

For each of Presto and ggplot I implicitly hold over a hundred things in my head to be able to get my queries and plots to work and I choose to use these because these are the lowest overhead tools that I know of that are available to me. If someone asked me to name the percentage of complexity I had to deal with that was essential, I'd say that it was so low that there's no way to even estimate it. For some queries, it's arguably zero — my work was necessary only because of some arbitrary quirk and there would be no work to do without the quirk. But even in cases where some kind of query seems necessary, I think it's unbelievable that essential complexity could have been more than 1% of the complexity I had to deal with.

Revisiting Brooks on computer performance, even though I deal with complexity due to the limitations of hardware performance in 2020 and would love to have faster computers today, Brooks wrote off faster hardware as pretty much not improving developer productivity in 1986:

What gains are to be expected for the software art from the certain and rapid increase in the power and memory capacity of the individual workstation? Well, how many MIPS can one use fruitfully? The composition and editing of programs and documents is fully supported by today’s speeds. Compiling could stand a boost, but a factor of 10 in machine speed would surely . . .

But this is wrong on at least two levels. First, if I had access to faster computers, a huge amount of my accidental complexity would go away (if computers were powerful enough, I wouldn't need complex tools like Presto; I could just run a query on my local computer). We have much faster computers now, but it's still true that having faster computers would make many involved engineering tasks trivial. As James Hague notes, in the mid-80s, writing a spellchecker was a serious engineering problem due to performance constraints.

Second, (just for example) ggplot only exists because computers are so fast. A common complaint from people who work on performance is that tool X has somewhere between two and ten orders of magnitude of inefficiency when you look at the fundamental operations it does vs. the speed of hardware today7. But what fraction of programmers can realize even one half of the potential performance of a modern multi-socket machine? I would guess fewer than one in a thousand and I would say certainly fewer than one in a hundred. And performance knowledge isn't independent of other knowledge — controlling for age and experience, it's negatively correlated with knowledge of non-"systems" domains since time spent learning about the esoteric accidental complexity necessary to realize half of the potential of a computer is time spent not learning about "directly" applicable domain knowledge. When we look at software that requires a significant amount of domain knowledge (e.g., ggplot) or that's large enough that it requires a large team to implement (e.g., IntelliJ8), the vast majority of it wouldn't exist if machines were orders of magnitude slower and writing usable software required wringing most of the performance out of the machine. Luckily for us, hardware has gotten much faster, allowing the vast majority of developers to ignore performance-related accidental complexity and instead focus on all of the other accidental complexity necessary to be productive today.

Faster computers both reduce the amount of accidental complexity tool users run into as well as the amount of accidental complexity that tool creators need to deal with, allowing more productive tools to come into existence.

2022 Update

A lot of people have said that this post is wrong because Brooks was obviously saying X and did not mean the things I quoted in this post. But people state all sorts of different Xs for what Brooks really meant, so, in aggregate, these counterarguments are self-refuting: each assumes Brooks "obviously" meant one specific thing, but if it were so obvious, people wouldn't have so many different ideas of what he meant.

This is, of course, inevitable when it comes to a Rorschach test essay like Brooks's essay, which states a wide variety of different and contradictory things.

Thanks to Peter Bhat Harkins, Ben Kuhn, Yuri Vishnevsky, Chris Granger, Wesley Aptekar-Cassels, Sophia Wisdom, Lifan Zeng, Scott Wolchok, Martin Horenovsky, @realcmb, Kevin Burke, Aaron Brown, @up_lurk, and Saul Pwanson for comments/corrections/discussion.


  1. The accidents I discuss in the next section. First let us consider the essence

    The essence of a software entity is a construct of interlocking concepts: data sets, relationships among data items, algorithms, and invocations of functions. This essence is abstract, in that the conceptual construct is the same under many different representations. It is nonetheless highly precise and richly detailed.

    I believe the hard part of building software to be the specification, design, and testing of this conceptual construct, not the labor of representing it and testing the fidelity of the representation. We still make syntax errors, to be sure; but they are fuzz compared to the conceptual errors in most systems.

    [return]
  2. Curiously, he also claims, in the same essay, that no individual improvement can yield a 10x improvement within one decade. While this technically doesn't contradict his Amdahl's law argument plus the claim that "most" (i.e., at least half) of complexity is essential/conceptual, it's unclear why he would include this claim as well.

    When Brooks revisited his essay in 1995 in No Silver Bullet Refired, he claimed that he was correct by using the weakest form of the three claims he made in 1986: that within one decade, no single improvement would result in an order of magnitude improvement. However, he then re-stated the strongest form of his 1986 claim, saying that, this time, no set of technological improvements could improve productivity by more than 2x, for real:

    It is my opinion, and that is all, that the accidental or representational part of the work is now down to about half or less of the total. Since this fraction is a question of fact, its value could in principle be settled by measurement. Failing that, my estimate of it can be corrected by better informed and more current estimates. Significantly, no one who has written publicly or privately has asserted that the accidental part is as large as 9/10.

    By the way, I find it interesting that he says that no one disputed this 9/10ths figure. Per the body of this post, I would put it at far above 9/10th for my day-to-day work and, if I were to try to solve the same problems in 1986, the fraction would have been so high that people wouldn't have even conceived of the problem. As a side effect of having worked in hardware for a decade, I've also done work that's not too different from what some people faced in 1986 (microcode, assembly & C written for DOS) and I would put that work as easily above 9/10th as well.

    Another part of his follow-up that I find interesting is that he quotes Harel's "Biting the Silver Bullet" from 1992, which, among other things, argues that the decade deadline for an order of magnitude improvement is arbitrary. Brooks' response to this is:

    There are other reasons for the decade limit: the claims made for candidate bullets all have had a certain immediacy about them . . . We will surely make substantial progress over the next 40 years; an order of magnitude over 40 years is hardly magical.

    But by Brooks' own words when he revisits the argument in 1995, if 9/10th of complexity is essential, it would be impossible to get more than an order of magnitude improvement from reducing it, with no caveat on the timespan:

    "NSB" argues, indisputably, that if the accidental part of the work is less than 9/10 of the total, shrinking it to zero (which would take magic) will not give an order of magnitude productivity improvement.

    Both his original essay and the 1995 follow-up are charismatically written and contain a sort of local logic, where each piece of the essay sounds somewhat reasonable if you don't think about it too hard and you forget everything else the essay says. As with the original, a pedant could argue that this is technically not incoherent — after all, Brooks could be saying:

    • at most 9/10th of complexity is accidental (if we ignore the later 1/2 claim, which is the kind of suspension of memory/disbelief one must do to read the essay)
    • it would not be surprising for us to eliminate 100% of accidental complexity after 40 years

    While this is technically consistent (again, if we ignore the part that's inconsistent) and is a set of claims one could make, this would imply that 40 years from 1986, i.e., in 2026, it wouldn't be implausible for there to be literally zero room for any sort of productivity improvement from tooling, languages, or any other potential source of improvement. But this is absurd. If we look at other sections of Brooks' essay and combine their reasoning, we see other inconsistencies and absurdities.

    [return]
  3. Another issue that we see here is Brooks' insistence on bright-line distinctions between categories. Essential vs. accidental complexity. "Types" of solutions, such as languages vs. "build vs. buy", etc.

    Brooks admits that "build vs. buy" is one avenue of attack on essential complexity. Perhaps he would agree that buying a regexp package would reduce the essential complexity since that would allow me to avoid keeping all of the concepts associated with writing a parser in my head for simple tasks. But what if, instead of buying regexes, I used a language where they're bundled into the standard library or otherwise distributed with the language? Or what if, instead of having to write my own concurrency primitives, those are bundled into the language? Or for that matter, what about an entire HTTP server? There is no bright-line distinction between what's in a library one can "buy" (for free in many cases nowadays) and what's bundled into the language, so there cannot be a bright-line distinction between what gains a language provides and what gains can be "bought". But if there's no bright-line distinction here, then it's not possible to say that one of these can reduce essential complexity and the other can't while maintaining a bright-line distinction between essential and accidental complexity (in a response to Brooks, Harel argued against there being a clear distinction, and Brooks' reply was to say that there is, in fact, a bright-line distinction, although he provided no new argument).

    Brooks' repeated insistence on these false distinctions means that the reasoning in the essay isn't composable. As we've already seen in another footnote, if you take reasoning from one part of the essay and apply it alongside reasoning from another part of the essay, it's easy to create absurd outcomes and sometimes outright contradictions.

    I suspect this is one reason discussions about essential vs. accidental complexity are so muddled. It's not just that Brooks is being vague and handwave-y; he's actually not self-consistent, so there isn't and cannot be a coherent takeaway. Michael Feathers has noted that people are generally not able to correctly identify essential complexity; as he says, "One person's essential complexity is another person's accidental complexity." This is exactly what we should expect from the essay, since people who have different parts of it in mind will end up with incompatible views.

    This is also a problem when criticizing Brooks. Inevitably, someone will say that what Brooks really meant was something completely different. And that will be true. But Brooks will have meant something completely different while also having meant the things he said that I mention. In defense of the view I'm presenting in the body of the text here, it's a coherent view that one could have had in 1986. Many of Brooks' statements don't make sense even when considered as standalone statements, let alone when cross-referenced with the rest of his essay. For example, take the statement that no single development will result in an order of magnitude improvement in the next decade. This statement is meaningless because Brooks does not define, and no one can definitively say, what a "single improvement" is. And, as mentioned above, Brooks' essay reads quite oddly and basically does not make sense if that's what he's trying to claim. Another issue with most other readings of Brooks is that those positions would also be meaningless even if Brooks had done the work to make them well defined. Why does it matter whether one improvement or two results in an order of magnitude improvement? If it's two improvements, we'll use them both.

    [return]
  4. And by the way, this didn't only happen in 1955. I've worked with people who, this century, told me that assembly is basically as productive as any high level language. This probably sounds ridiculous to almost every reader of this blog, but if you talk to people who spend all day writing microcode or assembly, you'll occasionally meet somebody who believes this.

    Thinking that the tools you personally use are as good as it gets is an easy trap to fall into.

    [return]
  5. Another quirk is that, while Brooks acknowledges that code re-use and libraries can increase productivity, he claims that gains from languages and tools are pretty much tapped out; these claims can't both hold because there isn't a bright-line distinction between libraries and languages/tools. [return]
  6. Let's arbitrarily use a Motorola 68k processor with an FP co-processor that could do 200 kFLOPS as a reference for how much power we might have in a consumer CPU (FLOPS is a bad metric for multiple reasons, but this is just to get an idea of what it would take to get 1 CPU-year of computational resources, and Brooks himself uses MIPS as a term as if it's meaningful). By comparison, the Cray-2 could achieve 1.9 GFLOPS, or roughly 10000x the performance (I think actually less if we were to do an apples-to-apples comparison instead of using non-comparable GFLOPS numbers, but let's be generous here). There are 525600 / 5 = 105120 five minute periods in a year, so to get 1 CPU year's worth of computation in five minutes we'd need 105120 / 10000 ≈ 10 Cray-2s per query, not including the overhead of aggregating results across Cray-2s (this arithmetic is replayed in the quick check after these footnotes).

    It's unreasonable to think that a consumer software company in 1986 would have enough Cray-2s lying around to allow any random programmer to quickly run CPU-years worth of queries whenever they wanted to do some data analysis. One source claims that 27 Cray-2s were ever made over the production lifetime of the machine (1985 to 1990). Even if my employer owned all of them and they had all been built by 1986, that still wouldn't be sufficient to allow the kind of ad hoc querying capacity that I have access to in 2020.

    Today, someone at a startup can even make an analogous argument when comparing to a decade ago. You used to have to run a cluster that would be prohibitively annoying for a startup to operate unless the startup was very specialized, but you can now just use Snowflake and basically get Presto while only paying for the computational power you use (plus a healthy markup), instead of paying to own a cluster and for all of the employees necessary to keep the cluster operable.

    [return]
  7. I actually run into one of these every time I publish a new post. I write my posts in Google docs and then copy them into emacs running inside tmux running inside Alacritty. My posts are small enough to fit inside L2 cache, so I could have 64B/3.5 cycle write bandwidth. And yet, the copy+paste operation can take ~1 minute and is so slow I can watch the text get pasted in. Since my chip is working super hard to make sure the copy+paste happens, it's running at its full non-turbo frequency of 4.2GHz, giving it 76.8GB/s of write bandwidth. For a 40kB post, 1 minute = 666B/s. 76.8G/666 =~ 8 orders of magnitude left on the table (also replayed in the quick check after these footnotes). [return]
  8. In this specific case, I'm sure somebody will argue that Visual Studio was quite nice in 2000 and ran on much slower computers (and the debugger was arguably better than it is in the current version). But there was no comparable tool on Linux, nor was there anything comparable to today's options in the VSCode-like space of easy-to-learn programming editors that provide programming-specific facilities (as opposed to being souped up versions of notepad) without being full-fledged IDEs. [return]
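
As a quick check of the arithmetic in footnotes 6 and 7, here's the same back-of-the-envelope math replayed as a few lines of Python (no new data; just the numbers from the footnotes):

```python
# Quick check of the arithmetic in footnotes 6 and 7; all numbers are taken
# from the footnotes themselves rather than from any new source.
import math

# Footnote 6: Cray-2s needed to compress a consumer-CPU-year of work into
# five minutes, assuming ~200 kFLOPS for the consumer CPU and ~1.9 GFLOPS
# for the Cray-2.
speedup = 1.9e9 / 200e3                 # ~9,500x (the footnote rounds this to 10,000x)
five_minute_periods = 525_600 / 5       # 105,120 five-minute periods in a year
print(five_minute_periods / speedup)    # ~11 Cray-2s; the footnote's rounding gives ~10

# Footnote 7: orders of magnitude between achievable L2 write bandwidth and
# the observed paste speed of a ~40 kB post over ~1 minute.
write_bw = 4.2e9 * 64 / 3.5             # ~76.8 GB/s at 4.2 GHz, 64 B per 3.5 cycles
paste_bw = 40e3 / 60                    # ~666 B/s
print(math.log10(write_bw / paste_bw))  # ~8 orders of magnitude
```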

How do cars fare in crash tests they're not specifically optimized for?

2020-06-30 15:06:34

Any time you have a benchmark that gets taken seriously, some people will start gaming the benchmark. Some famous examples in computing are the CPU benchmark specfp and video game benchmarks. With specfp, Sun managed to increase its score on 179.art (a sub-benchmark of specfp) by 12x with a compiler tweak that essentially re-wrote the benchmark kernel, which increased the Sun UltraSPARC’s overall specfp score by 20%. At times, GPU vendors have added specialized benchmark-detecting code to their drivers that lowers image quality during benchmarking to produce higher benchmark scores. Of course, gaming the benchmark isn't unique to computing and we see people do this in other fields. It’s not surprising that we see this kind of behavior since improving benchmark scores by cheating on benchmarks is much cheaper (and therefore higher ROI) than improving benchmark scores by actually improving the product.

As a result, I'm generally suspicious when people take highly specific and well-known benchmarks too seriously. Without other data, you don't know what happens when conditions aren't identical to the conditions in the benchmark. With GPU and CPU benchmarks, it’s possible for most people to run the standard benchmarks with slightly tweaked conditions. If the results change dramatically for small changes to the conditions, that’s evidence that the vendor is, if not cheating, at least shading the truth.

Benchmarks of physical devices can be more difficult to reproduce. Vehicle crash tests are a prime example of this -- they're highly specific, well-known benchmarks, and each test run uses up a car.

While there are multiple organizations that do crash tests, they each have particular protocols that they follow. Car manufacturers, if so inclined, could optimize their cars for crash test scores instead of actual safety. Checking to see if crash tests are being gamed with hyper-specific optimizations isn't really feasible for someone who isn't a billionaire. The easiest way we can check is by looking at what happens when new tests are added since that lets us see a crash test result that manufacturers weren't optimizing for just to get a good score.

While having car crash test results is obviously better than not having them, the results themselves don't tell us what happens when we get into an accident that doesn't exactly match a benchmark. Unfortunately, if we get into a car accident, we don't get to ask the driver of the vehicle we're colliding with to change their location, angle of impact, and speed in order for the collision to comply with an IIHS, NHTSA, or *NCAP test protocol.

For this post, we're going to look at IIHS test scores from when the (driver-side) small overlap and passenger-side small overlap tests were added, in 2012 and 2018, respectively. We'll start with a summary of the results and then discuss what those results mean and other factors to consider when evaluating car safety, followed by details of the methodology.

Results

The ranking below is mainly based on how well vehicles scored when the driver-side small overlap test was added in 2012 and how well models scored when they were modified to improve test results.

These descriptions are approximations. Honda, Ford, and Tesla are the poorest fits: Ford is arguably halfway between Tier 4 and Tier 5, but also arguably better than Tier 4 and outside the classification entirely, and Honda and Tesla don't properly fit into any category (the categories they're listed under are just the closest fits). Some other placements are also imperfect. Details below.

General commentary

If we look at overall mortality in the U.S., there's a pretty large age range for which car accidents are the leading cause of death. Although the numbers will vary depending on what data set we look at, when the driver-side small overlap test was added, the IIHS estimated that 25% of vehicle fatalities came from small overlap crashes. It's also worth noting that small overlap crashes were thought to be implicated in a significant fraction of vehicle fatalities at least since the 90s; this was not a novel concept in 2012.

Despite the importance of small overlap crashes, looking at the results when the IIHS added the driver-side and passenger-side small overlap tests in 2012 and 2018, it appears that almost all car manufacturers were optimizing for benchmark scores and not overall safety. Except for Volvo, all carmakers examined produced cars that fared poorly on driver-side small overlap crashes until the driver-side small overlap test was added.

When the driver-side small overlap test was added in 2012, most manufacturers modified their vehicles to improve driver-side small overlap test scores. However, until the IIHS added a passenger-side small overlap test in 2018, most manufacturers skimped on the passenger side. When the new test was added, they beefed up passenger safety as well. To be fair to car manufacturers, some of them got the hint about small overlap crashes when the driver-side test was added in 2012 and did not need to make further modifications to score well on the passenger-side test, including Mercedes, BMW, and Tesla (and arguably a couple of others, but the data is thinner in the other cases; Volvo didn't need a hint).

Other benchmark limitations

There are a number of other areas where we can observe that most car makers are optimizing for benchmarks at the expense of safety.

Gender, weight, and height

Another issue is crash test dummy overfitting. For a long time, adult NHTSA and IIHS tests used a 1970s 50%-ile male dummy, which is 5'9" and 171lbs. Regulators called for a female dummy in 1980 but due to budget cutbacks during the Reagan era, initial plans were shelved and the NHTSA didn't put one in a car until 2003. The female dummy is a scaled down version of the male dummy, scaled down to 5%-ile 1970s height and weight (4'11", 108lbs; another model is 4'11", 97lbs). In frontal crash tests, when a female dummy is used, it's always a passenger (a 5%-ile woman is in the driver's seat in one NHTSA side crash test and the IIHS side crash test). For reference, in 2019, the average weight of a U.S. adult male was 198 lbs and the average weight of a U.S. adult female was 171 lbs.

Using a 1970s U.S. adult male crash test dummy causes a degree of overfitting for 1970s 50%-ile men. For example, starting in the 90s, manufacturers started adding systems to protect against whiplash. Volvo and Toyota use a kind of system that reduces whiplash in men and women and appears to have slightly more benefit for women. Most car makers use a kind of system that reduces whiplash in men but, on average, has little impact on whiplash injuries in women.

It appears that we also see a similar kind of optimization for crashes in general and not just whiplash. We don't have crash test data on this, and looking at real-world safety data is beyond the scope of this post, but I'll note that, until around the time the NHTSA put the 5%-ile female dummy into some crash tests, most car manufacturers not named Volvo had a significant fatality rate differential in side crashes based on gender (with men dying at a lower rate and women dying at a higher rate).

Volvo claims to have been using computer models to simulate what would happen if women (including pregnant women) are involved in a car accident for decades.

Other crashes

Volvo is said to have a crash test facility where they do a number of other crash tests that aren't done by testing agencies. A reason they scored well on the small overlap tests when those were added is that they were already doing small overlap crash tests before the IIHS was.

Volvo also says that they test rollovers (the IIHS tests roof strength and the NHTSA computes how difficult a car is to roll based on properties of the car, but neither tests what happens in a real rollover accident), rear collisions (Volvo claims these are especially important to test if there are children in the 3rd row of a 3-row SUV), and driving off the road (Volvo has a "standard" ditch they use; they claim this test is important because running off the road is implicated in a large fraction of vehicle fatalities).

If other car makers do similar tests, I couldn't find much out about the details. Based on crash test scores, it seems like they weren't doing or even considering small overlap crash tests before 2012. Based on how many car makers had poor scores when the passenger side small overlap test was added in 2018, I think it would be surprising if other car makers had a large suite of crash tests they ran that aren't being run by testing agencies, but it's theoretically possible that they do and just didn't include a passenger side small overlap test.

Caveats

We shouldn't overgeneralize from these test results. As we noted above, crash test results test very specific conditions. As a result, what we can conclude when a couple new crash tests are added is also very specific. Additionally, there are a number of other things we should keep in mind when interpreting these results.

Limited sample size

One limitation of this data is that we don't have results for a large number of copies of the same model, so we're unable to observe intra-model variation, which could occur due to minor, effectively random, differences in test conditions as well as manufacturing variation between different copies of the same model. We can observe that these do matter since some cars will see different results when two copies of the same model are tested. For example, here's a quote from the IIHS report on the Dodge Dart:

The Dodge Dart was introduced in the 2013 model year. Two tests of the Dart were conducted because electrical power to the onboard (car interior) cameras was interrupted during the first test. In the second Dart test, the driver door opened when the hinges tore away from the door frame. In the first test, the hinges were severely damaged and the lower one tore away, but the door stayed shut. In each test, the Dart’s safety belt and front and side curtain airbags appeared to adequately protect the dummy’s head and upper body, and measures from the dummy showed little risk of head and chest injuries.

It looks like, had electrical power to the interior car cameras not been disconnected, there would have been only one test and it wouldn't have become known that there's a risk of the door coming off due to the hinges tearing away. In general, we have no direct information on what would happen if another copy of the same model were tested.

Using IIHS data alone, one thing we might do here is to also consider results from different models made by the same manufacturer (or built on the same platform). Although this isn't as good as having multiple tests for the same model, test results between different models from the same manufacturer are correlated and knowing that, for example, a 2nd test of a model that happened by chance showed significantly worse results should probably reduce our confidence in other test scores from the same manufacturer. There are some things that complicate this, e.g., if looking at Toyota, the Yaris is actually a re-branded Mazda2, so perhaps that shouldn't be considered as part of a pooled test result, and doing this kind of statistical analysis is beyond the scope of this post.

Actual vehicle tested may be different

Although I don't think this should impact the results in this post, another issue to consider when looking at crash test results is how results are shared between models. As we just saw, different copies of the same model can have different results. Vehicles that are somewhat similar are often considered the same for crash test purposes and will share the same score (only one of the models will be tested).

For example, this is true of the Kia Stinger and the Genesis G70. The Kia Stinger is 6" longer than the G70 and a fully loaded AWD Stinger is about 500 lbs heavier than a base-model G70. The G70 is the model that IIHS tested -- if you look up a Kia Stinger, you'll get scores for a Stinger with a note that a base model G70 was tested. That's a pretty big difference considering that cars that are nominally identical (such as the Dodge Darts mentioned above) can get different scores.

Quality may change over time

We should also be careful not to overgeneralize temporally. If we look at crash test scores of recent Volvos (vehicles on the Volvo P3 and Volvo SPA platforms), crash test scores are outstanding. However, if we look at Volvo models based on the older Ford C1 platform1, crash test scores for some of these aren't as good (in particular, while the S40 doesn't score poorly, it scores Acceptable in some categories instead of Good across the board). Although Volvo has had stellar crash test scores recently, this doesn't mean that they have always had or will always have stellar crash test scores.

Models may vary across markets

We also can't generalize across cars sold in different markets, even for vehicles that sound like they might be identical. For example, see this crash test of a Nissan NP300 manufactured for sale in Europe vs. a Nissan NP300 manufactured for sale in Africa. Since European cars undergo EuroNCAP testing (similar to how U.S. cars undergo NHTSA and IIHS testing), vehicles sold in Europe are optimized to score well on EuroNCAP tests. Crash testing cars sold in Africa has only been done relatively recently, so car manufacturers haven't had PR pressure to optimize their cars for benchmarks and they'll produce cheaper models or cheaper variants of what superficially appear to be the same model. This appears to be no different from what most car manufacturers do in the U.S. or Europe -- they're optimizing for cost as long as they can do that without scoring poorly on benchmarks. It's just that, since there wasn't an African crash test benchmark, that meant they could go all-in on the cost side of the cost-safety tradeoff2.

This report compared U.S. and European car models and found differences in safety due to differences in regulations. It found that European models had lower injury risk in frontal and side crashes and that their driver-side mirrors were designed in a way that reduced the risk of lane-change crashes relative to U.S. designs, while U.S. vehicles were safer in rollovers and had headlamps that made pedestrians more visible.

Non-crash tests

Over time, more and more of the "low hanging fruit" from crash safety has been picked, making crash avoidance relatively more important. Tests of crash mitigation are relatively primitive compared to crash tests and we've seen that crash tests had and have major holes. One might expect, based on what we've seen with crash tests, that Volvo has a particularly good set of tests they use for their crash avoidance technology (traction control, stability control, automatic braking, etc.), but "bar room" discussion with folks who are familiar with what vehicle safety tests are being done on automated systems seems to indicate that's not the case. There was a relatively recent recall of quite a few Volvo vehicles due to the safety systems failing to trigger when they should have. I'm not going to tell the story about that one here, but I'll say that it's fairly horrifying and indicative of serious systemic issues. From other backchannel discussions, it sounds like BMW is relatively serious about the software side of safety, for a car company, but the lack of rigor in this kind of testing would be horrifying to someone who's seen a release process for something like a mainstream CPU.

Crash avoidance becoming more important might also favor companies that have more user-friendly driver assistance systems, e.g., in multiple generations of tests, Consumer Reports has given GM's Super Cruise system the highest rating while they've repeatedly noted that Tesla's Autopilot system facilitates unsafe behavior.

Scores of vehicles of different weights aren't comparable

A 2700lb subcompact vehicle that scores Good may fare worse than a 5000lb SUV that scores Acceptable. This is because the small overlap tests involve driving the vehicle into a fixed obstacle, as opposed to a reference vehicle or vehicle-like obstacle of a specific weight. This is, in some sense, equivalent to crashing the vehicle into a vehicle of the same weight, so it's as if the 2700lb subcompact was tested by running it into a 2700lb subcompact and the 5000lb SUV was tested by running it into another 5000 lb SUV.

How to increase confidence

We've discussed some reasons we should reduce our confidence in crash test scores. If we wanted to increase our confidence in results, we could look at test results from other test agencies and aggregate them and also look at public crash fatality data (more on this later). I haven't looked at the terms and conditions of scores from other agencies, but one complication is that the IIHS does not allow you to display the result of any kind of aggregation if you use their API or data dumps (I, time consumingly, did not use their API for this post because of that).

Using real life crash data

Public crash fatality data is complex and deserves its own post. In this post, I'll note that, if you look at the easiest relevant data for people in the U.S., this data does not show that Volvos are particularly safe (or unsafe). For example, if we look at this report from 2017, which covers models from 2014, two Volvo models made it into the report and both score roughly middle of the pack for their class. In the previous report, one Volvo model is included and it's among the best in its class; in the next, one Volvo model is included and it's among the worst in its class. We can observe this kind of variance for other models as well. For example, among 2014 models, the Volkswagen Golf had one of the highest fatality rates of all vehicles (not just in its class). But among 2017 vehicles, it had among the lowest fatality rates of all vehicles. It's unclear how much of that change is from random variation and how much is because of differences between a 2014 and a 2017 Volkswagen Golf.

Overall, it seems like noise is a pretty important factor in results. And if we look at the information that's provided, we can see a few things that are odd. First, there are a number of vehicles where the 95% confidence interval for the fatality rate runs from 0 to N. We should have pretty strong priors that there was no 2014 model vehicle that was so safe that the probability of being killed in a car accident was zero. If we were taking a Bayesian approach (though I believe the authors of the report are not), and someone told us that the uncertainty interval for the true fatality rate of a vehicle had a >= 5% chance of including zero, we would say that either we should use a more informative prior or we should use a model that can incorporate more data (in this case, perhaps we could try to understand the variance between fatality rates of different models in the same class and then use the base rate of fatalities for the class as a prior, or we could incorporate information from other models under the same make if those are believed to be correlated).

Some people object to using informative priors as a form of bias laundering, but we should note that the prior that's used for the IIHS analysis is not completely uninformative. All of the intervals reported stop at zero because they're using the fact that a vehicle cannot create life to bound the interval at zero. But we have information that's nearly as strong that no 2014 vehicle is so safe that its expected fatality rate is zero; using that information is not fundamentally different from capping the interval at zero and not reporting negative numbers for the uncertainty interval of the fatality rate.
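
As a minimal sketch of what an informative prior buys here (this is not the method used in the report, and all of the counts below are made up for illustration), here's a Gamma-Poisson model where the prior is centered on a hypothetical class-wide base rate; even with zero observed deaths, the posterior interval's lower bound stays above zero:

```python
# A minimal sketch, not the method used in the report: a Gamma-Poisson model
# for a single model's driver fatality rate, with a weakly informative prior
# centered on a hypothetical class-wide base rate. All counts are made up.
from scipy import stats

deaths = 0                    # observed driver deaths for this model (hypothetical)
exposure = 200_000            # registered vehicle-years (hypothetical)
base_rate = 40 / 1_000_000    # hypothetical class base rate: 40 deaths per million vehicle-years

# Encode the prior as "base_rate observed over 50,000 vehicle-years" (a judgment call).
prior_exposure = 50_000
a0, b0 = base_rate * prior_exposure, prior_exposure

# The Gamma prior is conjugate to the Poisson likelihood, so the posterior is
# Gamma(a0 + deaths, b0 + exposure).
a_post, b_post = a0 + deaths, b0 + exposure
lo, hi = stats.gamma.ppf([0.025, 0.975], a_post, scale=1 / b_post)
print(f"95% interval: {lo * 1e6:.1f} to {hi * 1e6:.1f} deaths per million vehicle-years")
# Unlike the zero-bounded intervals in the report, the lower bound here is above zero.
```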

Also, the IIHS data only includes driver fatalities. This is understandable since that's the easiest way to normalize for the number of passengers in the car, but it means that we can't see the impact of car makers not improving passenger small-overlap safety before the passenger-side small overlap test was added in 2018, or the impact of the lack of rear crash testing for the case Volvo considers important (kids in the back row of a 3-row SUV). It also means that we cannot observe the impact of a number of things Volvo has done, e.g., being very early on pedestrian and then cyclist detection in their automatic braking system, adding a crumple zone to reduce back injuries in run-off-road accidents (which they observed often cause life-changing spinal injuries from the impact when the vehicle drops), etc.

We can also observe that, in the IIHS analysis, many factors that one might want to control for aren't (e.g., miles driven isn't controlled for, which will make trucks look relatively worse and luxury vehicles look relatively better; rural vs. urban miles driven also isn't controlled for, which has the same directional impact). One way to see that the numbers are heavily influenced by confounding factors is by looking at AWD or 4WD vs. 2WD versions of cars. They often have wildly different fatality rates even though the safety differences are not very large (and the difference is often in favor of the 2WD vehicle). Some plausible causes of that are random noise, differences in who buys different versions of the same vehicle, and differences in how the vehicles are used.

If we'd like to answer the question "which car makes or models are more or less safe", I don't find any of the aggregations that are publicly available to be satisfying and I think we need to look at the source data and do our own analysis to see if the data are consistent with what we see in crash test results.

Conclusion

We looked at 12 different car makes and how they fared when the IIHS added small overlap tests. We saw that only Volvo was taking this kind of accident seriously before companies were publicly shamed for having poor small overlap safety by the IIHS even though small overlap crashes were known to be a significant source of fatalities at least since the 90s.

Although I don't have the budget to do other tests, such as a rear crash test in a fully occupied vehicle, it appears plausible and perhaps even likely that most car makers that aren't Volvo would have mediocre or poor test scores if a testing agency decided to add another kind of crash test.

Bonus: "real engineering" vs. programming

As Hillel Wayne has noted, although programmers often have an idealized view of what "real engineers" do, when you compare what "real engineers" do with what programmers do, it's frequently not all that different. In particular, a common lament of programmers is that we're not held liable for our mistakes or poor designs, even in cases where that costs lives.

Although automotive companies can, in some cases, be held liable for unsafe designs, optimizing for a small set of benchmarks rather than for safety, which must have resulted in extra deaths, isn't something that engineers or corporations were, in general, held liable for.

Bonus: reputation

If I look at what people in my extended social circles think about vehicle safety, Tesla has the best reputation by far. If you look at broad-based consumer polls, that's a different story, and Volvo usually wins there, with other manufacturers fighting for a distant second.

I find the Tesla thing interesting since their responses are basically the opposite of what you'd expect from a company that was serious about safety. When serious problems have occurred (with respect to safety or otherwise), they often have a very quick response that's basically "everything is fine". I would expect an organization that's serious about safety or improvement to respond with "we're investigating", followed by a detailed postmortem explaining what went wrong, but that doesn't appear to be Tesla's style.

For example, on the driver-side small overlap test, Tesla had one model with a relevant score and it scored Acceptable (below Good, but above Poor and Marginal) even after modifications were made to improve the score. Tesla disputed the results, saying they make "the safest cars in history" and implying that IIHS should be ignored because they have ulterior motives, in favor of crash test scores from an agency that is objective and doesn't have ulterior motives, i.e., the agency that gave Tesla a good score:

While IIHS and dozens of other private industry groups around the world have methods and motivations that suit their own subjective purposes, the most objective and accurate independent testing of vehicle safety is currently done by the U.S. Government which found Model S and Model X to be the two cars with the lowest probability of injury of any cars that it has ever tested, making them the safest cars in history.

As we've seen, Tesla isn't unusual for optimizing for a specific set of crash tests and achieving a mediocre score when an unexpected type of crash occurs, but their response is unusual. However, it makes sense from a cynical PR perspective. As we've seen over the past few years, loudly proclaiming something, regardless of whether or not it's true, even when there's incontrovertible evidence that it's untrue, seems not only to work, but that kind of bombastic rhetoric also appears to attract superfans who will aggressively defend the brand. If you watch car reviewers on youtube, they'll sometimes mention that they get hate mail for reviewing Teslas just like they review any other car, and that they don't see anything like it for any other make.

Apple also used this playbook to good effect in the 90s and early '00s, when they were rapidly falling behind in performance and responded not by improving performance, but by running a series of ad campaigns saying that they had the best performance in the world and that they were shipping "supercomputers" on the desktop.

Another reputational quirk is that I know a decent number of people who believe that the safest cars they can buy are "American Cars from the 60's and 70's that aren't made of plastic". We don't have directly relevant small overlap crash test scores for old cars, but the test data we do have on old cars indicates that they fare extremely poorly in overall safety compared to modern cars. For a visually dramatic example, see this crash test of a 1959 Chevrolet Bel Air vs. a 2009 Chevrolet Malibu.

Appendix: methodology summary

The top-line results section uses scores for the driver-side small overlap test both because I think it's the test where it's most difficult to justify skimping on safety as measured by the test, and because it's been around long enough that we can see the impact of modifications to existing models and changes to subsequent models, which isn't true of the passenger-side small overlap test (where many models are still untested).

For the passenger-side small overlap test, someone might argue that the driver side is more important because you virtually always have a driver in a car accident and may or may not have a front passenger. Also, for small overlap collisions (which simulate a head-on collision where the vehicles overlap by only 25%), driver-side collisions are more likely than passenger-side collisions.

Except to check Volvo's scores, I didn't look at roof crash test scores (which were added in 2009). I'm not going to describe the roof test in detail, but for the roof test, someone might argue that the roof test score should be used in conjunction with scoring the car for rollover probability since the roof test just tests roof strength, which is only relevant when a car has rolled over. I think, given what the data show, this objection doesn't hold in many cases (the vehicles with the worst roof test scores are often vehicles that have relatively high rollover rates), but it does in some cases, which would complicate the analysis.

In most cases, we only get one reported test result for a model. However, there can be multiple versions of a model -- including before and after making safety changes intended to improve the test score. If changes were made to the model to improve safety, the test score is usually from after the changes were made and we usually don't get to see the score from before the model was changed. However, there are many exceptions to this, which are noted in the detailed results section.

For this post, scores only count if the model was introduced before or near when the new test was introduced, since models introduced later could have design changes that optimize for the test.

Appendix: detailed results

On each test, IIHS gives an overall rating (from worst to best) of Poor, Marginal, Acceptable, or Good. The tests have sub-scores, but we're not going to use those for this analysis. In each sub-section, we'll look at how many models got each score when the small overlap tests were added.

Volvo

All Volvo models examined scored Good (the highest possible score) on the new tests when they were added (roof, driver-side small overlap, and passenger-side small overlap). One model, the 2008-2017 XC60, had a change made to trigger its side curtain airbag during a small overlap collision in 2013. Other models were tested without modifications.

Mercedes

Of three pre-existing models with test results for driver-side small overlap, one scored Marginal without modifications and two scored Good after structural modifications. The model where we only have unmodified test scores (Mercedes C-Class) was fully re-designed after 2014, shortly after the driver-side small overlap test was introduced.

As mentioned above, we often only get to see public results for models without modifications to improve results xor with modifications to improve results, so, for the models that scored Good, we don't actually know how they would've scored if you bought a vehicle before Mercedes updated the design, but the Marginal score from the one unmodified model we have is a negative signal.

Also, when the passenger-side small overlap test was added, the Mercedes vehicles generally scored Good, indicating that Mercedes didn't only increase protection on the driver's side in order to improve test scores.

BMW

Of the two models where we have relevant test scores, both scored Marginal before modifications. In one of the cases, there's also a score after structural changes were made in the 2017 model (recall that the driver-side small overlap test was introduced in 2012) and the model scored Good afterwards. The other model was fully-redesigned after 2016.

For the five models where we have relevant passenger-side small overlap scores, all scored Good, indicating that the changes made to improve driver-side small overlap test scores weren't only made on the driver's side.

Honda

Of the five Honda models where we have relevant driver-side small overlap test scores, two scored Good, one scored Marginal, and two scored Poor. The model that scored Marginal had structural changes plus a seatbelt change in 2015 that changed its score to Good, other models weren't updated or don't have updated IIHS scores.

Of the six Honda models where we have passenger-side small overlap test scores, two scored Good without modifications, two scored Acceptable without modifications, and one scored Good with modifications to the bumper.

All of those models scored Good on the driver side small overlap test, indicating that when Honda increased the safety on the driver's side to score Good on the driver's side test, they didn't apply the same changes to the passenger side.

Toyota

Of the six Toyota models where we have relevant driver-side small overlap test scores for unmodified models, one scored Acceptable, four scored Marginal, and one scored Poor.

The model that scored Acceptable had structural changes made to improve its score to Good, but on the driver's side only. The model was later tested in the passenger-side small overlap test and scored Acceptable. Of the four models that scored Marginal, one had structural modifications made in 2017 that improved its score to Good and another had airbag and seatbelt changes that improved its score to Acceptable. The vehicle that scored Poor had structural changes made that improved its score to Acceptable in 2014, followed by later changes that improved its score to Good.

There are four additional models where we only have scores from after modifications were made. Of those, one scored Good, one scored Acceptable, one scored Marginal, and one scored Poor.

In general, changes appear to have been made to the driver's side only and, on introduction of the passenger side small overlap test, vehicles had passenger side small overlap scores that were the same as the driver's side score before modifications.

Ford

Of the two models with relevant driver-side small overlap test scores for unmodified models, one scored Marginal and one scored Poor. Both of those models were produced into 2019 and neither has an updated test result. Of the three models where we have relevant results for modified vehicles, two scored Acceptable and one scored Marginal. Also, one model was released the year the small overlap test was introduced and one the year after; both of those scored Acceptable. It's unclear if those should be considered modified or not since the design may have had last-minute changes before release.

We only have three relevant passenger-side small overlap tests. One is Good (for a model released in 2015) and the other two are Poor; these are the two models mentioned above as having scored Marginal and Poor, respectively, on the driver-side small overlap test. It appears that the models continued to be produced into 2019 without safety changes. Both of these unmodified models were trucks, and this isn't very unusual for a truck; it's one of a number of reasons that fatality rates are generally higher in trucks -- until recently, many of them were based on old platforms that hadn't been updated for a long time.

Chevrolet

Of the three Chevrolet models where we have relevant driver-side small overlap test scores before modifications, one scored Acceptable and two scored Marginal. One of the Marginal models had structural changes plus a change that caused side curtain airbags to deploy sooner in 2015, which improved its score to Good.

Of the four Chevrolet models where we only have relevant driver-side small overlap test scores after the model was modified (all had structural modifications), two scored Good and two scored Acceptable.

We only have one relevant score for the passenger-side small overlap test, that score is Marginal. That's on the model that was modified to improve its driver-side small overlap test score from Marginal to Good, indicating that the changes were made to improve the driver-side test score and not to improve passenger safety.

Subaru

We don't have any models where we have relevant driver-side small overlap test scores for models before they were modified.

One model had a change to cause its airbag to deploy during small overlap tests; it scored Acceptable. Two models had some kind of structural changes, one of which scored Good and one of which scored Acceptable.

The model that had airbag changes had structural changes made in 2015 that improved its score from Acceptable to Good.

For the one model where we have relevant passenger-side small overlap test scores, the score was Marginal. Also, for one of the models with structural changes, it was indicated that the changes included changes to the left part of the firewall, indicating that changes were made to improve the driver's side test score without improving safety for a passenger in a passenger-side small overlap crash.

Tesla

There's only one model with relevant results for the driver-side small overlap test. That model scored Acceptable before and after modifications were made to improve test scores.

Hyundai

Of the five vehicles where we have relevant driver-side small overlap test scores, one scored Acceptable, three scored Marginal, and one scored Poor. We don't have any indication that models were modified to improve their test scores.

Of the two vehicles where we have relevant passenger-side small overlap test scores for unmodified models, one scored Good and one scored Acceptable.

We also have one score for a model that had structural modifications to score Acceptable, which later had further modifications that allowed it to score Good. That model was introduced in 2017 and had a Good score on the driver-side small overlap test without modifications, indicating that it was designed to achieve a good test score on the driver's side test without similar consideration for a passenger-side impact.

Dodge

Of the five models where we have relevant driver-side small overlap test scores for unmodified models, two scored Acceptable, one scored Marginal, and two scored Poor. There are also two models where we have test scores after structural changes were made for safety in 2015; both of those models scored Marginal.

We don't have relevant passenger-side small overlap test scores for any model, but even if we did, the dismal scores on the modified models means that we might not be able to tell if similar changes were made to the passenger side.

Nissan

Of the seven models where we have relevant driver-side small overlap test scores for unmodified models, two scored Acceptable and five scored Poor.

We have one model that only has test scores for a modified model; the frontal airbags and seatbelts were modified in 2013 and the side curtain airbags were modified in 2017. The score after modifications was Marginal.

One of the models that scored Poor had structural changes made in 2015 that improved its score to Good.

Of the four models where we have relevant passenger-side small overlap test scores, two scored Good, one scored Acceptable (that model scored Good on the driver-side test), and one scored Marginal (that model also scored Marginal on the driver-side test).

Jeep

Of the two models where we have relevant driver-side small overlap test scores for unmodified models, one scored Marginal and one scored Poor.

There's one model where we only have test scores after modifications; that model had changes to its airbags and seatbelts and it scored Marginal after the changes. This model was also later tested on the passenger-side small overlap test and scored Poor.

One other model has a relevant passenger-side small overlap test score; it scored Good.

Volkswagen

The two models where we have relevant driver-side small overlap test scores for unmodified models both scored Marginal.

Of the two models where we only have scores after modifications, one was modified in 2013 and scored Marginal after modifications. It was then modified again in 2015 and scored Good. That model was later tested on the passenger-side small-overlap test, where it scored Acceptable, indicating that the modifications differentially favored the driver's side. The other scored Acceptable after changes made in 2015 and then scored Good after further changes made in 2016. The 2016 model was later tested on the passenger-side small overlap test and scored Marginal, once again indicating that changes differentially favored the driver's side.

We have passenger-side small overlap test scores for two other models, both of which scored Acceptable. These were models introduced in 2015 (well after the introduction of the driver-side small overlap test) and scored Good on the driver-side small overlap test.

2021 update

The IIHS has released the first set of results for their new "upgraded" side-impact tests. They've been making noises about doing this for quite a while and have mentioned that in real-world data on (some) bad crashes, they've observed intrusion into the cabin that's significantly greater than is seen in their tests. They've mentioned that some vehicles do relatively well on the new tests and some less well, but they hadn't released official scores until now.

The results in the new side-impact tests are different from the results described in the post above. So far, only small SUVs have had their results released and only the Mazda CX-5 has a result of "Good". Of the three manufacturers that did well on the tests described in this post, only Volvo has public results and they scored "Acceptable". Some questions I have are:

Appendix: miscellania

A number of name brand car makes weren't included. Some because their sales in the U.S. are relatively low and/or declining rapidly (Mitsubishi, Fiat, Alfa Romeo, etc.), some because there's very high overlap in what vehicles are tested (Kia, Mazda, Audi), and some because there aren't relevant models with driver-side small overlap test scores (Lexus). When a corporation owns an umbrella of makes, like FCA with Jeep, Dodge, Chrysler, Ram, etc., these weren't pooled since most people who aren't car nerds aren't going to recognize FCA, but may recognize Jeep, Dodge, and Chrysler.

If the terms of service of the API allowed you to use IIHS data however you wanted, I would've included smaller makes. But the API comes with very restrictive terms on how you can display or discuss the data, which aren't compatible with exploratory data analysis (I couldn't know how I would want to display or discuss the data before looking at it), so I pulled all of these results by hand (and didn't click through any EULAs, etc.). That was fairly time consuming, so there was a trade-off between more comprehensive coverage and the rest of my life.

Appendix: what car should I buy?

That depends on what you're looking for; there's no way to make a blanket recommendation. For practical information about particular vehicles, Alex on Autos is the best source that I know of. I don't generally like videos as a source of practical information, but car magazines tend to be much less informative than youtube car reviewers. There are car reviewers that are much more popular, but their popularity appears to come from having witty banter between charismatic co-hosts or other things that not only aren't directly related to providing information, they actually detract from providing information. If you just want to know how cars work, Engineering Explained is also quite good, but the information there generally isn't directly practical.

For reliability information, Consumer Reports is probably your best bet (you can also look at J.D. Power, but the way they aggregate information makes it much less useful to consumers).

Thanks to Leah Hanson, Travis Downs, Prabin Paudel, Jeshua Smith, and Justin Blank for comments/corrections/discussion


  1. this includes the 2004-2012 Volvo S40/V50, 2006-2013 Volvo C70, and 2007-2013 Volvo C30, which were designed during the period when Ford owned Volvo. Although the C1 platform was a joint venture between Ford, Volvo, and Mazda engineers, the work was done under a Ford VP at a Ford facility. [return]
  2. to be fair, as we saw with the IIHS small overlap tests, not every manufacturer did terribly. In 2017 and 2018, 8 vehicles sold in Africa were crash tested. One got what we would consider a mediocre to bad score in the U.S. or Europe, five got what we would consider to be a bad score, and "only" three got what we would consider to be an atrocious score. The Nissan NP300, Datsun Go, and Chery QQ3 were the three vehicles that scored the worst. Datsun is a sub-brand of Nissan and Chery is a Chinese brand, also known as Qirui.

    We see the same thing if we look at cars sold in India. Recently, some tests have been run on cars sent to the Indian market and a number of vehicles from Datsun, Renault, Chevrolet, Tata, Honda, Hyundai, Suzuki, Mahindra, and Volkswagen came in with atrocious scores that would be considered impossibly bad in the U.S. or Europe.

    [return]

Finding the Story

2020-06-02 15:05:34

This is an archive of an old pseudonymously written post from the 90s from someone whose former pseudonym seems to have disappeared from the internet.

I see that Star Trek: Voyager has added a new character, a Borg. (From the photos, I also see that they're still breeding women for breast size in the 24th century.) What ticked me off was the producer's comment (I'm paraphrasing), "The addition of Seven of Nine will give us limitless story possibilities."

Uh-huh. Riiiiiight.

Look, they didn't recognize the stories they had. I watched the first few episodes of Voyager and quit when my bullshit meter went off the scale. (Maybe that's not fair, to judge them by only a few episodes. But it's not fair to subject me to crap like the holographic lungs, either.)

For those of you who don't watch Star Trek: Voyager, the premise is that the Voyager, sort of a space corvette, gets transported umpteen zillions of light years from where it should be. It will take over seventy years at top speed for them to get home to their loved ones. For reasons we needn't go into here, the crew consists of a mix of loyal Federation members and rebels.

On paper, this looks good. There's an uneasy alliance in the crew, there's exploration as they try to get home, there's the whole "island in space" routine. And the Voyager is nowhere near as big as the Enterprise — it's not mentally healthy for people to stay aboard for that long.

But can this idea actually sustain a whole series? Would it be interesting to watch five years of "the crew bickers" or "they find a new clue to faster interstellar travel but it falls through"? I don't think so.

(And, in fact, the crew settled down awfully quickly.)

The demands of series television subvert the premise. The basic demand of series television is that our regular characters are people we come to know and to care about — we want them to come into our living rooms every week. We must care about their changes, their needs, their desires. We must worry when they're put in jeopardy. But we know it's a series, so it's hard to make us worry. We know that the characters will be back next week.

The demands of a story require someone to change of their own accord, to recognize some difference. The need to change can be imposed from without, but the actual change must be self-motivated. (This is the fundamental paradox of series television: the only character allowed to change is a guest, but the instrument of that change has to be a series regular, therefore depriving both characters of the chance to do something interesting.)

Series with strict continuity of episodes (episode 2 must follow episode 1) allow change — but they're harder to sell in syndication after the show goes off the air. Economics favour unchanging regular characters.

Some series — such as Hill Street Blues — get around the jeopardy problem by actually making characters disposable. Some characters show up for a few episodes and then die, reminding us that it could happen to the regulars, too. Sometimes it does happen to the regulars.

(When the characters change in the pilot, there may be a problem. A writer who was approached to work on Mary Tyler Moore's last series saw from the premise that it would be brilliant for six episodes and then would have no place to go. The first Fox series starring Tea Leoni, Flying Blind, had a very funny pilot and set up an untenable situation.)

I'm told the only interesting character on Voyager has been the doctor, who can change. He's the only character allowed to grow.

The first problem with Voyager, then, is that characters aren't allowed to change — or the change is imposed from outside. (By the way, an imposed change is a great way to start a story. The character then fights it, and that's interesting. It's a terrible way to end a story.)

The second problem is that they don't make use of the elements they have. Let's go back to the first season. There was an episode in which there's a traitor on board who is as smart as Janeway herself. (How psychiatric testing missed this, I don't know, but the Trek universe has never had really good luck with psychiatry.) After leading Janeway by the nose for fifty minutes, she figures out who it is, and confronts him. He says yes — and beams off the ship, having conveniently made a deal with the locals.

Perfect for series television. We've got a supposedly intelligent villain out there who could come back and Janeway's been given a run for her money — except that I felt cheated. Where's the story? Where's the resolution?

Here's what I think they should have done. It's not traditional series television, but I think it would have made for better stories.

First of all, the episode ends when Janeway confronts the bad guy and arrests him. He's put in the brig — and stays there. The viewer gets some sense of victory here.

But now there's someone as smart as Janeway in the brig. Suddenly we've set up Silence of the Lambs. (I don't mind stealing if I steal from good sources.) Whenever a problem is big enough, Janeway has this option: she can go to the brig and try and make a deal with the bad guy. "The ship dies, you die." Not only that, here's someone on board ship with whom she has a unique relationship — one not formally bounded by rank. What does the bad guy really want?

And whenever Janeway's feeling low, he can taunt her. "By the way, I thought of a way to get everyone home in one-tenth the time. Have you, Captain?"

You wouldn't put him in every episode. But any time you need that extra push, he's there. Remember, we can have him escape any time we want, through the same sleight used in the original episode.

Furthermore, it's one thing to catch him; it's another thing to keep him there. You can generate another entire episode out of an escape attempt by the prisoner. But that would be an intermediate thing. Let's talk about the finish I would have liked to have seen.

Let's invent a crisis. The balonium generator explodes; we're deep in warp space; our crack engineering crew has jury-rigged a repair to the sensors and found a Class M planet that might do for the repairs. Except it's just too far away. The margin is tight — but it can't be done. There are two too many people on board ship. Each requires a certain amount of food, air, water, etc. Under pressure, Neelix admits that his people can go into suspended animation, so he does. The doctor tries heroically but the engineer who was tending the balonium generator dies. (Hmmm. Power's low. The doctor can only be revived at certain critical moments.) Looks good — but they were using air until they died; one more crew member must die for the rest to live.

And somebody remembers the guy in the brig. "The question of his guilt," says Tuvok, "is resolved. The authority of the Captain is absolute. You are within your rights to hold a summary court martial and sentence him to death."

And Janeway says no. "The Federation doesn't do that."

Except that everyone will die if she doesn't. The pressure is on Janeway, now. Janeway being Janeway, she's looking for a technological fix. "Find an answer, dammit!" And the deadline is coming up. After a certain point, the prisoner has to die, along with someone else.

A crewmember volunteers to die (a regular). Before Janeway can accept, yet another (regular) crewmember volunteers, and Janeway is forced to decide. — And Tuvok points out that while morally it's defensible if that member volunteered to die, the ship cannot continue without either of those crewmembers. It can continue without the prisoner. Clearly the prisoner is not worth as much as those crewmembers, but she is the captain. She must make this decision.

Our fearless engineering crew thinks they might have a solution, but it will use nearly everything they've got, and they need another six hours to work out the feasibility. Someone in the crew tries to resolve the problem for her by offing the prisoner — the failure uses up more valuable power. Now the deadline moves up, to inside the six hours the engineers need. The engineering crew's idea is no longer feasible.

For his part, the prisoner is now bargaining. He says he's got ideas to help. Does he? He's tried to destroy the ship before. And he won't reveal them until he gets a full pardon.

(This is all basic plotting: keep piling on difficulties. Put a carrot in front of the characters, keep jerking it away.)

The tricky part is the ending. It's a requirement that the ending derive logically from what has gone before. If you're going to invoke a technological fix, you have to set the groundwork for it in the first half of the show. Otherwise it's technobabble. It's deus ex machina. (Any time someone says just after the last commercial break, "Of course! If we vorpalize the antibogon flow, we're okay!" I want to smack a writer in the head.)

Given the situation set up here, we have three possible endings:

My preferred ending is the third one, even though the prisoner need not die. The decision we've set up is a difficult one, and it is meaningful. It is a command decision. Whether she ends up killing the prisoner is not relevant; what is relevant is that she decides to do it.

John Gallishaw once categorized all stories as either stories of achievement or of decision. A decision story is much harder to write, because both choices have to matter.

A simple way to get more value from tracing

2020-05-31 15:06:34

A lot of people seem to think that distributed tracing isn't useful, or at least not without extreme effort that isn't worth it for companies smaller than FB. For example, here are a couple of public conversations that sound like a number of private conversations I've had. Sure, there's value somewhere, but it costs too much to unlock.

I think this overestimates how much work it is to get a lot of value from tracing. At Twitter, Rebecca Isaacs was able to lay out a vision for how to get value from tracing and executed on it (with help from a number of other folks, including Jonathan Simms, Yuri Vishnevsky, Ruben Oanta, Dave Rusek, Hamdi Allam, and many others1) such that the work easily paid for itself. This post is going to describe the tracing "infrastructure" we've built and describe some use cases where we've found it to be valuable. Before we get to that, let's start with some background about the situation before Rebecca's vision came to fruition.

At a high level, we could say that we had a trace-view oriented system and ran into all of the issues that one might expect from that. Those issues are discussed in more detail in this article by Cindy Sridharan. However, I'd like to discuss the particular issues we had in more detail since I think it's useful to look at what specific things were causing problems.

Taken together, the issues were problematic enough that tracing was underowned and arguably unowned for years. Some individuals did work in their spare time to keep the lights on or improve things, but the lack of obvious value from tracing led to a vicious cycle where the high barrier to getting value out of tracing made it hard to fund organizationally, which made it hard to make tracing more usable.

Some of the issues that made tracing low ROI included:

Schema

The schema was effectively a set of traces, where each trace was a set of spans and each span was a set of annotations. Each span that wasn't a root span had a pointer to its parent, so that the graph structure of a trace could be determined.

For the purposes of this post, we can think of each trace as either an external request including all sub-RPCs or a subset of a request, rooted downstream instead of at the top of the request. We also trace some things that aren't requests, like builds and git operations, but for simplicity we're going to ignore those for this post even though the techniques we'll discuss also apply to those.

Each span corresponds to an RPC and each annotation is data that a developer chose to record on a span (e.g., the size of the RPC payload, queue depth of various queues in the system at the time of the span, or GC pause time for GC pauses that interrupted the RPC).

Some issues that came out of having a schema that was a set of sets (of bags) included:

Aggregation

Until about a year and a half ago, the only supported way to look at traces was to go to the UI, filter by a service name from a combination search box + dropdown, and then look at a list of recent traces, where you could click on any trace to get a "trace view". Each search returned the N most recent results, which wouldn't necessarily be representative of all recent results (for reasons mentioned below in the Sampling section), let alone representative of all results over any other time span.

Per the problems discussed above in the schema section, since it was too expensive to run queries across a non-trivial number of traces, it was impossible to ask questions like "are any of the traces I'm looking at representative of common traces or am I looking at weird edge cases?" or "show me traces of specific tail events, e.g., when a request from service A to service B times out or when write amplification from service A to some backing database is > 3x", or even "only show me complete traces, i.e., traces where we haven't dropped spans from the trace".

Also, if you clicked on a trace that was "too large", the query would time out and you wouldn't be able to view the trace -- this was another common side effect of the lack of any kind of rate limiting logic plus the schema.

Sampling

There were multiple places where a decision was made to sample or not. There was no document that listed all of these places, making it impossible to even guess at the sampling rate without auditing all code to figure out where sampling decisions were being made.

Moreover, there were multiple places where an unintentional sampling decision would be made due to the implementation. Spans were sent from services that had tracing enabled to a local agent, then to a "collector" service, and then from the collector service to our backing DB. Spans could be dropped at any of these points: in the local agent; in the collector, which would have nodes fall over and lose all of their data regularly; and at the backing DB, which would reject writes due to hot keys or high load in general.

This design, where the trace id is the database key with no intervening logic to pace out writes, meant that a 1M span trace (which we have) would cause 1M writes to the same key over a period of a few seconds. Another problem was requests with a fanout of thousands (which exist at every tech company I've worked for), which could cause thousands of writes with the same key over a period of a few milliseconds.

Another sampling quirk was that, in order to avoid missing traces that didn't start at our internal front end, there was logic that caused an independent sampling decision in every RPC. If you do the math on this, if you have a service-oriented architecture like ours and you sample at what naively might sound like a moderately low rate, you'll end up with the vast majority of your spans starting at a leaf RPC, resulting in a single span trace. Of the non-leaf RPCs, the vast majority will start at the 2nd level from the leaf, and so on. The vast majority of our load and our storage costs were from these virtually useless traces that started at or near a leaf, and if you wanted to do any kind of analysis across spans to understand the behavior of the entire system, you'd have to account for this sampling bias on top of accounting for all of the other independent sampling decisions.

Time

There wasn't really any kind of adjustment for clock skew (there was something, but it attempted to do a local pairwise adjustment, which didn't really improve things and actually made it more difficult to reasonably account for clock skew).

If you just naively computed how long a span took, even using timestamps from a single host, which removes many sources of possible clock skew, you'd get a lot of negative duration spans, which is of course impossible because a result can't get returned before the request for the result is created. And if you compared times across different hosts, the results were even worse.
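Once trace data is queryable (as described in the solutions below), checking how common these impossible spans are is a short query. This is only a sketch; the table and column names (ltm_span, serverRecv, serverSend, etc.) are made up for illustration rather than being the real schema:

select serverService,
  count_if(serverSend < serverRecv) as impossible_spans, -- response timestamped before the request arrived
  count(*) as total_spans
from ltm_span
where ds = '2020-05-01'
group by serverService
order by impossible_spans desc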

Solutions

The solutions to these problems fall into what I think of as two buckets. For problems like dropped spans due to collector nodes falling over or the backing DB dropping requests, there's some straightforward engineering solution using well understood and widely used techniques. For that particular pair of problems, the short term bandaid was to do some GC tuning that reduced the rate of collector nodes falling over by about a factor of 100. That took all of two minutes, and then we replaced the collector nodes with a real queue that could absorb larger bursts in traffic and pace out writes to the DB. For the issue where we oversampled leaf-level spans due to rolling the sampling dice on every RPC, that's one of those little questions that most people would get right in an interview but that can get lost inside a larger system. It has a number of solutions; for example, since each span has a parent pointer, we must be able to tell whether or not an RPC has a parent at the point where the sampling decision is made, so we can make the sampling decision and create a trace id iff a span has no parent pointer. This results in a uniform probability of each span being sampled, with each sampled trace being a complete trace.

The other bucket is building up datasets and tools (and adding annotations) that allow users to answer questions they might have. This isn't a new idea; section 5 of the Dapper paper discussed this, and it was published in 2010.

Of course, one major difference is that Google has probably put at least two orders of magnitude more effort into building tools on top of Dapper than we've put into building tools on top of our tracing infra, so a lot of our tooling is much rougher, e.g., figure 6 from the Dapper paper shows a trace view that displays a set of relevant histograms, which makes it easy to understand the context of a trace. We haven't done the UI work for that yet, so the analogous view requires running a simple SQL query. While that's not hard, presenting the user with the data would be a better user experience than making the user query for the data.

Of the work that's been done, the simplest obviously high ROI thing we've done is build a set of tables that contain information people might want to query, structured such that common queries that don't inherently have to do a lot of work don't have to do a lot of work.

We have, partitioned by day, the following tables:

Just having this set of tables, queryable with SQL queries (or a Scalding or Spark job in cases where Presto SQL isn't ideal, like when doing some graph queries) is enough for tracing to pay for itself, to go from being difficult to justify to being something that's obviously high value.
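As one example of the kind of query this enables (e.g., the write amplification question mentioned in the Aggregation section above), here's a rough sketch. The table and column names (ltm_span, traceId, serverService, etc.) are made up for illustration and aren't the real schema, and this version just counts backing DB calls in traces that touch service A rather than doing a real graph traversal:

with per_trace as (
  select
    traceId,
    count_if(serverService = 'backing_db') as db_calls,       -- calls into the backing DB
    count_if(serverService = 'service_a') as service_a_calls  -- calls into service A
  from ltm_span
  where ds = '2020-05-30'
  group by traceId
)
select traceId, service_a_calls, db_calls,
  cast(db_calls as double) / service_a_calls as write_amplification
from per_trace
where service_a_calls > 0
  and db_calls > 3 * service_a_calls
order by write_amplification desc
limit 100

A real version of this would also want to handle incomplete traces, but even a rough query like this is enough to surface the worst offenders.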

Some of the questions we've been able to answer with this set of tables include:

We have built and are building other tooling, but just being able to run queries and aggregations against trace data, both recent and historical, easily pays for all of the other work we'd like to do. This is analogous to what we saw when we looked at metrics data: taking data we already had and exposing it in a way that lets people run arbitrary queries immediately paid dividends. Doing that for tracing is less straightforward than doing that for metrics because the data is richer, but it's not a fundamentally different idea.

I think that having something to look at other than the raw data is also more important for tracing than it is for metrics since the metrics equivalent of a raw "trace view" of traces, a "dashboard view" of metrics where you just look at graphs, is obviously and intuitively useful. If that's all you have for metrics, people aren't going to say that it's not worth funding your metrics infra because dashboards are really useful! However, it's a lot harder to see how to get value out of a raw view of traces, which is where a lot of the comments about tracing not being valuable come from. This difference between the complexity of metrics data and tracing data makes the value add for higher-level views of tracing larger than it is for metrics.

Having our data in a format that's not just blobs in a NoSQL DB has also allowed us to more easily build tooling on top of trace data that lets users who don't want to run SQL queries get value out of our trace data. An example of this is the Service Dependency Explorer (SDE), which was primarily built by Yuri Vishnevsky, Rebecca Isaacs, and Jonathan Simms, with help from Yihong Chen. If we try to look at the RPC call graph for a single request, we get something that's pretty large. In some cases, the depth of the call tree can be hundreds of levels deep and it's also not uncommon to see a fanout of 20 or more at some levels, which makes a naive visualization difficult to interpret.

In order to see how SDE works, let's look at a smaller example where it's relatively easy to understand what's going on. Imagine we have 8 services, A through H, and they call each other as shown in the tree below, where we have service A called 10 times, which calls service B a total of 10 times, which calls D, D, and E 50, 20, and 10 times respectively, where the two Ds are distinguished by being different RPC endpoints (calls) even though they're the same service, and so on:

Diagram of RPC call graph; this is implicitly described in the relevant sections, although the entire SDE section is showing off a visual tool and will probably be unsatisfying if you're just reading the alt text; the tables described in the previous section are more likely to be what you want if you want a non-visual interpretation of the data, the SDE is a kind of visualization

If we look at SDE from the standpoint of node E, we'll see the following: SDE centered on service E, showing callers and callees, direct and indirect

We can see the direct callers and callees: 100% of calls to E are from C, 100% of calls of E also call C, and we have 20x load amplification when calling C (200/10 = 20), the same as we see if we look at the RPC tree above. If we look at indirect callees, we can see that D has a 4x load amplification (40 / 10 = 4).

If we want to see what's directly called by C downstream of E, we can select it and we'll get arrows to the direct descendants of C, which in this case is every indirect callee of E.

SDE centered on service E, with callee C highlighted

For a more complicated example, we can look at service D, which shows up in orange in our original tree, above.

In this case, our summary box reads:

The fact that we see D three times in the tree is indicated in the summary box, where it says we have 3 unique call paths from our front end, TFE to D.

We can expand out the calls to D and, in this case, see both of the calls and what fraction of traffic is to each call.

SDE centered on service D, with different calls to D expanded by having clicked on D

If we click on one of the calls, we can see which nodes are upstream and downstream dependencies of a particular call, call4 is shown below and we can see that it never hits services C, H, and G downstream even though service D does for call3. Similarly, we can see that its upstream dependencies consist of being called directly by C, and indirectly by B and E but not A and C:

SDE centered on service D, with call4 of D highlighted by clicking on call 4; shows only upstream and downstream load that are relevant to call4

Some things we can easily see from SDE are:

These are all things a user could get out of queries to the data we store, but having a tool with a UI that lets you click around in real time to explore things lowers the barrier to finding these things out.

In the example shown above, there are a small number of services, so you could get similar information out of the more commonly used sea of nodes view, where each node is a service, with some annotations on the visualization, but when we've looked at real traces, showing thousands of services in a global view makes it very difficult to see what's going on. Some of Rebecca's early analyses used a view like that, but we've found that you need to have a lot of implicit knowledge to make good use of a view like that; a view that discards a lot more information and highlights a few things makes it easier for users who don't happen to have the right implicit knowledge to get value out of looking at traces.

Although we've demo'd a view of RPC count / load here, we could also display other things, like latency, errors, payload sizes, etc.

Conclusion

More generally, this is just a brief description of a few of the things we've built on top of the data you get if you have basic distributed tracing set up. You probably don't want to do exactly what we've done since you probably have somewhat different problems and you're very unlikely to encounter the exact set of problems that our tracing infra had. From backchannel chatter with folks at other companies, I don't think the level of problems we had was unique; if anything, our tracing infra was in a better state than at many or most peer companies (which excludes behemoths like FB/Google/Amazon) since it basically worked and people could and did use the trace view we had to debug real production issues. But, as they say, unhappy systems are unhappy in their own way.

Like our previous look at metrics analytics, this work was done incrementally. Since trace data is much richer than metrics data, a lot more time was spent doing ad hoc analyses of the data before writing the Scalding (MapReduce) jobs that produce the tables mentioned in this post, but the individual analyses were valuable enough that there wasn't really a time when this set of projects didn't pay for itself after the first few weeks it took to clean up some of the worst data quality issues and run an (extremely painful) ad hoc analysis with the existing infra.

Looking back at discussions on whether or not it makes sense to work on tracing infra, people often point to the numerous failures at various companies to justify a buy (instead of build) decision. I don't think that's exactly unreasonable, the base rate of failure of similar projects shouldn't be ignored. But, on the other hand, most of the work described wasn't super tricky, beyond getting organizational buy-in and having a clear picture of the value that tracing can bring.

One thing that's a bit beyond the scope of this post that probably deserves its own post is that tracing and metrics, while not fully orthogonal, are complementary, and having only one or the other leaves you blind to a lot of problems. You're going to pay a high cost for that in a variety of ways: unnecessary incidents, extra time spent debugging incidents, generally higher monetary costs due to running infra inefficiently, etc. Also, while metrics and tracing each individually give you much better visibility than having neither, some problems require looking at both together; some of the most interesting analyses I've done involve joining (often with a literal SQL join) trace data and metrics data.

To make it concrete, an example of something that's easy to see with tracing but annoying to see with logging unless you add logging to try to find this in particular (which you can do for any individual case, but probably don't want to do for the thousands of things tracing makes visible) is something we looked at above: "show me cases where a specific call path from the load balancer to A causes high load amplification on some service B, which may be multiple hops away from A in the call graph". In some cases, this will be apparent because A generally causes high load amplification on B, but if it only happens in some cases, that's still easy to handle with tracing but it's very annoying if you're just looking at metrics.

An example of something where you want to join tracing and metrics data is when looking at the performance impact of something like a bad host on latency. You will, in general, not be able to annotate the appropriate spans that pass through the host as bad because, if you knew the host was bad at the time of the span, the host wouldn't be in production. But you can sometimes find, with historical data, a set of hosts that are bad, and then look up latency critical paths that pass through the host to determine the end-to-end impact of the bad host.
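As a sketch of what that kind of join might look like, suppose we've assembled a table of hosts that we retroactively believe were bad over some period. The table and column names below (bad_host, ltm_span, hostId, durationMs) are made up for illustration, and a real analysis would look at critical paths rather than raw per-span durations:

select
  s.serverService,
  count(*) as spans_on_bad_hosts,
  approx_percentile(s.durationMs, 0.99) as p99_duration_ms
from ltm_span s
join bad_host b
  on s.ds = b.ds
  and s.hostId = b.hostId
where s.ds >= '2020-05-01' and s.ds <= '2020-05-07'
group by s.serverService
order by spans_on_bad_hosts desc

From there, it's straightforward to run the same query against known-good hosts over the same period to estimate the end-to-end impact of the bad hosts.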

Everyone has their own biases; with respect to tracing, mine come from generally working on things that try to directly improve cost, reliability, and latency, so the examples are focused on that, but there are also a lot of other uses for tracing. You can check out Distributed Tracing in Practice or Mastering Distributed Tracing for some other perspectives.

Acknowledgements

Thanks to Rebecca Isaacs, Leah Hanson, Yao Yue, and Yuri Vishnevsky for comments/corrections/discussion.


  1. this will almost certainly be an incomplete list, but some other people who've pitched in include Moses, Tiina, Rich, Rahul, Ben, Mike, Mary, Arash, Feng, Jenny, Andy, Yao, Yihong, Vinu, and myself.

    Note that this relatively long list of contributors doesn't contradict this work being high ROI. I'd estimate that there's been less than 2 person-years worth of work on everything discussed in this post. Just for example, while I spend a fair amount of time doing analyses that use the tracing infra, I think I've only spent on the order of one week on the infra itself.

    In case it's not obvious from the above, even though I'm writing this up, I was a pretty minor contributor to this. I'm just writing it up because I sat next to Rebecca as this work was being done and was super impressed by both her process and the outcome.

    [return]

A simple way to get more value from metrics

2020-05-30 15:06:34

We spent one day1 building a system that immediately found a mid 7 figure optimization (which ended up shipping). In the first year, we shipped mid 8 figures per year worth of cost savings as a result. The key feature this system introduces is the ability to query metrics data across all hosts and all services and over any period of time (since inception), so we've called it LongTermMetrics (LTM) internally since I like boring, descriptive, names.

This got started when I was looking for a starter project that would both help me understand the Twitter infra stack and also have some easily quantifiable value. Andy Wilcox suggested looking at JVM survivor space utilization for some large services. If you're not familiar with what survivor space is, you can think of it as a configurable, fixed-size buffer in the JVM (at least if you use the GC algorithm that's default at Twitter). At the time, if you looked at a random large service, you'd usually find that either:

  1. The buffer was too small, resulting in poor performance, sometimes catastrophically poor when under high load.
  2. The buffer was too large, resulting in wasted memory, i.e., wasted money.

But instead of looking at random services, there's no fundamental reason that we shouldn't be able to query all services and get a list of which services have room for improvement in their configuration, sorted by performance degradation or cost savings. And if we write that query for JVM survivor space, this also goes for other configuration parameters (e.g., other JVM parameters, CPU quota, memory quota, etc.). Writing a query that worked for all the services turned out to be a little more difficult than I was hoping due to a combination of data consistency and performance issues. Data consistency issues included things like:

Our metrics database, MetricsDB, was specialized to handle monitoring, dashboards, alerts, etc. and didn't support general queries. That's totally reasonable, since monitoring and dashboards are lower on Maslow's hierarchy of observability needs than general metrics analytics. In backchannel discussions with folks at other companies, the entire set of systems around MetricsDB seems to have solved a lot of the problems that plague people at other companies with similar scale, but the specialization meant that we couldn't run arbitrary SQL queries against metrics in MetricsDB.

Another way to query the data is to use the copy that gets written to HDFS in Parquet format, which allows people to run arbitrary SQL queries (as well as write Scalding (MapReduce) jobs that consume the data).

Unfortunately, due to the number of metric names, the data on HDFS can't be stored in a columnar format with one column per name -- Presto gets unhappy if you feed it too many columns and we have enough different metrics that we're well beyond that limit. If you don't use a columnar format (and don't apply any other tricks), you end up reading a lot of data for any non-trivial query. The result was that you couldn't run any non-trivial query (or even many trivial queries) across all services or all hosts without having it time out. We don't have similar timeouts for Scalding, but Scalding performance is much worse and a simple Scalding query against a day's worth of metrics will usually take between three and twenty hours, depending on cluster load, making it unreasonable to use Scalding for any kind of exploratory data analysis.

Given the data infrastructure that already existed, an easy way to solve both of these problems was to write a Scalding job to store the 0.1% to 0.01% of metrics data that we care about for performance or capacity related queries and re-write it into a columnar format. I would guess that at least 90% of metrics are things that almost no one will want to look at in almost any circumstance, and of the metrics anyone really cares about, the vast majority aren't performance related. A happy side effect of this is that since such a small fraction of the data is relevant, it's cheap to store it indefinitely. The standard metrics data dump is deleted after a few weeks because it's large enough that it would be prohibitively expensive to store it indefinitely; a longer metrics memory will be useful for capacity planning or other analyses that prefer to have historical data.

The data we're saving includes (but isn't limited to) the following things for each shard of each service:

And for each host:

For things that we know change very infrequently (like host NIC speed), we store these daily, but most of these are stored at the same frequency and granularity that our other metrics are stored at. In some cases, this is obviously wasteful (e.g., for JVM tenuring threshold, which is typically identical across every shard of a service and rarely changes), but this was the easiest way to handle this given the infra we have around metrics.

Although the impetus for this project was figuring out which services were under or over configured for JVM survivor space, it started with GC and container metrics since those were very obvious things to look at and we've been incrementally adding other metrics since then. To get an idea of the kinds of things we can query for and how simple queries are if you know a bit of SQL, here are some examples:

Very High p90 JVM Survivor Space

This is part of the original goal of finding under/over-provisioned services. Any service with a very high p90 JVM survivor space utilization is probably under-provisioned on survivor space. Similarly, anything with a very low p99 or p999 JVM survivor space utilization when under peak load is probably overprovisioned (query not displayed here, but we can scope the query to times of high load).

A Presto query for very high p90 survivor space across all services is:

with results as (
  select servicename,
    approx_distinct(source, 0.1) as approx_sources, -- number of shards for the service
    -- real query uses [coalesce and nullif](https://prestodb.io/docs/current/functions/conditional.html) to handle edge cases, omitted for brevity
    approx_percentile(jvmSurvivorUsed / jvmSurvivorMax, 0.90) as p90_used,
    approx_percentile(jvmSurvivorUsed / jvmSurvivorMax, 0.50) as p50_used
  from ltm_service 
  where ds >= '2020-02-01' and ds <= '2020-02-28'
  group by servicename)
select * from results
where approx_sources > 100
order by p90_used desc

Rather than having to look through a bunch of dashboards, we can just get a list and then send diffs with config changes to the appropriate teams or write a script that takes the output of the query and automatically writes the diff. The above query provides a pattern for any basic utilization numbers or rates; you could look at memory usage, new or old gen GC frequency, etc., with similar queries. In one case, we found a service that was wasting enough RAM to pay my salary for a decade.

I've been moving away from using thresholds against simple percentiles to find issues, but I'm presenting this query because this is a thing people commonly want to do that's useful and I can write it without having to spend a lot of space explaining why it's a reasonable thing to do; what I prefer to do instead is out of scope of this post and probably deserves its own post.

Network utilization

The above query was over all services, but we can also query across hosts. In addition, we can do queries that join against properties of the host, feature flags, etc.

Using one set of queries, we were able to determine that we had a significant number of services running up against network limits even though host-level network utilization was low. The compute platform team then did a gradual rollout of a change to network caps, which we monitored with queries like the one below to determine that we weren't seeing any performance degradation (theoretically possible if increasing network caps caused hosts or switches to hit network limits).

With the network change, we were able to observe smaller queue depths, smaller queue sizes (in bytes), fewer packet drops, etc.

The query below only shows queue depths for brevity; adding all of the quantities mentioned is just a matter of typing more names in.

The general thing we can do is, for any particular rollout of a platform or service-level feature, we can see the impact on real services.

with rolled as (
 select
   -- rollout was fixed for all hosts during the time period, can pick an arbitrary element from the time period
   arbitrary(element_at(misc, 'egress_rate_limit_increase')) as rollout,
   hostId
 from ltm_deploys
 where ds = '2019-10-10'
 and zone = 'foo'
 group by hostId
), host_info as(
 select
   arbitrary(nicSpeed) as nicSpeed,
   hostId
 from ltm_host
 where ds = '2019-10-10'
 and zone = 'foo'
 group by hostId
), host_rolled as (
 select
   rollout,
   nicSpeed,
   rolled.hostId
 from rolled
 join host_info on rolled.hostId = host_info.hostId
), container_metrics as (
 select
   service,
   netTxQlen,
   hostId
 from ltm_container
 where ds >= '2019-10-10' and ds <= '2019-10-14'
 and zone = 'foo'
)
select
 service,
 nicSpeed,
 approx_percentile(netTxQlen, 1, 0.999, 0.0001) as p999_qlen,
 approx_percentile(netTxQlen, 1, 0.99, 0.001) as p99_qlen,
 approx_percentile(netTxQlen, 0.9) as p90_qlen,
 approx_percentile(netTxQlen, 0.68) as p68_qlen,
 rollout,
 count(*) as cnt
from container_metrics
join host_rolled on host_rolled.hostId = container_metrics.hostId
group by service, nicSpeed, rollout

Other questions that became easy to answer

Design decisions

LTM is about as boring a system as is possible. Every design decision falls out of taking the path of least resistance.

Boring technology

I think writing about systems like this, which are just boring work, is really underrated. A disproportionate number of the posts and talks I read are about systems using hot technologies. I don't have anything against hot new technologies, but a lot of useful work comes from plugging boring technologies together and doing the obvious thing. Since posts and talks about boring work are relatively rare, I think writing up something like this is more useful than it has any right to be.

For example, a couple years ago, at a local meetup that Matt Singer organizes for companies in our size class to discuss infrastructure (basically, companies that are smaller than FB/Amazon/Google), I asked if anyone was doing something similar to what we'd just done. No one who was there was (or no one who'd admit to it, anyway), and engineers from two different companies expressed shock that we could store so much data, and not just the average per time period, but some histogram information as well. This work is too straightforward and obvious to be novel; I'm sure people have built analogous systems in many places. It's literally just storing metrics data on HDFS (or, if you prefer a more general term, a data lake) indefinitely in a format that allows interactive queries.

If you do the math on the cost of metrics data storage for a project like this in a company in our size class, the storage cost is basically a rounding error. We've shipped individual diffs that easily pay for the storage cost for decades. I don't think there's any reason storing a few years or even a decade worth of metrics should be shocking when people deploy analytics and observability tools that cost much more all the time. But it turns out this was surprising, in part because people don't write up work this boring.

An unrelated example is that, a while back, I ran into someone at a similarly sized company who wanted to get similar insights out of their metrics data. Instead of starting with something that would take a day, like this project, they started with deep learning. While I think there's value in applying ML and/or stats to infra metrics, they turned a project that could return significant value to the company after a couple of person-days into a project that took person-years. And if you're only going to either apply simple heuristics guided by someone with infra experience and simple statistical models or naively apply deep learning, I think the former has much higher ROI. Applying both sophisticated stats/ML and practitioner guided heuristics together can get you better results than either alone, but I think it makes a lot more sense to start with the simple project that takes a day to build out and maybe another day or two to start to apply than to start with a project that takes months or years to build out and start to apply. But there are a lot of biases towards doing the larger project: it makes a better resume item (deep learning!), in many places, it makes a better promo case, and people are more likely to give a talk or write up a blog post on the cool system that uses deep learning.

The above discusses why writing up work is valuable for the industry in general. We covered why writing up work is valuable to the company doing the write-up in a previous post, so I'm not going to re-hash that here.

Appendix: stuff I screwed up

I think it's unfortunate that you don't get to hear about the downsides of systems without backchannel chatter, so here are things I did that are pretty obvious mistakes in retrospect. I'll add to this when something else becomes obvious in retrospect.

These are the kind of thing you expect when you crank out something quickly and don't think it through enough. The last item is trivial to fix and not much of a problem since the ubiquitous use of IDEs at Twitter means that basically anyone who would be impacted will have their IDE supply the correct capitalization for them.

The first item is more problematic, both in that it could actually cause incorrect analyses and in that fixing it will require doing a migration of all the data we have. My guess is that, at this point, this will be half a week to a week of work, which I could've easily avoided by spending thirty more seconds thinking through what I was doing.

The second item is somewhere in between. Between the first and second items, I think I've probably signed up for roughly double the amount of direct work on this system (so, not including time spent on data analysis on data in the system, just the time spent to build the system) for essentially no benefit.

Thanks to Leah Hanson, Andy Wilcox, Lifan Zeng, and Matej Stuchlik for comments/corrections/discussion


  1. The actual work involved was about a day's work, but it was done over a week since I had to learn Scala as well as Scalding and the general Twitter stack, the metrics stack, etc.

    One day is also just an estimate for the work for the initial data sets. Since then, I've done probably a couple more weeks of work and Wesley Aptekar-Cassels and Kunal Trivedi have probably put in another week or two of time. The operational cost is probably something like 1-2 days of my time per month (on average), bringing the total cost to on the order of a month or two.

    I'm also not counting time spent using the dataset, or time spent debugging issues, which will include a lot of time that I can only roughly guess at, e.g., when the compute platform team changed the network egress limits as a result of some data analysis that took about an hour, that exposed a latent mesos bug that probably cost a day of Ilya Pronin's time; David Mackey has spent a fair amount of time tracking down weird issues where the data shows something odd is going on but we don't know what it is; etc. If you wanted to fully account for time spent on work that came out of some data analysis on the data sets discussed in the post, I suspect, between service-level teams plus platform-level teams like our JVM, OS, and HW teams, we're probably at roughly 1 person-year of time.

    But, because the initial work it took to create a working and useful system was a day plus time spent working on orientation material, and the system returned seven figures, it's been very easy to justify all of this additional time spent, which probably wouldn't have been the case if a year of up-front work was required. Most of the rest of the time isn't the kind of thing that's usually "charged" in roadmap reviews to creating a system (time spent by users, operational overhead), but perhaps the ongoing operational cost should be "charged" when creating the system (I don't think it makes sense to "charge" time spent by users to the system since the more useful a system is, the more time users will spend using it; that doesn't really seem like a cost).

    There's also been work to build tools on top of this; Kunal Trivedi has spent a fair amount of time building a layer on top of this to make the presentation more user friendly than SQL queries, which could arguably be charged to this project.

    [return]

How (some) good corporate engineering blogs are written

2020-03-11 08:00:00

I've been comparing notes with people who run corporate engineering blogs and one thing that I think is curious is that it's pretty common for my personal blog to get more traffic than the entire corp eng blog for a company with a nine to ten figure valuation and it's not uncommon for my blog to get an order of magnitude more traffic.

I think this is odd because tech companies in that class often have hundreds to thousands of employees. They're overwhelmingly likely to be better equipped to write a compelling blog than I am and companies get a lot more value from having a compelling blog than I do.

With respect to the former, employees of the company will have done more interesting engineering work, have more fun stories, and have more in-depth knowledge than any one person who has a personal blog. On the latter, my blog helps me with job searching and it helps companies hire. But I only need one job, so more exposure, at best, gets me a slightly better job, whereas all but one tech company I've worked for is desperate to hire and loses candidates to other companies all the time. Moreover, I'm not really competing against other candidates when I interview (even if we interview for the same job, if the company likes more than one of us, it will usually just make more jobs). The high-order bit on this blog with respect to job searching is whether or not the process can take significant non-interview feedback or if I'll fail the interview because they do a conventional interview and the marginal value of an additional post is probably very low with respect to that. On the other hand, companies compete relatively directly when recruiting, so being more compelling relative to another company has value to them; replicating the playbook Cloudflare or Segment has used with their engineering "brands" would be a significant recruiting advantage. The playbook isn't secret: these companies broadcast their output to the world and are generally happy to talk about their blogging process.

Despite the seemingly obvious benefits of having a "good" corp eng blog, most corp eng blogs are full of stuff engineers don't want to read. Vague, high-level fluff about how amazing everything is, content marketing, handwave-y posts about the new hotness (today, that might be using deep learning for inappropriate applications; ten years ago, that might have been using "big data" for inappropriate applications), etc.

To try to understand what companies with good corporate engineering blogs have in common, I interviewed folks at three different companies that have compelling corporate engineering blogs (Cloudflare, Heap, and Segment) as well as folks at three different companies that have lame corporate engineering blogs (which I'm not going to name).

At a high level, the compelling engineering blogs had processes that shared the following properties:

The less compelling engineering blogs had processes that shared the following properties:

One person at a company with a compelling blog noted that a downside of having only one approver and/or one primary approver is that if that person is busy, it can take weeks to get posts approved. That's fair; that's a downside of having centralized approval. However, when we compare to the alternative processes, at one company, people noted that it's typical for approvals to take three to six months and tail cases can take a year.

While a few weeks can seem like a long time for someone used to a fast moving company, people at slower moving companies would be ecstatic to have an approval process that only takes twice that long.

Here are the processes, as described to me, for the three companies I interviewed (presented in sha512sum order, which is coincidentally ordered by increasing size of company, from a couple hundred employees to nearly one thousand employees):

Heap

The first editing phase used to involve posting a draft to a slack channel where "everyone" would comment on the post. This was an unpleasant experience since "everyone" would make comments and a lot of revision would be required. This process was designed to avoid getting "too much" feedback.

Segment

Some changes that have been made include

Although there's legal and PR approval, Calvin noted "In general we try to keep it fairly lightweight. I see the bigger problem with blogging being a lack of posts or vague, high level content which isn't interesting rather than revealing too much."

Cloudflare

One thing to note is that this only applies to technical blog posts. Product announcements have a heavier process because they're tied to sales material, press releases, etc.

One thing I find interesting is that Marek interviewed at Cloudflare because of their blog (this 2013 blog post on their 4th generation servers caught his eye) and he's now both a key engineer for them as well as one of the main sources of compelling Cloudflare blog posts. At this point, the Cloudflare blog has generated at least a few more generations of folks who interviewed because they saw a blog post and now write compelling posts for the blog.

General comments

My opinion is that the natural state of a corp eng blog where people get a bit of feedback is a pretty interesting blog. There's a dearth of real, in-depth, technical writing, which makes any half decent, honest, public writing about technical work interesting.

In order to have a boring blog, the corporation has to actively stop engineers from putting interesting content out there. Unfortunately, it appears that the natural state of large corporations tends towards risk aversion and blocking people from writing, just in case it causes a legal or PR or other problem. Individual contributors (ICs) might have the opinion that it's ridiculous to block engineers from writing low-risk technical posts while, simultaneously, C-level execs and VPs regularly make public comments that turn into PR disasters, but ICs in large companies don't have the authority or don't feel like they have the authority to do something just because it makes sense. And none of the fourteen stakeholders who'd have to sign off on approving a streamlined process care about streamlining the process since that would be good for the company in a way that doesn't really impact them, not when that would mean seemingly taking responsibility for the risk a streamlined process would add, however small. An exec or a senior VP willing to take a risk can take responsibility for the fallout and, if they're interested in engineering recruiting or morale, they may see a reason to do so.

One comment I've often heard from people at more bureaucratic companies is something like "every company our size is like this", but that's not true. Cloudflare, a $6B company approaching 1k employees, is in the same size class as many other companies with a much more onerous blogging process. The corp eng blog situation seems similar to the situation with giving real interview feedback. interviewing.io claims that there's significant upside and very little downside to doing so. Some companies actually do give real feedback and the ones that do generally find that it gives them an easy advantage in recruiting with little downside, but the vast majority of companies don't do this and people at those companies will claim that it's impossible to give feedback since you'll get sued or the company will be "cancelled", even though this generally doesn't happen to companies that give feedback and there are even entire industries where it's common to give interview feedback. It's easy to handwave that some risk exists and very few people have the authority to dismiss vague handwaving about risk when it's coming from multiple orgs.

Although this is a small sample size and it's dangerous to generalize too much from small samples, the idea that you need high-level support to blast through bureaucracy is consistent with what I've seen in other areas where most large companies have a hard time doing something easy that has obvious but diffuse value. While this post happens to be about blogging, I've heard stories that are the same shape on a wide variety of topics.

Appendix: examples of compelling blog posts

Here are some blog posts from the blogs mentioned with a short comment on why I thought the post was compelling. This time, in reverse sha512 hash order.

Cloudflare

Segment

Heap

One thing to note is that these blogs all have different styles. Personally, I prefer the style of Cloudflare's blog, which has a higher proportion of "deep dive" technical posts, but different people will prefer different styles. There are a lot of styles that can work.

Thanks to Marek Majkowski, Kamal Marhubi, Calvin French-Owen, John Graham-Cumming, Michael Malis, Matthew Prince, Yuri Vishnevsky, Julia Evans, Wesley Aptekar-Cassels, Nathan Reed, Jake Seliger, an anonymous commenter, plus sources from the companies I didn't name for comments/corrections/discussion; none of the people explicitly mentioned in the acknowledgements were sources for information on the less compelling blogs.

The growth of command line options, 1979-Present

2020-03-03 08:00:00

My hobby: opening up McIlroy’s UNIX philosophy on one monitor while reading manpages on the other.

The first of McIlroy's dicta is often paraphrased as "do one thing and do it well", which is shortened from "Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new 'features.'"

McIlroy's example of this dictum is:

Surprising to outsiders is the fact that UNIX compilers produce no listings: printing can be done better and more flexibly by a separate program.

If you open up a manpage for ls on mac, you’ll see that it starts with

ls [-ABCFGHLOPRSTUW@abcdefghiklmnopqrstuwx1] [file ...]

That is, the one-letter flags to ls include every lowercase letter except for {jvyz}, 14 uppercase letters, plus @ and 1. That’s 22 + 14 + 2 = 38 single-character options alone.

On ubuntu 17, if you read the manpage for coreutils ls, you don’t get a nice summary of options, but you’ll see that ls has 58 options (including --help and --version).

To see if ls is an aberration or if it's normal to have commands that do this much stuff, we can look at some common commands, sorted by frequency of use.

command     1979  1996  2015  2017
ls            11    42    58    58
rm             3     7    11    12
mkdir          0     4     6     7
mv             0     9    13    14
cp             0    18    30    32
cat            1    12    12    12
pwd            0     2     4     4
chmod          0     6     9     9
echo           1     4     5     5
man            5    16    39    40
which                0     1     1
sudo                 0    23    25
tar           12    53   134   139
touch          1     9    11    11
clear                0     0     0
find          14    57    82    82
ln             0    11    15    16
ps             4    22    85    85
ping                12    12    29
kill           1     3     3     3
ifconfig            16    25    25
chown          0     6    15    15
grep          11    22    45    45
tail           1     7    12    13
df             0    10    17    18
top                  6    12    14

This table has the number of command line options for various commands for v7 Unix (1979), slackware 3.1 (1996), ubuntu 12 (2015), and ubuntu 17 (2017). Cells are left blank if no corresponding command was found on that system.

We can see that the number of command line options has dramatically increased over time; counts tend to increase going to the right (more options) and there are no cases where a command ends up with fewer options.

McIlroy has long decried the increase in the number of options, size, and general functionality of commands1:

Everything was small and my heart sinks for Linux when I see the size [inaudible]. The same utilities that used to fit in eight k[ilobytes] are a meg now. And the manual page, which used to really fit on, which used to really be a manual page, is now a small volume with a thousand options... We used to sit around in the UNIX room saying "what can we throw out? Why is there this option?" It's usually, it's often because there's some deficiency in the basic design — you didn't really hit the right design point. Instead of putting in an option, figure out why, what was forcing you to add that option. This viewpoint, which was imposed partly because there was very small hardware ... has been lost and we're not better off for it.

Ironically, one of the reasons for the rise in the number of command line options is another McIlroy dictum, "Write programs to handle text streams, because that is a universal interface" (see ls for one example of this).

If structured data or objects were passed around, formatting could be left to a final formatting pass. But, with plain text, the formatting and the content are intermingled; because formatting can only be done by parsing the content out, it's common for commands to add formatting options for convenience. Alternately, formatting can be done when the user leverages their knowledge of the structure of the data and encodes that knowledge into arguments to cut, awk, sed, etc. (also using their knowledge of how those programs handle formatting; it's different for different programs and the user is expected to, for example, know how cut -f4 is different from awk '{ print $4 }'2). That's a lot more hassle than passing in one or two arguments to the last command in a sequence and it pushes the complexity from the tool to the user.
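
To make that concrete, here's a minimal illustration of the kind of formatting knowledge the user is expected to carry around (the behavior shown is the standard default behavior of cut and awk; the input line is just a placeholder):

      printf 'a b  c d e\n' | cut -f4              # cut defaults to tab delimiters and passes
                                                   # through lines with no delimiter, so this
                                                   # prints the whole line
      printf 'a b  c d e\n' | cut -d' ' -f4        # with a space delimiter, the doubled space
                                                   # produces an empty field, so this prints "c"
      printf 'a b  c d e\n' | awk '{ print $4 }'   # awk collapses runs of whitespace,
                                                   # so this prints "d"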

People sometimes say that they don't want to support structured data because they'd have to support multiple formats to make a universal tool, but they already have to support multiple formats to make a universal tool. Some standard commands can't read output from other commands because they use different formats, wc -w doesn't handle Unicode correctly, etc. Saying that "text" is a universal format is like saying that "binary" is a universal format.

I've heard people say that there isn't really any alternative to this kind of complexity for command line tools, but people who say that have never really tried the alternative, something like PowerShell. I have plenty of complaints about PowerShell, but passing structured data around and easily being able to operate on structured data without having to hold metadata information in my head so that I can pass the appropriate metadata to the right command line tools at the right places in the pipeline isn't among my complaints3.

The sleight of hand that's happening when someone says that we can keep software simple and compatible by making everything handle text is the pretense that text data doesn't have a structure that needs to be parsed4. In some cases, we can just think of everything as a single space separated line, or maybe a table with some row and column separators that we specify (with some behavior that isn't consistent across tools, of course). That adds some hassle when it works, and then there are the cases where serializing data to a flat text format adds considerable complexity since the structure of data means that simple flattening requires significant parsing work to re-ingest the data in a meaningful way.

Another reason commands now have more options is that people have added convenience flags for functionality that could have been done by cobbling together a series of commands. These go all the way back to v7 unix, where ls has an option to reverse the sort order (which could have been done by passing the output to something like tac had they written tac instead of adding a special-case reverse option).
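
A minimal sketch of the two approaches (tac itself is a GNU tool; on BSDs, tail -r is the closest equivalent):

      ls -r       # the built-in convenience flag: reverse the sort order
      ls | tac    # roughly the same result by composition, since ls prints one
                  # entry per line when its output goes to a pipe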

Over time, more convenience options have been added. For example, to pick a command that originally had zero options, mv can move and create a backup (three options; two are different ways to specify a backup, one of which takes an argument and the other of which takes zero explicit arguments and reads an implicit argument from the VERSION_CONTROL environment variable; one option allows overriding the default backup suffix). mv now also has options to never overwrite and to only overwrite if the file is newer.
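
With GNU coreutils mv (the flag names below are the GNU ones; other implementations differ, and src/dst are just placeholder file names), those options look like this:

      mv -b src dst                   # back up dst; the backup method is read from the
                                      # VERSION_CONTROL environment variable
      mv --backup=numbered src dst    # same idea, but the method is an explicit argument
      mv -b -S .orig src dst          # override the default backup suffix (~)
      mv -n src dst                   # never overwrite an existing dst
      mv -u src dst                   # only overwrite if src is newer than dst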

mkdir is another program that used to have no options where, excluding security things for SELinux or SMACK as well as help and version options, the added options are convenience flags: setting the permissions of the new directory and making parent directories if they don't exist.
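
For example (these two flags exist in both GNU mkdir and POSIX; the directory names are placeholders):

      mkdir -m 700 private    # set the permissions of the new directory
      mkdir -p a/b/c          # make parent directories as needed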

If we look at tail, which originally had one option (-number, telling tail where to start), it's added both formatting and convenience options. For formatting, it has -z, which makes the line delimiter null instead of a newline. Some examples of convenience options are -f to print when there are new changes, -s to set the sleep interval between checking for -f changes, and --retry to retry if the file isn't accessible.
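
Concretely, with GNU coreutils tail (log is a placeholder file name):

      tail -n 20 log         # the original functionality: where to start printing
      tail -f log            # keep printing as new data is appended
      tail -f -s 5 log       # with -f, sleep 5 seconds between checks
      tail -f --retry log    # keep retrying if the file isn't accessible yet
      tail -z log            # use NUL instead of newline as the line delimiter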

McIlroy says "we're not better off" for having added all of these options but I'm better off. I've never used some of the options we've discussed and only rarely use others, but that's the beauty of command line options — unlike with a GUI, adding these options doesn't clutter up the interface. The manpage can get cluttered, but in the age of google and stackoverflow, I suspect many people just search for a solution to what they're trying to do without reading the manpage anyway.

This isn't to say there's no cost to adding options — more options means more maintenance burden, but that's a cost that maintainers pay to benefit users, which isn't obviously unreasonable considering the ratio of maintainers to users. This is analogous to Gary Bernhardt's comment that it's reasonable to practice a talk fifty times since, if there's a three hundred person audience, the ratio of time spent practicing to time spent watching the talk will still only be 1:6. In general, this ratio will be even more extreme with commonly used command line tools.

Someone might argue that all these extra options create a burden for users. That's not exactly wrong, but that complexity burden was always going to be there, it's just a question of where the burden was going to lie. If you think of the set of command line tools along with a shell as forming a language, a language where anyone can write a new method and it effectively gets added to the standard library if it becomes popular, where standards are defined by dicta like "write programs to handle text streams, because that is a universal interface", the language was always going to turn into a write-only incoherent mess when taken as a whole. At least with tools that bundle up more functionality and options than is UNIX-y, users can replace a gigantic set of wildly inconsistent tools with a merely large set of tools that, while inconsistent with each other, may have some internal consistency.

McIlroy implies that the problem is that people didn't think hard enough, that the old school UNIX mavens would have sat down in the same room and thought longer and harder until they came up with a set of consistent tools that have "unusual simplicity". But that was never going to scale; the philosophy made the mess we're in inevitable. It's not a matter of not thinking longer or harder; it's a matter of having a philosophy that cannot scale unless you have a relatively small team with a shared cultural understanding, able to sit down in the same room.

Many of the main long-term UNIX anti-features and anti-patterns that we're still stuck with today, fifty years later, come from the "we should all act like we're in the same room" design philosophy, which is the opposite of the approach you want if you want to create nice, usable, general, interfaces that can adapt to problems that the original designers didn't think of. For example, it's a common complaint that modern shells and terminals lack a bunch of obvious features that anyone designing a modern interface would want. When you talk to people who've written a new shell and a new terminal with modern principles in mind, like Jesse Luehrs, they'll note that a major problem is that the UNIX model doesn't have a good separation of interface and implementation, which works ok if you're going to write a terminal that acts in the same way that a terminal that was created fifty years ago acts, but is immediately and obviously problematic if you want to build a modern terminal. That design philosophy works fine if everyone's in the same room and the system doesn't need to scale up the number of contributors or over time, but that's simply not the world we live in today.

If anyone can write a tool and the main instruction comes from "the unix philosophy", people will have different opinions about what "simplicity" or "doing one thing"5 means, what the right way to do things is, and inconsistency will bloom, resulting in the kind of complexity you get when dealing with a wildly inconsistent language, like PHP. People make fun of PHP and javascript for having all sorts of warts and weird inconsistencies, but as a language and a standard library, any commonly used shell plus the collection of widely used *nix tools taken together is much worse and contains much more accidental complexity due to inconsistency even within a single Linux distro and there's no other way it could have turned out. If you compare across Linux distros, BSDs, Solaris, AIX, etc., the amount of accidental complexity that users have to hold in their heads when switching systems dwarfs PHP or javascript's incoherence. The most widely mocked programming languages are paragons of great design by comparison.

To be clear, I'm not saying that I or anyone else could have done better with the knowledge available in the 70s in terms of making a system that was practically useful at the time that would be elegant today. It's easy to look back and find issues with the benefit of hindsight. What I disagree with are comments from Unix mavens speaking today; comments like McIlroy's, which imply that we just forgot or don't understand the value of simplicity, or Ken Thompson saying that C is as safe a language as any and if we don't want bugs we should just write bug-free code. These kinds of comments imply that there's not much to learn from hindsight; in the 70s, we were building systems as effectively as anyone can today; five decades of collective experience, tens of millions of person-years, have taught us nothing; if we just go back to building systems like the original Unix mavens did, all will be well. I respectfully disagree.

Appendix: memory

Although addressing McIlroy's complaints about binary size bloat is a bit out of scope for this, I will note that, in 2017, I bought a Chromebook that had 16GB of RAM for $300. A 1 meg binary might have been a serious problem in 1979, when a standard Apple II had 4KB. An Apple II cost $1298 in 1979 dollars, or $4612 in 2020 dollars. You can get a low end Chromebook that costs less than 1/15th as much which has four million times more memory. Complaining that memory usage grew by a factor of one thousand when a (portable!) machine that's more than an order of magnitude cheaper has four million times more memory seems a bit ridiculous.

I prefer slimmer software, which is why I optimized my home page down to two packets (it would be a single packet if my CDN served high-level brotli), but that's purely an aesthetic preference, something I do for fun. The bottleneck for command line tools isn't memory usage and spending time optimizing the memory footprint of a tool that takes one meg is like getting a homepage down to a single packet. Perhaps a fun hobby, but not something that anyone should prescribe.

Methodology for table

Command frequencies were sourced from public command history files on github, not necessarily representative of your personal usage. Only "simple" commands were kept, which ruled out things like curl, git, gcc (which has > 1000 options), and wget. What's considered simple is arbitrary. Shell builtins, like cd, weren't included.

Repeated options aren't counted as separate options. For example, git blame -C, git blame -C -C, and git blame -C -C -C have different behavior, but these would all be counted as a single argument even though -C -C is effectively a different argument from -C.
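
For reference, here's what that escalation looks like (file.c is a placeholder; the descriptions are paraphrased from my reading of the git-blame documentation):

      git blame -C file.c          # also detect lines copied from other files modified
                                   # in the same commit
      git blame -C -C file.c       # ...and from the commit that created the file
      git blame -C -C -C file.c    # ...and from any commit at all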

The table counts sub-options as a single option. For example, ls has the following:

--format=WORD across -x, commas -m, horizontal -x, long -l, single-column -1, verbose -l, vertical -C

Even though there are seven format options, this is considered to be only one option.
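
For example, with GNU ls, each of these selects one of those sub-options:

      ls --format=commas    # same as ls -m
      ls --format=long      # same as ls -l
      ls --format=across    # same as ls -x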

Options that are explicitly listed as not doing anything are still counted as options; e.g., ls -g, whose description reads "Ignored; for Unix compatibility.", is counted as an option.

Multiple versions of the same option are also considered to be one option. For example, with ls, -A and --almost-all are counted as a single option.

In cases where the manpage says an option is supposed to exist, but doesn't, the option isn't counted. For example, the v7 mv manpage says

BUGS

If file1 and file2 lie on different file systems, mv must copy the file and delete the original. In this case the owner name becomes that of the copying process and any linking relationship with other files is lost.

Mv should take -f flag, like rm, to suppress the question if the target exists and is not writable.

-f isn't counted as a flag in the table because the option doesn't actually exist.

The latest year in the table is 2017 because I wrote the first draft for this post in 2017 and didn't get around to cleaning it up until 2020.

mjd on the Unix philosophy, with an aside into the mess of /usr/bin/time vs. built-in time.

mjd making a joke about the proliferation of command line options in 1991.

On HN:

p1mrx:

It's strange that ls has grown to 58 options, but still can't output \0-terminated filenames

As an exercise, try to sort a directory by size or date, and pass the result to xargs, while supporting any valid filename. I eventually just gave up and made my script ignore any filenames containing \n.

whelming_wave:

Here you go: sort all files in the current directory by modification time, whitespace-in-filenames-safe. The printf (od -> sed) construction converts back out of null-separated characters into newline-separated, though feel free to replace that with anything accepting null-separated input. Granted, sort --zero-terminated is a GNU extension and kinda cheating, but it's even available on macOS so it's probably fine.

      printf '%b' $(
        find . -maxdepth 1 -exec sh -c '
          printf '\''%s %s\0'\'' "$(stat -f '\''%m'\'' "$1")" "$1"
        ' sh {} \; | \
        sort --zero-terminated | \
        od -v -b | \
        sed 's/^[^ ]*//
      s/ *$//
      s/  */ \\/g
      s/\\000/\\012/g')

If you're running this under zsh, you'll need to prefix it with `command' to use the system executable: zsh's builtin printf doesn't support printing octal escape codes for normally printable characters, and you may have to assign the output to a variable and explicitly word-split it.

This is all POSIX as far as I know, except for the sort.
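
As an aside, if you're willing to lean on GNU extensions throughout (GNU find's -printf plus the NUL-delimiter flags in sort, cut, and xargs), a shorter sketch of the same idea is:

      # oldest first; %T@ is mtime in seconds since the epoch, so the sort is numeric;
      # the final printf is just for display; replace it with whatever should consume the list
      find . -maxdepth 1 -printf '%T@\t%p\0' |
        sort -z -n |
        cut -z -f2- |
        xargs -0 printf '%s\n'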

The Unix haters handbook.

Why create a new shell?

Thanks to Leah Hanson, Jesse Luehrs, Hillel Wayne, Wesley Aptekar-Cassels, Mark Jason Dominus, Travis Downs, and Yuri Vishnevsky for comments/corrections/discussion.


  1. This quote is slightly different than the version I've seen everywhere because I watched the source video. AFAICT, every copy of this quote that's on the internet (indexed by Bing, DuckDuckGo, or Google) is a copy of one person's transcription of the quote. There's some ambiguity because the audio is low quality and I hear something a bit different than whoever transcribed that quote heard. [return]
  2. Another example of something where the user absorbs the complexity because different commands handle formatting differently is time formatting — the shell builtin time is, of course, inconsistent with /usr/bin/time and the user is expected to know this and know how to handle it. [return]
  3. Just for example, you can use ConvertTo-Json or ConvertTo-CSV on any object, you can use cmdlets to change how properties are displayed for objects, and you can write formatting configuration files that define how you prefer things to be formatted.

    Another way to look at this is through the lens of Conway's law. If we have a set of command line tools that are built by different people, often not organizationally connected, the tools are going to be wildly inconsistent unless someone can define a standard and get people to adopt it. This actually works relatively well on Windows, and not just in PowerShell.

    A common complaint about Microsoft is that they've created massive API churn, often for non-technical organizational reasons (e.g., a Sinofsky power play, like the one described in the replies to the now-deleted Tweet at https://twitter.com/stevesi/status/733654590034300929). It's true. Even so, from the standpoint of a naive user, off-the-shelf Windows software is generally a lot better at passing non-textual data around than *nix. One thing this falls out of is Windows's embracing of non-textual data, which goes back at least to COM in 1993 (and arguably OLE and DDE, released in 1990 and 1987, respectively).

    For example, if you copy from Foo, which supports binary formats A and B, into Bar, which supports formats B and C and you then copy from Bar into Baz, which supports C and D, this will work even though Foo and Baz have no commonly supported formats. 

    When you cut/copy something, the application basically "tells" the clipboard what formats it could provide data in. When you paste into the application, the destination application can request the data in any of the formats in which it's available. If the data is already in the clipboard, "Windows" provides it. If it isn't, Windows gets the data from the source application, gives it to the destination application, and saves a copy for some length of time. If you "cut" from Excel, it will tell you that it has the data available in many tens of formats. This kind of system is pretty good for compatibility, although it definitely isn't simple or minimal.

    In addition to nicely supporting many different formats and doing so for long enough that a lot of software plays nicely with this, Windows also generally has nicer clipboard support out of the box.

    Let's say you copy and then paste a small amount of text. Most of the time, this will work like you'd expect on both Windows and Linux. But now let's say you copy some text, close the program you copied from, and then paste it. A mental model that a lot of people have is that when they copy, the data is stored in the clipboard, not in the program being copied from. On Windows, software is typically written to conform to this expectation (although, technically, users of the clipboard API don't have to do this). This is less common on Linux with X, where the correct mental model for most software is that copying stores a pointer to the data, which is still owned by the program the data was copied from, which means that paste won't work if the program is closed. When I've (informally) surveyed programmers, they're usually surprised by this if they haven't actually done copy+paste related work for an application. When I've surveyed non-programmers, they tend to find the behavior to be confusing as well as surprising.

    The downside of having the OS effectively own the contents of the clipboard is that it's expensive to copy large amounts of data. Let's say you copy a really large amount of text, many gigabytes, or some complex object and then never paste it. You don't really want to copy that data from your program into the OS so that it can be available. Windows also handles this reasonably: applications can provide data only on request when that's deemed advantageous. In the case mentioned above, when someone closes the program, the program can decide whether or not it should push that data into the clipboard or discard it. In that circumstance, a lot of software (e.g., Excel) will prompt to "keep" the data in the clipboard or discard it, which is pretty reasonable.

    It's not impossible to support some of this on Linux. For example, the ClipboardManager spec describes a persistence mechanism and GNOME applications generally kind of sort of support it (although there are some bugs) but the situation on *nix is really different from the more pervasive support Windows applications tend to have for nice clipboard behavior.

    [return]
  4. Another example of this is tools that are available on top of modern compilers. If we go back and look at McIlroy's canonical example, how proper UNIX compilers are so specialized that listings are a separate tool, we can see that this has changed even if there's still a separate tool you can use for listings. Some commonly used Linux compilers have literally thousands of options and do many things. For example, one of the many things clang now does is static analysis. As of this writing, there are 79 normal static analysis checks and 44 experimental checks. If these were separate commands (perhaps individual commands or perhaps a static_analysis command), they'd still rely on the same underlying compiler infrastructure and impose the same maintenance burden — it's not really reasonable to have these static analysis tools operate on plain text and reimplement the entire compiler toolchain necessary to get to the point where they can do static analysis. They could be separate commands instead of bundled into clang, but they'd still take a dependency on the same machinery that's used for the compiler and either impose a maintenance and complexity burden on the compiler (which has to support non-breaking interfaces for the tools built on top) or they'd break all the time.

    "Just make everything text so that it's simple" makes for a nice soundbite, but in reality the textual representation of the data is often not what you want if you want to do actually useful work.

    And on clang in particular, whether you make it a monolithic command or thousands of smaller commands, clang simply does more than any compiler that existed in 1979 or even all compilers that existed in 1979 combined. It's easy to say that things were simpler in 1979 and that us modern programmers have lost our way. It's harder to propose a design that's actually much simpler and could really get adopted. It's impossible that such a design could maintain all of the existing functionality and configurability and be as simple as something from 1979.

    [return]
  5. Since its inception, curl has gone from supporting 3 protocols to 40. Does that mean it does 40 things and it would be more "UNIX-y" to split it up into 40 separate commands? Depends on who you ask. If each protocol were its own command, created and maintained by a different person, we'd be in the same situation we are with other commands. Inconsistent command line options, inconsistent output formats despite it all being text streams, etc. Would that be closer to the simplicity McIlroy advocates for? Depends on who you ask. [return]

Suspicious discontinuities

2020-02-18 08:00:00

If you read any personal finance forums late last year, there's a decent chance you ran across a question from someone who was desperately trying to lose money before the end of the year. There are a number of ways someone could do this; one commonly suggested scheme was to buy put options that were expected to expire worthless, allowing the buyer to (probably) take a loss.

One reason people were looking for ways to lose money was that, in the U.S., there's a hard income cutoff for a health insurance subsidy at $48,560 for individuals (higher for larger households; $100,400 for a family of four). There are a number of factors that can cause the details to vary (age, location, household size, type of plan), but across all circumstances, it wouldn't have been uncommon for an individual going from one side of the cut-off to the other to have their health insurance cost increase by roughly $7200/yr. That means an individual buying ACA insurance who was going to earn $55k would be better off reducing their income by $6440 to get under the $48,560 subsidy ceiling than they would be earning $55k, since giving up $6440 in income saves roughly $7200/yr in insurance costs.

Although that's an unusually severe example, U.S. tax policy is full of discontinuities that disincentivize increasing earnings and, in some cases, actually incentivize decreasing earnings. Some other discontinuities are the TANF income limit, the Medicaid income limit, the CHIP income limit for free coverage, and the CHIP income limit for reduced-cost coverage. These vary by location and circumstance; the TANF and Medicaid income limits fall into ranges generally considered to be "low income" and the CHIP limits fall into ranges generally considered to be "middle class". These subsidy discontinuities have the same impact as the ACA subsidy discontinuity -- at certain income levels, people are incentivized to lose money.

Anyone may arrange his affairs so that his taxes shall be as low as possible; he is not bound to choose that pattern which best pays the treasury. There is not even a patriotic duty to increase one's taxes. Over and over again the Courts have said that there is nothing sinister in so arranging affairs as to keep taxes as low as possible. Everyone does it, rich and poor alike and all do right, for nobody owes any public duty to pay more than the law demands.

If you agree with the famous Learned Hand quote above, then losing money in order to reduce your effective tax rate and increase your disposable income is completely legitimate behavior at the individual level. However, a tax system that encourages people to lose money, perhaps by funneling it to (on average) much wealthier options traders by buying put options, seems sub-optimal.

A simple fix for the problems mentioned above would be to have slow phase-outs instead of sharp thresholds. Slow phase-outs are actually done for some subsidies and, while that can also have problems, they are typically less problematic than introducing a sharp discontinuity in tax/subsidy policy.

In this post, we'll look at a variety of discontinuities.

Hardware or software queues

A naive queue has discontinuous behavior. If the queue is full, new entries are dropped. If the queue isn't full, new entries are not dropped. Depending on your goals, this can often have impacts that are non-ideal. For example, in networking, a naive queue might be considered "unfair" to bursty workloads that have low overall bandwidth utilization because workloads that have low bandwidth utilization "shouldn't" suffer more drops than workloads that are less bursty but use more bandwidth (this is also arguably not unfair, depending on what your goals are).

A class of solutions to this problem is random early drop and its variants, which give incoming items a probability of being dropped that's determined by queue fullness (and possibly other factors), smoothing out the discontinuity and mitigating issues caused by having a discontinuous probability of queue drops.

This post on voting in link aggregators is fundamentally the same idea although, in some sense, the polarity is reversed. There's a very sharp discontinuity in how much traffic something gets based on whether or not it's on the front page. You could view this as a link getting dropped from a queue if it only receives N-1 votes and not getting dropped if it receives N votes.

College admissions and Pell Grant recipients

Pell Grants started getting used as a proxy for how serious schools are about helping/admitting low-income students. The first-order impact was that students above the Pell Grant threshold had a significantly reduced probability of being admitted while students below the Pell Grant threshold had a significantly higher chance of being admitted. Phrased that way, it sounds like things are working as intended.

However, when we look at what happens within each group, we see outcomes that are the opposite of what we'd want if the goal is to benefit students from low income families. Among people who don't qualify for a Pell Grant, it's those with the lowest income who are the most severely impacted and have the most severely reduced probability of admission. Among people who do qualify, it's those with the highest income who are most likely to benefit, again the opposite of what you'd probably want if your goal is to benefit students from low income families.

We can see these in the graphs below, which are histograms of parental income among students at two universities in 2008 (first graph) and 2016 (second graph), where the red line indicates the Pell Grant threshold.

Histogram of income distribution of students at two universities in 2008; high incomes are highly overrepresented relative to the general population, but the distribution is smooth

Histogram of income distribution of students at two universities in 2016; high incomes are still highly overrepresented, there's also a sharp discontinuity at the Pell grant threshold; plot looks roughly two upwards sloping piecewise linear functions, with a drop back to nearly 0 at the discontinuity at the Pell grant threshold

A second order effect of universities optimizing for Pell Grant recipients is that savvy parents can do the same thing that some people do to cut their taxable income at the last minute. Someone might put money into a traditional IRA instead of a Roth IRA and, if they're at their IRA contribution limit, they can try to lose money on options, effectively transferring money to options traders who are likely to be wealthier than them, in order to bring their income below the Pell Grant threshold, increasing the probability that their children will be admitted to a selective school.

Election statistics

The following histograms of Russian elections across polling stations show curious spikes in turnout and results at nice, round numbers (e.g., 95%) starting around 2004. This appears to indicate that there's election fraud via fabricated results and that at least some of the people fabricating results don't bother with fabricating results that have a smooth distribution.

For finding fraudulent numbers, also see Benford's law.

Used car sale prices

Mark Ainsworth points out that there are discontinuities at $10k boundaries in U.S. auto auction sales prices as well as volume of vehicles offered at auction. The price graph below adjusts for a number of factors such as model year, but we can see the same discontinuities in the raw unadjusted data.

Graph of car sales prices at auction, showing discontinuities described above

Graph of car volumes at auction, showing discontinuities described above for dealer sales to auction but not fleet sales to auction

p-values

Authors of psychology papers are incentivized to produce papers with p values below some threshold, usually 0.05, but sometimes 0.1 or 0.01. Masicampo et al. plotted p values from papers published in three psychology journals and found a curiously high number of papers with p values just below 0.05.

Histogram of published p-values; spike at p=0.05

The spike at p = 0.05 is consistent with a number of hypotheses that aren't great, e.g., that results are being nudged or selectively reported until they clear the threshold.

Head et al. (2015) surveys the evidence across a number of fields.

Andrew Gelman and others have been campaigning to get rid of the idea of statistical significance and p-value thresholds for years, see this paper for a short summary of why. Not only would this reduce the incentive for authors to cheat on p values, there are other reasons to not want a bright-line rule to determine if something is "significant" or not.

Drug charges

The top two graphs in this set of four show histograms of the amount of cocaine people were charged with possessing before and after the passing of the Fair Sentencing Act in 2010, which raised the amount of cocaine necessary to trigger the 10-year mandatory minimum prison sentence for possession from 50g to 280g. There's a relatively smooth distribution before 2010 and a sharp discontinuity after 2010.

The bottom-left graph shows the sharp spike in prosecutions at 280 grams followed by what might be a drop in 2013 after evidentiary standards were changed1.

High school exit exam scores

This is a histogram of high school exit exam scores from the Polish language exam. We can see that a curiously high number of students score 30 or just above 30 while a curiously low number of students score from 23-29. This is from 2013; other years I've looked at (2010-2012) show a similar discontinuity.

Math exit exam scores don't exhibit any unusual discontinuities in the years I've examined (2010-2013).

An anonymous reddit commenter explains this:

When a teacher is grading matura (final HS exam), he/she doesn't know whose test it is. The only things that are known are: the number (code) of the student and the district which matura comes from (it is usually from completely different part of Poland). The system is made to prevent any kind of manipulation, for example from time to time teachers supervisor will come to check if test are graded correctly. I don't wanna talk much about system flaws (and advantages), it is well known in every education system in the world where final tests are made, but you have to keep in mind that there is a key, which teachers follow very strictly when grading.

So, when a score of the test is below 30%, exam is failed. However, before making final statement in protocol, a commision of 3 (I don't remember exact number) is checking test again. This is the moment, where difference between humanities and math is shown: teachers often try to find a one (or a few) missing points, so the test won't be failed, because it's a tragedy to this person, his school and somewhat fuss for the grading team. Finding a "missing" point is not that hard when you are grading writing or open questions, which is a case in polish language, but nearly impossible in math. So that's the reason why distribution of scores is so different.

As with p values, having a bright-line threshold causes curious behavior. In this case, scoring below 30 on any subject (a 30 or above is required in every subject) means failing the exam, which has arbitrary negative effects for people, so teachers usually try to prevent people from failing if there's an easy way to do it, but a deeper root of the problem is the idea that it's necessary to produce a certification that's the discretization of a continuous score.

Birth month and sports

These are scatterplots of football (soccer) players in the UEFA Youth League. The x-axis on both of these plots is how old players are modulo the year, i.e., their birth month normalized from 0 to 1.

The graph on the left is a histogram, which shows that there is a very strong relationship between where a person's birth falls within the year and their odds of making a club at the UEFA Youth League (U19) level. The graph on the right purports to show that birth time is only weakly correlated with actual value provided on the field. The authors use playing time as a proxy for value, presumably because it's easy to measure. That's not a great measure, but the result they find (younger-within-the-year players have higher value, conditional on making the U19 league) is consistent with other studies on sports and discrimination, which find (for example) that black baseball players were significantly better than white baseball players for decades after desegregation in baseball and that French-Canadian defensemen in hockey are also better than average (French-Canadians are stereotypically afraid to fight, don't work hard enough, and are too focused on offense).

The discontinuity isn't directly shown in the graphs above because the graphs only show birth date for one year. If we were to plot birth date by cohort across multiple years, we'd expect to see a sawtooth pattern in the probability that a player makes it into the UEFA youth league with a 10x difference between someone born one day before vs. after the threshold.

This phenomenon, that birth day or month is a good predictor of participation in higher-level youth sports as well as pro sports, has been studied across a variety of sports.

It's generally believed that this is caused by a discontinuity in youth sports:

  1. Kids are bucketed into groups by age in years and compete against people in the same year
  2. Within a given year, older kids are stronger, faster, etc., and perform better
  3. This causes older-within-year kids to outcompete younger kids, which later results in older-within-year kids having higher levels of participation for a variety of reasons

This is arguably a "bug" in how youth sports works. But as we've seen in baseball as well as a survey of multiple sports, obviously bad decision making that costs individual teams tens or even hundreds of millions of dollars can persist for decades in the face of people publicly discussing how bad the decisions are. In this case, the youth sports teams aren't feeder teams to pro teams, so they don't have a financial incentive to select players who are skilled for their age (as opposed to just taller and faster because they're slightly older), which makes this system-wide non-optimality even more difficult to fix than pro sports teams making locally non-optimal decisions that are completely under their control.

Procurement auctions

Kawai et al. looked at Japanese government procurement in order to find suspicious patterns of bids like the ones described in Porter et al. (1993), which looked at collusion in procurement auctions on Long Island (in New York in the United States). One example that's given is:

In February 1983, the New York State Department of Transportation (DoT) held a procurement auction for resurfacing 0.8 miles of road. The lowest bid in the auction was $4 million, and the DoT decided not to award the contract because the bid was deemed too high relative to its own cost estimates. The project was put up for a reauction in May 1983 in which all the bidders from the initial auction participated. The lowest bid in the reauction was 20% higher than in the initial auction, submitted by the previous low bidder. Again, the contract was not awarded. The DoT held a third auction in February 1984, with the same set of bidders as in the initial auction. The lowest bid in the third auction was 10% higher than the second time, again submitted by the same bidder. The DoT apparently thought this was suspicious: “It is notable that the same firm submitted the low bid in each of the auctions. Because of the unusual bidding patterns, the contract was not awarded through 1987.”

It could be argued that this is expected because different firms have different cost structures, so the lowest bidder in an auction for one particular project should be expected to be the lowest bidder in subsequent auctions for the same project. In order to distinguish between collusion and real structural cost differences between firms, Kawai et al. (2015) looked at auctions where the difference in bid between the first and second place firms was very small, making the winner effectively random.

In the auction structure studied, bidders submit a secret bid. If the lowest bid is below a secret reserve price, the lowest bidder wins the auction and gets the contract. If not, the lowest bid is revealed to all bidders and another round of bidding is done. Kawai et al. found that, in about 97% of auctions, the bidder who submitted the lowest bid in the first round also submitted the lowest bid in the second round (the probability that the second lowest bidder remains second lowest was 26%).

Below is a histogram of the difference between first and second round bids for the first-lowest and second-lowest bidders (left column) and the second-lowest and third-lowest bidders (right column). Each row has a different filtering criterion for how close the auction has to be in order to be included. In the top row, all auctions that reached the third round were included; in the second and third rows, the normalized delta between the first and second bidders was less than 0.05 and 0.01, respectively; in the last row, the normalized delta between the first and the third bidder was less than 0.03. All numbers are normalized because the absolute size of auctions can vary.

We can see that the distributions of deltas between the first and second round are roughly symmetrical when comparing the second and third lowest bidders. But when comparing the first and second lowest bidders, there's a sharp discontinuity at zero, indicating that the second-lowest bidder almost never lowers their bid by more than the first-lowest bidder did. If you read the paper, you can see that the same structure persists into auctions that go into a third round.

I don't mean to pick on Japanese procurement auctions in particular. There's an extensive literature on procurement auctions that's found collusion in many cases, often much more blatant than the case presented above (e.g., there are a few firms and they round-robin who wins across auctions, or there are a handful of firms and every firm except for the winner puts in the same losing bid).

Restaurant inspection scores

The histograms below show a sharp discontinuity between 13 and 14, which is the difference between an A grade and a B grade. It appears that some regions also have a discontinuity between 27 and 28, which is the difference between a B and a C and this older analysis from 2014 found what appears to be a similar discontinuity between B and C grades.

Inspectors have discretion in what violations are tallied and it appears that there are cases where restaurants are nudged up to the next higher grade.

Marathon finishing times

A histogram of marathon finishing times (finish times on the x-axis, count on the y-axis) across 9,789,093 finishes shows noticeable discontinuities at every half hour, as well as at "round" times like :10, :15, and :20.

An analysis of times within each race (see section 4.4, figures 7-9) indicates that this is at least partially because people speed up (or slow down less than usual) towards the end of races if they're close to a "round" time2.

Notes

This post doesn't really have a goal or a point, it's just a collection of discontinuities that I find fun.

One thing that's maybe worth noting is that I've gotten a lot of mileage in my career both out of being suspicious of discontinuities and figuring out where they come from and out of applying standard techniques to smooth out discontinuities.

For finding discontinuities, basic tools like "drawing a scatterplot", "drawing a histogram", "drawing the CDF" often come in handy. Other kinds of visualizations that add temporality, like flamescope, can also come in handy.

We noted above that queues create a kind of discontinuity that, in some circumstances, should be smoothed out. We also noted that we see similar behavior for other kinds of thresholds and that randomization can be a useful tool to smooth out discontinuities in thresholds as well. Randomization can also be used to allow for reducing quantization error when reducing precision with ML and in other applications.

Thanks to Leah Hanson, Omar Rizwan, Dmitry Belenko, Kamal Marhubi, Danny Vilea, Nick Roberts, Lifan Zeng, Mark Ainsworth, Wesley Aptekar-Cassels, Thomas Hauk, @BaudDev, and Michael Sullivan for comments/corrections/discussion.

Also, please feel free to send me other interesting discontinuities!


  1. Most online commentary I've seen about this paper is incorrect. I've seen this paper used as evidence of police malfeasance because the amount of cocaine seized jumped to 280g. This is the opposite of what's described in the paper, where the author notes that, based on drug seizure records, amounts seized do not appear to be the cause of this change. After noting that drug seizures are not the cause, the author notes that prosecutors can charge people for amounts that are not the same as the amount seized and then notes:

    I do find bunching at 280g after 2010 in case management data from the Executive Office of the US Attorney (EOUSA). I also find that approximately 30% of prosecutors are responsible for the rise in cases with 280g after 2010, and that there is variation in prosecutor-level bunching both within and between districts. Prosecutors who bunch cases at 280g also have a high share of cases right above 28g after 2010 (the 5-year threshold post-2010) and a high share of cases above 50g prior to 2010 (the 10-year threshold pre-2010). Also, bunching above a mandatory minimum threshold persists across districts for prosecutors who switch districts. Moreover, when a “bunching” prosecutor switches into a new district, all other attorneys in that district increase their own bunching at mandatory minimums. These results suggest that the observed bunching at sentencing is specifically due to prosecutorial discretion

    This is mentioned in the abstract and then expounded on in the introduction (the quoted passage is from the introduction), so I think that most people commenting on this paper can't have read it. I've done a few surveys of comments on papers on blog posts and I generally find that, in cases where it's possible to identify this (e.g., when the post is mistitled), the vast majority of commenters can't have read the paper or post they're commenting on, but that's a topic for another post.

    There is some evidence that something fishy may be going on in seizures (e.g., see Fig. A8.(c)), but if the analysis in the paper is correct, the impact of that is much smaller than the impact of prosecutorial discretion.

    [return]
  2. One of the most common comments I've seen online about this graph and/or this paper is that this is due to pace runners provided by the marathon. Section 4.4 of the paper gives multiple explanations for why this cannot be the case, once again indicating that people tend to comment without reading the paper. [return]

95%-ile isn't that good

2020-02-07 08:00:00

Reaching 95%-ile isn't very impressive because it's not that hard to do. I think this is one of my most ridiculable ideas. It doesn't help that, when stated nakedly, that sounds elitist. But I think it's just the opposite: most people can become (relatively) good at most things.

Note that when I say 95%-ile, I mean 95%-ile among people who participate, not all people (for many activities, just doing it at all makes you 99%-ile or above across all people). I'm also not referring to 95%-ile among people who practice regularly. The "one weird trick" is that, for a lot of activities, being something like 10%-ile among people who practice can make you something like 90%-ile or 99%-ile among people who participate.

This post is going to refer to specifics since the discussions I've seen about this are all in the abstract, which turns them into Rorschach tests. For example, Scott Adams has a widely cited post claiming that it's better to be a generalist than a specialist because, to become "extraordinary", you have to either be "the best" at one thing or 75%-ile at two things. If that were strictly true, it would surely be better to be a generalist, but that's of course exaggeration and it's possible to get a lot of value out of a specialized skill without being "the best"; since the precise claim, as written, is obviously silly and the rest of the post is vague handwaving, discussions will inevitably devolve into people stating their prior beliefs and basically ignoring the content of the post.

Personally, in every activity I've participated in where it's possible to get a rough percentile ranking, people who are 95%-ile constantly make mistakes that seem like they should be easy to observe and correct. "Real world" activities typically can't be reduced to a percentile rating, but achieving what appears to be a similar level of proficiency seems similarly easy.

We'll start by looking at Overwatch (a video game) in detail because it's an activity I'm familiar with where it's easy to get ranking information and observe what's happening, and then we'll look at some "real world" examples where we can observe the same phenomena, although we won't be able to get ranking information for real world examples1.

Overwatch

At 90%-ile and 95%-ile ranks in Overwatch, the vast majority of players will pretty much constantly make basic game losing mistakes. These are simple mistakes like standing next to the objective instead of on top of the objective while the match timer runs out, turning a probable victory into a certain defeat. See the attached footnote if you want enough detail about specific mistakes that you can decide for yourself if a mistake is "basic" or not2.

Some reasons we might expect this to happen are:

  1. People don't want to win or don't care about winning
  2. People understand their mistakes but haven't put in enough time to fix them
  3. People are untalented
  4. People don't understand how to spot their mistakes and fix them

In Overwatch, you may see a lot of (1), people who don’t seem to care about winning, at lower ranks, but by the time you get to 30%-ile, it's common to see people indicate their desire to win in various ways, such as yelling at players who are perceived as uncaring about victory or unskilled, complaining about people who they perceive to make mistakes that prevented their team from winning, etc.3. Other than the occasional troll, it's not unreasonable to think that people are generally trying to win when they're severely angered by losing.

(2), not having put in enough time to fix their mistakes, will, at some point, apply to all players who are improving, but if you look at the median time played at 50%-ile, people who are stably ranked there have put in hundreds of hours (and the median time played at higher ranks is higher). Given how simple the mistakes we're discussing are, not having put in enough time cannot be the case for most players.

A common complaint among low-ranked Overwatch players in Overwatch forums is that they're just not talented and can never get better. Most people probably don't have the talent to play in a professional league regardless of their practice regimen, but when you can get to 95%-ile by fixing mistakes like "not realizing that you should stand on the objective", you don't really need a lot of talent to get to 95%-ile.

While (4), people not understanding how to spot and fix their mistakes, isn't the only other possible explanation4, I believe it's the most likely explanation for most players. Most players who express frustration that they're stuck at a rank up to maybe 95%-ile or 99%-ile don't seem to realize that they could drastically improve by observing their own gameplay or having someone else look at their gameplay.

One thing that's curious about this is that Overwatch makes it easy to spot basic mistakes (compared to most other activities). After you're killed, the game shows you how you died from the viewpoint of the player who killed you, allowing you to see what led to your death. Overwatch also records the entire game and lets you watch a replay of the game, allowing you to figure out what happened and why the game was won or lost. In many other games, you'd have to set up recording software to be able to view a replay.

If you read Overwatch forums, you'll see a regular stream of posts that are basically "I'm SOOOOOO FRUSTRATED! I've played this game for 1200 hours and I'm still ranked 10%-ile, [some Overwatch specific stuff that will vary from player to player]". Another user will inevitably respond with something like "we can't tell what's wrong from your text, please post a video of your gameplay". In the cases where the original poster responds with a recording of their play, people will post helpful feedback that will immediately make the player much better if they take it seriously. If you follow these people who ask for help, you'll often see them ask for feedback at a much higher rank (e.g., moving from 10%-ile to 40%-ile) shortly afterwards. It's nice to see that the advice works, but it's unfortunate that so many players don't realize that watching their own recordings or posting recordings for feedback could have saved 1198 hours of frustration.

It appears to be common for Overwatch players (well into 95%-ile and above) to be intensely frustrated about their rank while never reviewing their own gameplay or asking anyone else to review it.

Overwatch provides the tools to make it relatively easy to get feedback, but people who very strongly express a desire to improve don't avail themselves of these tools.

Real life

My experience is that other games are similar and I think that "real life" activities aren't so different, although there are some complications.

One complication is that real life activities tend not to have a single, one-dimensional, objective to optimize for. Another is that what makes someone good at a real life activity tends to be poorly understood (by comparison to games and sports) even in relation to a specific, well defined, goal.

Games with rating systems are easy to optimize for: your meta-goal can be to get a high rating, which can typically be achieved by increasing your win rate by fixing the kinds of mistakes described above, like not realizing that you should step onto the objective. For any particular mistake, you can even make a reasonable guess at the impact on your win rate and therefore the impact on your rating.

In real life, if you want to be (for example) "a good speaker", that might mean that you want to give informative talks that help people learn or that you want to give entertaining talks that people enjoy or that you want to give keynotes at prestigious conferences or that you want to be asked to give talks for $50k an appearance. Those are all different objectives, with different strategies for achieving them and for some particular mistake (e.g., spending 8 minutes on introducing yourself during a 20 minute talk), it's unclear what that means with respect to your goal.

Another thing that makes games, at least mainstream ones, easy to optimize for is that they tend to have a lot of aficionados who have obsessively tried to figure out what's effective. This means that if you want to improve, unless you're trying to be among the top in the world, you can simply figure out what resources have worked for other people, pick one up, read/watch it, and then practice. For example, if you want to be 99%-ile in a trick-taking card game like bridge or spades (among all players, not subgroups like "ACBL players with masterpoints" or "people who regularly attend North American Bridge Championships"), you can do this by:

If you want to become a good speaker and you have a specific definition of “a good speaker” in mind, there still isn't an obvious path forward. Great speakers will give directly contradictory advice (e.g., avoid focusing on presentation skills vs. practice presentation skills). Relatively few people obsessively try to improve and figure out what works, which results in a lack of rigorous curricula for improving. However, this also means that it's easy to improve in percentile terms since relatively few people are trying to improve at all.

Despite all of the caveats above, my belief is that it's easier to become relatively good at real life activities than at games or sports because there's so little deliberate practice put into most real life activities. Just for example, if you're a local table tennis hotshot who can beat every rando at a local bar, when you challenge someone to a game and they say "sure, what's your rating?" you know you're in for a shellacking by someone who can probably beat you while playing with a shoe brush (an actual feat that happened to a friend of mine, BTW). You're probably 99%-ile, but someone with no talent who's put in the time to practice the basics is going to have a serve that you can't return as well as be able to kill any shot a local bar expert is able to consistently hit. In most real life activities, there's almost no one who puts in a level of deliberate practice equivalent to someone who goes down to their local table tennis club and practices two hours a week, let alone someone like a top pro, who might seriously train for four hours a day.

To give a couple of concrete examples, I helped Leah prepare for talks from 2013 to 2017. The first couple practice talks she gave were about the same as you'd expect if you walked into a random talk at a large tech conference. For the first couple years she was speaking, she did something like 30 or so practice runs for each public talk, of which I watched and gave feedback on half. Her first public talk was (IMO) well above average for a talk at a large, well regarded, tech conference and her talks got better from there until she stopped speaking in 2017.

As we discussed above, this is more subjective than game ratings and there's no way to really determine a percentile, but if you look at how most people prepare for talks, it's not too surprising that Leah was above average. At one of the first conferences she spoke at, the night before the conference, we talked to another speaker who mentioned that they hadn't finished their talk yet and only had fifteen minutes of material (for a forty minute talk). They were trying to figure out how to fill the rest of the time. That kind of preparation isn't unusual and the vast majority of talks prepared like that aren't great.

Most people consider doing 30 practice runs for a talk to be absurd, a totally obsessive amount of practice, but I think Gary Bernhardt has it right when he says that, if you're giving a 30-minute talk to a 300 person audience, that's 150 person-hours watching your talk, so it's not obviously unreasonable to spend 15 hours practicing (and 30 practice runs will probably be less than 15 hours since you can cut a number of the runs short and/or repeatedly practice problem sections). One thing to note is that this level of practice, considered obsessive when giving a talk, still pales in comparison to the amount of time a middling table tennis club player will spend practicing.

If you've studied pedagogy, you might say that the help I gave to Leah was incredibly lame. It's known that having laypeople try to figure out how to improve among themselves is among the worst possible ways to learn something; good instruction is more effective, and having a skilled coach or teacher give one-on-one instruction is more effective still5. That's 100% true: my help was incredibly lame. However, most people aren't going to practice a talk more than a couple times and many won't even practice a single time (I don't have great data proving this; it's from informally polling speakers at conferences I've attended). This makes Leah's 30 practice runs an extraordinary amount of practice compared to most speakers, which resulted in a relatively good outcome even though we were using one of the worst possible techniques for improvement.

My writing is another example. I'm not going to compare myself to anyone else, but my writing improved dramatically the first couple of years I wrote this blog just because I spent a little bit of effort on getting and taking feedback.

Leah read one or two drafts of almost every post and gave me feedback. On the first posts, since neither one of us knew anything about writing, we had a hard time identifying what was wrong. If I had some awkward prose or confusing narrative structure, we'd be able to point at it and say "that looks wrong" without being able to describe what was wrong or suggest a fix. It was like, in the era before spellcheck, when you misspelled a word and could tell that something was wrong, but every permutation you came up with was just as wrong.

My fix for that was to hire a professional editor whose writing I respected with the instructions "I don't care about spelling and grammar fixes, there are fundamental problems with my writing that I don't understand, please explain to me what they are"6. I think this was more effective than my helping Leah with talks because we got someone who's basically a professional coach involved. An example of something my editor helped us with was giving us a vocabulary we could use to discuss structural problems, the way design patterns gave people a vocabulary to talk about OO design.

Back to this blog's regularly scheduled topic: programming

Programming is similar to the real life examples above in that it's impossible to assign a rating or calculate percentiles or anything like that, but it is still possible to make significant improvements relative to your former self without too much effort by getting feedback on what you're doing.

For example, here's one thing Michael Malis did:

One incredibly useful exercise I’ve found is to watch myself program. Throughout the week, I have a program running in the background that records my screen. At the end of the week, I’ll watch a few segments from the previous week. Usually I will watch the times that felt like it took a lot longer to complete some task than it should have. While watching them, I’ll pay attention to specifically where the time went and figure out what I could have done better. When I first did this, I was really surprised at where all of my time was going.

For example, previously when writing code, I would write all my code for a new feature up front and then test all of the code collectively. When testing code this way, I would have to isolate which function the bug was in and then debug that individual function. After watching a recording of myself writing code, I realized I was spending about a quarter of the total time implementing the feature tracking down which functions the bugs were in! This was completely non-obvious to me and I wouldn’t have found it out without recording myself. Now that I’m aware that I spent so much time isolating which function a bug is in, I now test each function as I write it to make sure it works. This allows me to write code a lot faster as it dramatically reduces the amount of time it takes to debug my code.

In the past, I've spent time figuring out where time is going when I code and basically saw the same thing as in Overwatch, except instead of constantly making game-losing mistakes, I was constantly doing pointlessly time-losing things. Just getting rid of some of those bad habits has probably been at least a 2x productivity increase for me, pretty easy to measure since fixing these is basically just clawing back wasted time. For example, I noticed how I'd get distracted for N minutes if I read something on the internet when I needed to wait for two minutes, so I made sure to keep a queue of useful work to fill dead time (and if I was working on something very latency sensitive where I didn't want to task switch, I'd do nothing until I was done waiting).

One thing to note here is that it's important to actually track what you're doing and not just guess at what you're doing. When I've recorded what people do and compare it to what they think they're doing, these are often quite different. It would generally be considered absurd to operate a complex software system without metrics or tracing, but it's normal to operate yourself without metrics or tracing, even though you're much more complex and harder to understand than the software you work on.

Jonathan Tang has noted that choosing the right problem dominates execution speed. I don't disagree with that, but doubling execution speed is still a decent win that's independent of selecting the right problem to work on. I don't think choosing the right problem can be effectively discussed in the abstract, and the context necessary to give examples would be much longer than the already too long Overwatch examples in this post, so maybe I'll write another post that's just about that.

Anyway, this is sort of an odd post for me to write since I think that culturally, we care a bit too much about productivity in the U.S., especially in places I've lived recently (NYC & SF). But at a personal level, higher productivity doing work or chores doesn't have to be converted into more work or chores, it can also be converted into more vacation time or more time doing whatever you value.

And for games like Overwatch, I don't think improving is a moral imperative; there's nothing wrong with having fun at 50%-ile or 10%-ile or any rank. But in every game I've played with a rating and/or league/tournament system, a lot of people get really upset and unhappy when they lose even when they haven't put much effort into improving. If that's the case, why not put a little bit of effort into improving and spend a little bit less time being upset?

Some meta-techniques for improving

Of course, these aren't novel ideas, e.g., Kotov's series of books from the 70s, Think like a Grandmaster, Play Like a Grandmaster, Train Like a Grandmaster covered these same ideas because these are some of the most obvious ways to improve.

Appendix: other most ridiculable ideas

Here are the ideas I've posted about that were the most widely ridiculed at the time of the post:

My posts on compensation have the dubious distinction of being the posts most frequently called out both for being so obvious that they're pointless as well as for being laughably wrong. I suspect they're also the posts that have had the largest aggregate impact on people -- I've had a double digit number of people tell me one of the compensation posts changed their life and they now make $x00,000/yr more than they used to because they know it's possible to get a much higher paying job and I doubt that I even hear from 10% of the people who make a big change as a result of learning that it's possible to make a lot more money.

When I wrote my first post on compensation, in 2015, I got ridiculed more for writing something obviously wrong than for writing something obvious, but the last few years things have flipped around. I still get the occasional bit of ridicule for being wrong when some corner of Twitter or a web forum that's well outside the HN/reddit bubble runs across my post, but the ratio of “obviously wrong” to “obvious” has probably gone from 20:1 to 1:5.

Opinions on monorepos have also seen a similar change since 2015. Outside of some folks at big companies, monorepos used to be considered obviously stupid among people who keep up with trends, but this has really changed. Not as much as opinions on compensation, but enough that I'm now a little surprised when I meet a hardline anti-monorepo-er.

Although it's taken longer for opinions to come around on CPU bugs, that's probably the post that now gets the least ridicule from the list above.

That markets don't eliminate all discrimination is the one where opinions have come around the least. Hardline "all markets are efficient" folks aren't really convinced by academic work like Becker's The Economics of Discrimination or summaries like the evidence laid out in the post.

The posts on computers having higher latency and the lack of empirical evidence of the benefit of types are the posts I've seen pointed to the most often to defend a ridiculable opinion. I didn't know what I'd find when I started doing the work for either post, and they both happened to turn up evidence that's the opposite of the most common loud claims (that there's very good evidence that advanced type systems improve safety in practice and that, of course, computers are faster in every way; people who think they're slower are just indulging in nostalgia). I don't know if this has changed many opinions. However, I haven't gotten much direct ridicule for either post even though both posts directly state a position I see commonly ridiculed online. I suspect that's partially because both posts are empirical, so there's not much to dispute (though the post on discrimination is also empirical, but it still gets its share of ridicule).

The last idea in the list is more meta: no one directly tells me that I should use more obscure terminology. Instead, I get comments that I must not know much about X because I'm not using terms of art. Using terms of art is a common way to establish credibility or authority, but that's something I don't really believe in. Arguing from authority doesn't tell you anything; adding needless terminology just makes things more difficult for readers who aren't in the field and are reading because they're interested in the topic but don't want to actually get into the field.

This is a pretty fundamental disagreement that I have with a lot of people. Just for example, I recently got into a discussion with an authority who insisted that it wasn't possible for me to reasonably disagree with them (I suggested we agree to disagree) because they're an authority on the topic and I'm not. It happens that I worked on the formal verification of a system very similar to the system we were discussing, but I didn't mention that because I don't believe that my status as an authority on the topic matters. If someone has such a weak argument that they have to fall back on an infallible authority, that's usually a sign that they don't have a well-reasoned defense of their position. This goes double when they point to themselves as the infallible authority.

I have about 20 other posts on stupid sounding ideas queued up in my head, but I mostly try to avoid writing things that are controversial, so I don't know that I'll write many of those up. If I were to write one post a month (much more frequently than my recent rate) and limit posts on ridiculable ideas to 10% of the total, it would take 16 years to write up all of the ridiculable ideas I currently have.

Appendix: commentary on improvement

Thanks to Leah Hanson, Hillel Wayne, Robert Schuessler, Michael Malis, Kevin Burke, Jeremie Jost, Pierre-Yves Baccou, Veit Heller, Jeff Fowler, Malte Skarupe, David Turner, Akiva Leffert, Lifan Zeng, John Hergenroder, Wesley Aptekar-Cassels, Chris Lample, Julia Evans, Anja Boskovic, Vaibhav Sagar, Sean Talts, Emil Sit, Ben Kuhn, Valentin Hartmann, Sean Barrett, Kevin Shannon, Enzo Ferey, Andrew McCollum, Yuri Vishnevsky, and an anonymous commenter for comments/corrections/discussion.


  1. The choice of Overwatch is arbitrary among activities I'm familiar with where:

    • I know enough about the activity to comment on it
    • I've observed enough people trying to learn it that I can say if it's "easy" or not to fix some mistake or class of mistake
    • There's a large enough set of rated players to support the argument
    • Many readers will also be familiar with the activity

    99% of my gaming background comes from 90s video games, but I'm not going to use those as examples because relatively few readers will be familiar with those games. I could also use "modern" board games like Puerto Rico, Dominion, Terra Mystica, ASL etc., but the number of people who play in rated games is very low, which makes the argument less convincing (perhaps people who play in rated games are much worse than people who don't — unlikely, but difficult to justify without comparing gameplay between rated and unrated games, which is pretty deep into the weeds for this post).

    There are numerous activities that would be better to use than Overwatch, but I'm not familiar enough with them to use them as examples. For example, on reading a draft of this post, Kevin Burke noted that he's observed the same thing while coaching youth basketball and multiple readers noted that they've observed the same thing in chess, but I'm not familiar enough with youth basketball or chess to confidently say much about either activity, even though they'd be better examples because it's likely that more readers are familiar with basketball or chess than with Overwatch.

    [return]
  2. When I first started playing Overwatch (which is when I did that experiment), I ended up getting rated slightly above 50%-ile (for Overwatch players, that was in Plat -- this post is going to use percentiles and not ranks to avoid making non-Overwatch players have to learn what the ranks mean). It's generally believed and probably true that people who play the main ranked game mode in Overwatch are, on average, better than people who only play unranked modes, so it's likely that my actual percentile was somewhat higher than 50%-ile and that all "true" percentiles listed in this post are higher than the nominal percentiles.

    Some things you'll regularly see at slightly above 50%-ile are:

    • Supports (healers) will heal someone who's at full health (which does nothing) while a teammate who's next to them is dying and then dies
    • Players will not notice someone who walks directly behind the team and kills people one at a time until the entire team is killed
    • Players will shoot an enemy until only one more shot is required to kill the enemy and then switch to a different target, letting the 1-health enemy heal back to full health before shooting at that enemy again
    • After dying, players will not wait for their team to respawn and will, instead, run directly into the enemy team to fight them 1v6. This will repeat for the entire game (the game is designed to be 6v6, but in ranks below 95%-ile, it's rare to see a 6v6 engagement after one person on one team dies)
    • Players will clearly have no idea what character abilities do, including for the character they're playing
    • Players go for very high risk but low reward plays (for Overwatch players, a classic example of this is Rein going for a meme pin when the game opens on 2CP defense, very common at 50%-ile, rare at 95%-ile since players who think this move is a good idea tend to have generally poor decision making).
    • People will have terrible aim and will miss four or five shots in a row when all they need to do is hit someone once to kill them
    • If a single flanking enemy threatens a healer who can't escape plus a non-healer with an escape ability, the non-healer will probably use their ability to run away, leaving the healer to die, even though they could easily kill the flanker and save their healer if they just attacked while being healed.

    Having just one aspect of your gameplay be merely bad instead of atrocious is enough to get to 50%-ile. For me, that was my teamwork; for others, it's other parts of their gameplay. The reason I'd say that my teamwork was bad and not good or even mediocre was that I basically didn't know how to play the game, didn't know what any of the characters’ strengths, weaknesses, and abilities were, so I couldn't possibly coordinate effectively with my team. I also didn't know how the game modes actually worked (e.g., under what circumstances the game will end in a tie vs. going into another round), so I was basically wandering around randomly with a preference towards staying near the largest group of teammates I could find. That's above average.

    You could say that someone is pretty good at the game since they're above average. But in a non-relative sense, being slightly above average is quite bad -- it's hard to argue that someone who doesn't notice their entire team being killed from behind while two teammates are yelling "[enemy] behind us!" over voice comms isn't bad.

    After playing a bit more, I ended up with what looks like a "true" rank of about 90%-ile when I'm using a character I know how to use. Due to volatility in ranking as well as matchmaking, I played in games as high as 98%-ile. My aim and dodging were still atrocious. Relative to my rank, my aim was actually worse than when I was playing in 50%-ile games since my opponents were much better and I was only a little bit better. In 90%-ile, two copies of myself would probably lose fighting against most people 2v1 in the open. I would also usually lose a fight if the opponent was in the open and I was behind cover such that only 10% of my character was exposed, so my aim was arguably more than 10x worse than median at my rank.

    My "trick" for getting to 90%-ile despite being a 1/10th aimer was learning how the game worked and playing in a way that maximized the probability of winning (to the best of my ability), as opposed to playing the game like it's an FFA game where your goal is to get kills as quickly as possible. It takes a bit more context to describe what this means in 90%-ile, so I'll only provide a couple examples, but these are representative of mistakes the vast majority of 90%-ile players are making all of the time (with the exception of a few players who have grossly defective aim, like myself, who make up for their poor aim by playing better than average for the rank in other ways).

    Within the game, the goal of the game is to win. There are different game modes, but for the mainline ranked game, they all will involve some kind of objective that you have to be on or near. It's very common to get into a situation where the round timer is ticking down to zero and your team is guaranteed to lose if no one on your team touches the objective but your team may win if someone can touch the objective and not die instantly (which will cause the game to go into overtime until shortly after both teams stop touching the objective). A concrete example of this that happens somewhat regularly: the enemy team has four players on the objective while your team has two players near the objective, one tank and one support/healer. The other four players on your team died and are coming back from spawn. They're close enough that if you can touch the objective and not instantly die, they'll arrive and probably take the objective for the win, but they won't get there in time if you die immediately after taking the objective, in which case you'll lose.

    If you're playing the support/healer at 90%-ile to 95%-ile, this game will almost always end as follows: the tank will move towards the objective, get shot, decide they don't want to take damage, and then back off from the objective. As a support, you have a small health pool and will die instantly if you touch the objective because the other team will shoot you. Since your team is guaranteed to lose if you don't move up to the objective, you're forced to do so to have any chance of winning. After you're killed, the tank will either move onto the objective and die or walk towards the objective but not get there before time runs out. Either way, you'll probably lose.

    If the tank did their job and moved onto the objective before you died, you could heal the tank for long enough that the rest of your team will arrive and you'll probably win. The enemy team, if they were coordinated, could walk around or through the tank to kill you, but they won't do that -- anyone who knows that doing so would win them the game and can aim well enough to successfully follow through can't help but end up in a higher rank. And the hypothetical tank on your team who knows that it's their job to absorb damage for their support in that situation and not vice versa won't stay at 95%-ile very long because they'll win too many games and move up to a higher rank.

    Another basic situation that the vast majority of 90%-ile to 95%-ile players will get wrong is when you're on offense, waiting for your team to respawn so you can attack as a group. Even at 90%-ile, maybe 1/4 to 1/3 of players won't do this and will just run directly at the enemy team, but enough players will realize that 1v6 isn't a good idea that you'll often see 5v6 or 6v6 fights instead of the constant 1v6 and 2v6 fights you see at 50%-ile. Anyway, while waiting for the team to respawn in order to get a 5v6, it's very likely that one player who realizes that they shouldn't just run into the middle of the enemy team 1v6 will decide they should try to hit the enemy team with long-ranged attacks 1v6. People will do this instead of hiding in safety behind a wall even when the enemy has multiple snipers with instant-kill long range attacks. People will even do this against multiple snipers when they're playing a character that isn't a sniper and needs to hit the enemy 2-3 times to get a kill, making it overwhelmingly likely that they won't get a kill while taking a significant risk of dying themselves. For Overwatch players, people will also do this when they have full ult charge and the other team doesn't, turning a situation that should be to your advantage (your team has ults ready and the other team has used ults) into a neutral situation (both teams have ults) at best, and instantly losing the fight at worst.

    If you ever read an Overwatch forum, whether that's one of the reddit forums or the official Blizzard forums, a common complaint is "why are my teammates so bad? I'm at [90%-ile to 95%-ile rank], but all my teammates are doing obviously stupid game-losing things all the time, like [an example above]". The answer is, of course, that the person asking the question is also doing obviously stupid game-losing things all the time because anyone who doesn't constantly make major blunders wins too much to stay at 95%-ile. This also applies to me.

    People will argue that players at this rank should be good because they're better than 95% of other players, which makes them relatively good. But non-relatively, it's hard to argue that someone who doesn't realize that you should step on the objective to probably win the game instead of not touching the objective for a sure loss is good. One of the most basic things about Overwatch is that it's an objective-based game, but the majority of players at 90%-ile to 95%-ile don't play that way.

    For anyone who isn't well into the 99%-ile, reviewing recorded games will reveal game-losing mistakes all the time. For myself, usually ranked 90%-ile or so, watching a recorded game will reveal tens of game-losing mistakes in a close game (which is maybe 30% of losses; the other 70% are blowouts where there isn't a single simple mistake that decides the game).

    It's generally not too hard to fix these since the mistakes are like the example above: simple enough that once you see that you're making the mistake, the fix is straightforward because the mistake is straightforward.

    [return]
  3. There are probably some people who just want to be angry at their teammates. Due to how infrequently you get matched with the same players, it's hard to see this in the main rated game mode, but I think you can sometimes see this when Overwatch runs mini-rated modes.

    Mini-rated modes have a much smaller playerbase than the main rated mode, which has two notable side effects: players with a much wider variety of skill levels will be thrown into the same game and you'll see the same players over and over again if you play multiple games.

    Since you end up matched with the same players repeatedly, you'll see players make the same mistakes and cause themselves to lose in the same way and then have the same tantrum and blame their teammates in the same way game after game.

    You'll also see tantrums and teammate blaming in the normal rated game mode, but when you see it, you generally can't tell if the person who's having a tantrum is just having a bad day or if it's some other one-off occurrence since, unless you're ranked very high or very low (where there's a smaller pool of closely rated players), you don't run into the same players all that frequently. But when you see a set of players in 15-20 games over the course of a few weeks and you see them lose the game for the same reason a double digit number of times followed by the exact same tantrum, you might start to suspect that some fraction of those people really want to be angry and that the main thing they're getting out of playing the game is a source of anger. You might also wonder about this from how some people use social media, but that's a topic for another post.

    [return]
  4. For example, there will also be players who have some kind of disability that prevents them from improving, but at the levels we're talking about, 99%-ile or below, that will be relatively rare (certainly well under 50%, and I think it's not unreasonable to guess that it's well under 10% of people who choose to play the game). IIRC, there's at least one player who's in the top 500 who's deaf (this is severely disadvantageous since sound cues give a lot of fine-grained positional information that cannot be obtained in any other way), at least one legally blind player who's 99%-ile, and multiple players with physical impairments that prevent them from having fine-grained control of a mouse, i.e., who are basically incapable of aiming, who are 99%-ile.

    There are also other kinds of reasons people might not improve. For example, Kevin Burke has noted that when he coaches youth basketball, some children don't want to do drills that they think make them look foolish (e.g., avoiding learning to dribble with their off hand even during drills where everyone is dribbling poorly because they're using their off hand). When I spent a lot of time in a climbing gym with a world class coach who would regularly send a bunch of kids to nationals and some to worlds, I'd observe the same thing in his classes -- kids, even ones who are nationally or internationally competitive, would sometimes avoid doing things because they were afraid it would make them look foolish to their peers. The coach's solution in those cases was to deliberately make the kid look extremely foolish and tell them that it's better to look stupid now than at nationals.

    [return]
  5. Note that, here, a skilled coach is someone who is skilled at coaching, not necessarily someone who is skilled at the activity. People who are skilled at the activity but who haven't explicitly been taught how to teach or spent a lot of time working on teaching are generally poor coaches. [return]
  6. If you read the acknowledgements section of any of my posts, you can see that I get feedback from more than just two people on most posts (and I really appreciate the feedback), but I think that, by volume, well over 90% of the feedback I've gotten has come from Leah and a professional editor. [return]

Algorithms interviews: theory vs. practice

2020-01-05 08:00:00

When I ask people at trendy big tech companies why algorithms quizzes are mandatory, the most common answer I get is something like "we have so much scale, we can't afford to have someone accidentally write an O(n^2) algorithm and bring the site down"1. One thing I find funny about this is, even though a decent fraction of the value I've provided for companies has been solving phone-screen level algorithms problems on the job, I can't pass algorithms interviews! When I say that, people often think I mean that I fail half my interviews or something. It's more than half.

When I wrote a draft blog post of my interview experiences, draft readers panned it as too boring and repetitive because I'd failed too many interviews. I should summarize my failures as a table because no one's going to want to read a 10k word blog post that's just a series of failures, they said (which is good advice; I'm working on a version with a table). I’ve done maybe 40-ish "real" software interviews and passed maybe one or two of them (arguably zero)2.

Let's look at a few examples to make it clear what I mean by "phone-screen level algorithms problem", above.

At one big company I worked for, a team wrote a core library that implemented a resizable array for its own purposes. On each resize that overflowed the array's backing store, the implementation added a constant number of elements and then copied the old array to the newly allocated, slightly larger, array. This is a classic example of how not to implement a resizable array since it results in linear time resizing instead of amortized constant time resizing. It's such a classic example that it's often used as the canonical example when demonstrating amortized analysis.
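
To make the difference concrete, here's a minimal Java sketch of the two growth policies (illustrative code, not the actual library):

    // Illustrative sketch, not the actual library code.
    class ResizableIntArray {
        private int[] backing = new int[16];
        private int size = 0;

        // What the library did: grow by a constant number of elements.
        // Every overflow copies all n existing elements, so n appends
        // cost O(n^2) total -- linear time resizing.
        void addWithConstantGrowth(int x) {
            if (size == backing.length) {
                backing = java.util.Arrays.copyOf(backing, backing.length + 16);
            }
            backing[size++] = x;
        }

        // The standard fix: grow by a constant factor (e.g., doubling).
        // Each element is copied O(1) times on average, so appends are
        // amortized constant time.
        void addWithGeometricGrowth(int x) {
            if (size == backing.length) {
                backing = java.util.Arrays.copyOf(backing, backing.length * 2);
            }
            backing[size++] = x;
        }
    }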

For people who aren't used to big tech company phone screens, typical phone screens that I've received are one of:

This array implementation problem is considered to be so easy that it falls into the "very easy" category and is either a warm-up for the "real" phone screen question or is bundled up with a bunch of similarly easy questions. And yet, this resizable array was responsible for roughly 1% of all GC pressure across all JVM code at the company (it was the second largest source of allocations across all code) as well as a significant fraction of CPU. Luckily, the resizable array implementation wasn't used as a generic resizable array and it was only instantiated by a semi-special-purpose wrapper, which is what allowed this to "only" be responsible for 1% of all GC pressure at the company. If asked as an interview question, it's overwhelmingly likely that most members of the team would've implemented this correctly in an interview. My fixing this made my employer more money annually than I've made in my life.

That was the second largest source of allocations, the number one largest source was converting a pair of long values to byte arrays in the same core library. It appears that this was done because someone wrote or copy pasted a hash function that took a byte array as input, then modified it to take two inputs by taking two byte arrays and operating on them in sequence, which left the hash function interface as (byte[], byte[]). In order to call this function on two longs, they used a handy long to byte[] conversion function in a widely used utility library. That function, in addition to allocating a byte[] and stuffing a long into it, also reverses the endianness of the long (the function appears to have been intended to convert long values to network byte order).

Unfortunately, switching to a more appropriate hash function would've been a major change, so my fix for this was to change the hash function interface to take a pair of longs instead of a pair of byte arrays and have the hash function do the endianness reversal instead of doing it as a separate step (since the hash function was already shuffling bytes around, this didn't create additional work). Removing these unnecessary allocations made my employer more money annually than I've made in my life.
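
Here's a sketch of the shape of that change, with hypothetical names and a stand-in mixing function (the real hash function and utility library were different; the point is just the interface change and the removed allocations):

    import java.nio.ByteBuffer;

    // Sketch with hypothetical names; the real hash and utility library were
    // different. The point is the interface change: take two longs directly
    // instead of two freshly allocated byte arrays.
    class PairHash {
        private static final long SEED = 0x9E3779B97F4A7C15L;

        // Before: each call allocates two byte[8] arrays (and reverses byte
        // order) just to feed a (byte[], byte[]) hash interface.
        static long hashOld(long a, long b) {
            byte[] aBytes = longToReversedBytes(a); // allocation per call
            byte[] bBytes = longToReversedBytes(b); // allocation per call
            return hashBytes(hashBytes(SEED, aBytes), bBytes);
        }

        // After: the interface takes the longs directly and the byte-order
        // reversal is folded into the byte shuffling the hash already does,
        // so there are no per-call allocations.
        static long hashNew(long a, long b) {
            return mix(mix(SEED ^ Long.reverseBytes(a)) ^ Long.reverseBytes(b));
        }

        private static byte[] longToReversedBytes(long v) {
            return ByteBuffer.allocate(8).putLong(Long.reverseBytes(v)).array();
        }

        private static long hashBytes(long h, byte[] bytes) {
            for (byte x : bytes) {
                h = mix(h ^ (x & 0xFF));
            }
            return h;
        }

        // splitmix64 finalizer, used here as a stand-in mixing function
        private static long mix(long z) {
            z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
            z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
            return z ^ (z >>> 31);
        }
    }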

Finding a constant factor speedup isn't technically an algorithms question, but it's also something you see in algorithms interviews. As a follow-up to an algorithms question, I commonly get asked "can you make this faster?" The answer to these often involves doing a simple optimization that will result in a constant factor improvement.

A concrete example that I've been asked twice in interviews is: you're storing IDs as ints, but you already have some context in the question that lets you know that the IDs are densely packed, so you can store them as a bitfield instead. The difference between the bitfield interview question and the real-world superfluous array is that the real-world existing solution is so far afield from the expected answer that you probably wouldn’t be asked to find a constant factor speedup. More likely, you would've failed the interview at that point.
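
A minimal sketch of the bitfield idea looks something like this (java.util.BitSet provides the same thing off the shelf):

    // Sketch: with densely packed IDs in [0, maxId), a set of IDs can be
    // stored as one bit per possible ID instead of 32 bits per stored ID.
    class DenseIdSet {
        private final long[] bits;

        DenseIdSet(int maxId) {
            bits = new long[(maxId + 63) / 64];
        }

        void add(int id) {
            bits[id >>> 6] |= 1L << (id & 63);
        }

        boolean contains(int id) {
            return (bits[id >>> 6] & (1L << (id & 63))) != 0;
        }
    }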

To pick an example from another company, the configuration for BitFunnel, a search index used in Bing, is another example of an interview-level algorithms question3.

The full context necessary to describe the solution is a bit much for this blog post, but basically, there's a set of bloom filters that needs to be configured. One way to do this (which I'm told was being done) is to write a black-box optimization function that uses gradient descent to try to find an optimal solution. I'm told this always resulted in configurations with strange properties and non-idealities, which were worked around by making the backing bloom filters less dense, i.e. throwing more resources (and therefore money) at the problem.

To create a more optimized solution, you can observe that the fundamental operation in BitFunnel is equivalent to multiplying probabilities together, so, for any particular configuration, you can just multiply some probabilities together to determine how a configuration will perform. Since the configuration space isn't all that large, you can then put this inside a few for loops and iterate over the space of possible configurations and then pick out the best set of configurations. This isn't quite right because multiplying probabilities assumes a kind of independence that doesn't hold in reality, but that seems to work ok for the same reason that naive Bayesian spam filtering worked pretty well when it was introduced even though it incorrectly assumes the probability of any two words appearing in an email are independent. And if you want the full solution, you can work out the non-independent details, although that's probably beyond the scope of an interview.
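
As a heavily simplified sketch of what "put this inside a few for loops" might look like, with made-up parameters and a made-up cost model (BitFunnel's actual configuration problem has more structure than this):

    // Heavily simplified sketch with made-up parameters; not BitFunnel's actual
    // model. Score each candidate configuration by multiplying per-bit
    // probabilities (assuming independence, which isn't quite right) and keep
    // the cheapest configuration that meets a false positive budget.
    class ConfigSearch {
        record Config(int rows, int bitsPerRow) {}

        public static void main(String[] args) {
            double bitSetProbability = 0.1; // assumed chance any probed bit is set
            double errorBudget = 1e-4;      // acceptable false positive rate (made up)

            Config best = null;
            double bestCost = Double.MAX_VALUE;

            // The configuration space is small enough to enumerate directly.
            for (int rows = 1; rows <= 16; rows++) {
                for (int bitsPerRow = 1; bitsPerRow <= 8; bitsPerRow++) {
                    // A false positive requires every probed bit to be set; under
                    // independence, that's a product of per-bit probabilities.
                    double falsePositiveRate =
                            Math.pow(bitSetProbability, (double) rows * bitsPerRow);
                    double cost = (double) rows * bitsPerRow; // proxy for memory used
                    if (falsePositiveRate <= errorBudget && cost < bestCost) {
                        bestCost = cost;
                        best = new Config(rows, bitsPerRow);
                    }
                }
            }
            System.out.println("best config: " + best + ", cost: " + bestCost);
        }
    }

The real version has to account for the non-independence mentioned above, but the basic structure really is just nested loops and multiplication.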

Those are just three examples that came to mind; I run into this kind of thing all the time and could come up with tens of examples off the top of my head, perhaps more than a hundred if I sat down and tried to list every example I've worked on, certainly more than a hundred if I list examples I know of that someone else (or no one) has worked on. Both the examples in this post and the ones I haven’t included have these properties:

At the start of this post, we noted that people at big tech companies commonly claim that they have to do algorithms interviews since it's so costly to have inefficiencies at scale. My experience is that these examples are legion at every company I've worked for that does algorithms interviews. Trying to get people to solve algorithms problems on the job by asking algorithms questions in interviews doesn't work.

One reason is that even though big companies try to make sure that the people they hire can solve algorithms puzzles, they also incentivize many or most developers to avoid deploying that kind of reasoning to make money.

Of the three solutions for the examples above, two are in production and one isn't. That's about my normal hit rate if I go to a random team with a diff and don't persistently follow up (as opposed to a team that I have reason to believe will be receptive, or a team that's asked for help, or if I keep pestering a team until the fix gets taken).

If you're very cynical, you could argue that it's surprising the success rate is that high. If I go to a random team, it's overwhelmingly likely that efficiency is in neither the team's objectives nor their org's objectives. The company is likely to have spent a decent amount of effort incentivizing teams to hit their objectives -- what's the point of having objectives otherwise? Accepting my diff will require them to test, integrate, and deploy the change and will create risk (because all deployments have non-zero risk). Basically, I'm asking teams to do some work and take on some risk to do something that's worthless to them. Despite incentives, people will usually take the diff, but they're not very likely to spend a lot of their own spare time trying to find efficiency improvements (and their normal work time will be spent on things that are aligned with the team's objectives)4.

Hypothetically, let's say a company didn't try to ensure that its developers could pass algorithms quizzes but did incentivize developers to use relatively efficient algorithms. I don't think any of the three examples above could have survived, undiscovered, for years nor could they have remained unfixed. Some hypothetical developer working at a company where people profile their code would likely have looked at the hottest items in the profile for the most computationally intensive library at the company. The "trick" for the first two isn't any kind of algorithms wizardry, it's just looking at the profile at all, which is something incentives can fix. The third example is less inevitable since there isn't a standard tool that will tell you to look at the problem. It would also be easy to try to spin the result as some kind of wizardry -- that example formed the core part of a paper that won "best paper award" at the top conference in its field (IR), but the reality is that the "trick" was applying high school math, which means the real trick was having enough time to look for places where high school math might be applicable.

I actually worked at a company that used the strategy of "don't ask algorithms questions in interviews, but do incentivize things that are globally good for the company". During my time there, I only found one single fix that nearly meets the criteria for the examples above (if the company had more scale, it would've met all of the criteria, but due to the company's size, increases in efficiency were worth much less than at big companies -- much more than I was making at the time, but the annual return was still less than my total lifetime earnings to date).

I think the main reason that I only found one near-example is that enough people viewed making the company better as their job, so straightforward high-value fixes tended not to exist because systems were usually designed such that they didn't really have easy-to-spot improvements in the first place. In the rare instances where that wasn't the case, there were enough people who were trying to do the right thing for the company (instead of being forced into obeying local incentives that are quite different from what's globally beneficial to the company) that someone else was probably going to fix the issue before I ever ran into it.

The algorithms/coding part of that company's interview (initial screen plus onsite combined) was easier than the phone screen at major tech companies and we basically didn't do a system design interview.

For a while, we tried an algorithmic onsite interview question that was on the hard side but in the normal range of what you might see in a BigCo phone screen (but still easier than you'd expect to see at an onsite interview). We stopped asking the question because every new grad we interviewed failed the question (we didn't give experienced candidates that kind of question). We simply weren't prestigious enough to get candidates who can easily answer those questions, so it was impossible to hire using the same trendy hiring filters that everybody else had. In contemporary discussions on interviews, what we did is often called "lowering the bar", but it's unclear to me why we should care how high of a bar someone can jump over when little (and in some cases none) of the job they're being hired to do involves jumping over bars. And, in the cases where you do want them to jump over bars, they're maybe 2" high and can easily be walked over.

When measured on actual productivity, that was the most productive company I've worked for. I believe the reasons for that are cultural and too complex to fully explore in this post, but I think it helped that we didn't filter out perfectly good candidates with algorithms quizzes and assumed people could pick that stuff up on the job if we had a culture of people generally doing the right thing instead of focusing on local objectives.

If other companies want people to solve interview-level algorithms problems on the job perhaps they could try incentivizing people to solve algorithms problems (when relevant). That could be done in addition to or even instead of filtering for people who can whiteboard algorithms problems.

Appendix: how did we get here?

Way back in the day, interviews often involved "trivia" questions. Modern versions of these might look like the following:

I heard about this practice back when I was in school and even saw it with some "old school" companies. This was back when Microsoft was the biggest game in town and people who wanted to copy a successful company were likely to copy Microsoft. The most widely read programming blogger at the time (Joel Spolsky) was telling people they need to adopt software practice X because Microsoft was doing it and they couldn't compete without adopting the same practices. For example, in one of the most influential programming blog posts of the era, Joel Spolsky advocates for what he called the Joel test in part by saying that you have to do these things to keep up with companies like Microsoft:

A score of 12 is perfect, 11 is tolerable, but 10 or lower and you’ve got serious problems. The truth is that most software organizations are running with a score of 2 or 3, and they need serious help, because companies like Microsoft run at 12 full-time.

At the time, popular lore was that Microsoft asked people questions like the following (and I was actually asked one of these brainteasers during my interview with Microsoft around 2001, along with precisely zero algorithms or coding questions):

Since I was interviewing during the era when this change was happening, I got asked plenty of trivia questions as well as plenty of brainteasers (including all of the above brainteasers). Some other questions that aren't technically brainteasers that were popular at the time were Fermi problems. Another trend at the time was for behavioral interviews and a number of companies I interviewed with had 100% behavioral interviews with zero technical interviews.

Anyway, back then, people needed a rationalization for copying Microsoft-style interviews. When I asked people why they thought brainteasers or Fermi questions were good, the convenient rationalization people told me was usually that they tell you if a candidate can really think, unlike those silly trivia questions, which only tell you if people have memorized some trivia. What we really need to hire are candidates who can really think!

Looking back, people now realize that this wasn't effective and cargo culting Microsoft's every decision won't make you as successful as Microsoft because Microsoft's success came down to a few key things plus network effects, so copying how they interview can't possibly turn you into Microsoft. Instead, it's going to turn you into a company that interviews like Microsoft but isn't in a position to take advantage of the network effects that Microsoft was able to take advantage of.

For interviewees, the process with brainteasers was basically as it is now with algorithms questions, except that you'd review How Would You Move Mount Fuji before interviews instead of Cracking the Coding Interview to pick up a bunch of brainteaser knowledge that you'll never use on the job instead of algorithms knowledge you'll never use on the job.

Back then, interviewers would learn about questions specifically from interview prep books like "How Would You Move Mount Fuji?" and then ask them to candidates who learned the answers from books like "How Would You Move Mount Fuji?". When I talk to people who are ten years younger than me, they think this is ridiculous -- those questions obviously have nothing to do with the job and being able to answer them well is much more strongly correlated with having done some interview prep than with being competent at the job. Hillel Wayne has discussed how people come up with interview questions today (and I've also seen it firsthand at a few different companies) and, outside of groups that are testing for knowledge that's considered specialized, it doesn't seem all that different today.

At this point, we've gone through a few decades of programming interview fads, each one of which looks ridiculous in retrospect. Either we've finally found the real secret to interviewing effectively and have reasoned our way past whatever roadblocks were causing everybody in the past to use obviously bogus fad interview techniques, or we're in the middle of another fad, one which will seem equally ridiculous to people looking back a decade or two from now.

Without knowing anything about the effectiveness of interviews, at a meta level, since the way people get interview techniques is the same (crib the high-level technique from the most prestigious company around), I think it would be pretty surprising if this wasn't a fad. I would be less surprised to discover that current techniques were not a fad if people were doing or referring to empirical research or had independently discovered what works.

Inspired by a comment by Wesley Aptekar-Cassels, the last time I was looking for work, I asked some people how they checked the effectiveness of their interview process and how they tried to reduce bias in their process. The answers I got (grouped together when similar), in decreasing order of frequency, were:

Appendix: training

As with most real world problems, when trying to figure out why seven, eight, or even nine figure per year interview-level algorithms bugs are lying around waiting to be fixed, there isn't a single "root cause" you can point to. Instead, there's a kind of hedgehog defense of misaligned incentives. Another part of this is that training is woefully underappreciated.

We've discussed that, at all but one company I've worked for, there are incentive systems in place that cause developers to feel like they shouldn't spend time looking at efficiency gains even when a simple calculation shows that there are tens or hundreds of millions of dollars in waste that could easily be fixed. And then because this isn't incentivized, developers tend to not have experience doing this kind of thing, making it unfamiliar, which makes it feel harder than it is. So even when a day of work could return $1m/yr in savings or profit (quite common at large companies, in my experience), people don't realize that it's only a day of work and could be done with only a small compromise to velocity. One way to solve this latter problem is with training, but that's even harder to get credit for than efficiency gains that aren't in your objectives!

Just for example, I once wrote a moderate length tutorial (4500 words, shorter than this post by word count, though probably longer if you add images) on how to find various inefficiencies (how to use an allocation or CPU time profiler, how to do service-specific GC tuning for the GCs we use, how to use some tooling I built that will automatically find inefficiencies in your JVM or container configs, etc., basically things that are simple and often high impact that it's easy to write a runbook for; if you're at Twitter, you can read this at http://go/easy-perf). I've had a couple people who would've previously come to me for help with an issue tell me that they were able to debug and fix an issue on their own and, secondhand, I heard that a couple other people who I don't know were able to go off and increase the efficiency of their service. I'd be surprised if I’ve heard about even 10% of cases where this tutorial helped someone, so I'd guess that this has helped tens of engineers, and possibly quite a few more.

If I'd spent a week doing "real" work instead of writing a tutorial, I'd have something concrete, with quantifiable value, that I could easily put into a promo packet or performance review. Instead, I have this nebulous thing that, at best, counts as a bit of "extra credit". I'm not complaining about this in particular -- this is exactly the outcome I expected. But, on average, companies get what they incentivize. If they expect training to come from developers (as opposed to hiring people to produce training materials, which tends to be very poorly funded compared to engineering) but don't value it as much as they value dev work, then there's going to be a shortage of training.

I believe you can also see training under-incentivized in public educational materials due to the relative difficulty of monetizing education and training. If you want to monetize explaining things, there are a few techniques that seem to work very well. If it's something that's directly obviously valuable, selling a video course that's priced "very high" (hundreds or thousands of dollars for a short course) seems to work. Doing corporate training, where companies fly you in to talk to a room of 30 people and you charge $3k per head also works pretty well.

If you want to reach (and potentially help) a lot of people, putting text on the internet and giving it away works pretty well, but monetization for that works poorly. For technical topics, I'm not sure the non-ad-blocking audience is really large enough to monetize via ads (as opposed to a pay wall).

Just for example, Julia Evans can support herself from her zine income, which she's said has brought in roughly $100k/yr for the past two years. Someone who does very well in corporate training can pull that in with a one or two day training course and, from what I've heard of corporate speaking rates, some highly paid tech speakers can pull that in with two engagements. Those are significantly above average rates, especially for speaking engagements, but since we're comparing to Julia Evans, I don't think it's unfair to use an above average rate.

Appendix: misaligned incentive hedgehog defense, part 3

Of the three examples above, I found one on a team where it was clearly worth zero to me to do anything that was actually valuable to the company and the other two on a team where it was valuable to me to do things that were good for the company, regardless of what they were. In my experience, that's very unusual for a team at a big company, but even on that team, incentive alignment was still quite poor. At one point, after getting a promotion and a raise, I computed the ratio of the amount of money my changes made the company vs. my raise and found that my raise was 0.03% of the money that I made the company, only counting easily quantifiable and totally indisputable impact to the bottom line. The vast majority of my work was related to tooling that had a difficult to quantify value that I suspect was actually larger than the value of the quantifiable impact, so I probably received well under 0.01% of the marginal value I was producing. And that's really an overestimate of how incentivized I was to do the work -- at the margin, I strongly suspect that anything I did was worth zero to me. After the first $10m/yr or maybe $20m/yr, there's basically no difference in terms of performance reviews, promotions, raises, etc. Because there was no upside to doing work and there's some downside (could get into a political fight, could bring the site down, etc.), the marginal return to me of doing more than "enough" work was probably negative.

Some companies will give very large out-of-band bonuses to people regularly, but that work wasn't for a company that does a lot of that, so there's nothing the company could do to indicate that it valued additional work once someone did "enough" work to get the best possible rating on a performance review. From a mechanism design point of view, the company was basically asking employees to stop working once they did "enough" work for the year.

So even on this team, which was relatively well aligned with the company's success compared to most teams, the company's compensation system imposed a low ceiling on how well the team could be aligned.

This also happened in another way. As is common at a lot of companies, managers were given a team-wide budget for raises that was mainly a function of headcount, which was then doled out to team members in a zero-sum way. Unfortunately for each team member (at least in terms of compensation), the team pretty much only had productive engineers, meaning that no one was going to do particularly well in the zero-sum raise game. The team had very low turnover because people like working with good co-workers, but the company was applying one of the biggest levers it has, compensation, to try to get people to leave the team and join less effective teams.

Because this is such a common setup, I've heard of managers at multiple companies who try to retain people who are harmless but ineffective in order to work around this problem. If you were to ask someone, abstractly, whether the company wants to hire and retain people who are ineffective, I suspect they'd tell you no. But insofar as a company can be said to want anything, it wants what it incentivizes.

Thanks to Leah Hanson, Heath Borders, Lifan Zeng, Justin Findlay, Kevin Burke, @chordowl, Peter Alexander, Niels Olson, Kris Shamloo, Chip Thien, Yuri Vishnevsky, and Solomon Boulos for comments/corrections/discussion


  1. For one thing, most companies that copy the Google interview don't have that much scale. But even for companies that do, most people don't have jobs where they're designing high-scale algorithms (maybe they did at Google circa 2003, but from what I've seen at three different big tech companies, most people's jobs are pretty light on algorithms work). [return]
  2. Real is in quotes because I've passed a number of interviews for reasons outside of the interview process. Maybe I had a very strong internal recommendation that could override my interview performance, maybe someone read my blog and assumed that I can do reasonable work based on my writing, maybe someone got a backchannel reference from a former co-worker of mine, or maybe someone read some of my open source code and judged me on that instead of a whiteboard coding question (and as far as I know, that last one has only happened once or twice). I'll usually ask why I got a job offer in cases where I pretty clearly failed the technical interview, so I have a collection of these reasons from folks.

    The reason it's arguably zero is that the only software interview where I inarguably got a "real" interview and was coming in cold was at Google, but that only happened because the interviewers that were assigned interviewed me for the wrong ladder -- I was interviewing for a hardware position, but I was being interviewed by software folks, so I got what was basically a standard software interview except that one interviewer asked me some questions about state machines and cache coherence (or something like that). After they realized that they'd interviewed me for the wrong ladder, I had a follow-up phone interview from a hardware engineer to make sure I wasn't totally faking having worked at a hardware startup from 2005 to 2013. It's possible that I failed the software part of the interview and was basically hired on the strength of the follow-up phone screen.

    Note that this refers only to software -- I'm actually pretty good at hardware interviews. At this point, I'm pretty out of practice at hardware and would probably need a fair amount of time to ramp up on an actual hardware job, but the interviews are a piece of cake for me. One person who knows me pretty well thinks this is because I "talk like a hardware engineer" and both say things that make hardware folks think I'm legit as well as say things that sound incredibly stupid to most programmers in a way that's more about shibboleths than actual knowledge or skills.

    [return]
  3. This one is a bit harder than you'd expect to get in a phone screen, but it wouldn't be out of line in an onsite interview (although a friend of mine once got a Google Code Jam World Finals question in a phone interview with Google, so you might get something this hard or harder, depending on who you draw as an interviewer).

    BTW, if you're wondering what my friend did when they got that question, it turns out they actually knew the answer because they'd seen and attempted the problem during Google Code Jam. They didn't get the right answer at the time, but they figured it out later just for fun. However, my friend didn't think it was reasonable to give that as a phone screen question and asked the interviewer for another question. The interviewer refused, so my friend failed the phone screen. At the time, I doubt there were more than a few hundred people in the world who would've gotten the right answer to the question in a phone screen and almost all of them probably would've realized that it was an absurd phone screen question. After failing the interview, my friend ended up looking for work for almost six months before passing an interview for a startup where he ended up building a number of core systems (in terms of both business impact and engineering difficulty). My friend is still there after the mid 10-figure IPO -- the company understands how hard it would be to replace this person and treats them very well. None of the other companies that interviewed this person wanted to hire them and they actually had a hard time getting a job.

    [return]
  4. Outside of egregious architectural issues that will simply cause a service to fall over, the most common way I see teams fix efficiency issues is to ask for more capacity. Some companies try to counterbalance this in some way (e.g., I've heard that at FB, a lot of the teams that work on efficiency improvements report into the capacity org, which gives them the ability to block capacity requests if they observe that a team has extreme inefficiencies that they refuse to fix), but I haven't personally worked in an environment where there's an effective system fix to this. Google had a system that was intended to address this problem that, among other things, involved making headcount fungible with compute resources, but I've heard that was rolled back in favor of a more traditional system for reasons. [return]

Files are fraught with peril

2019-07-12 08:00:00

This is a pseudo-transcript for a talk given at Deconstruct 2019. To make this accessible for people on slow connections as well as people using screen readers, the slides have been replaced by in-line text (the talk has ~120 slides; at an average of 20 kB per slide, that's 2.4 MB). If you think that's trivial, consider that half of Americans still aren't on broadband and the situation is much worse in developing countries.

Let's talk about files! Most developers seem to think that files are easy. Just for example, let's take a look at the top reddit r/programming comments from when Dropbox announced that they were only going to support ext4 on Linux (the most widely used Linux filesystem). For people not familiar with reddit r/programming, I suspect r/programming is the most widely read English language programming forum in the world.

The top comment reads:

I'm a bit confused, why do these applications have to support these file systems directly? Doesn't the kernel itself abstract away from having to know the lower level details of how the files themselves are stored?

The only differences I could possibly see between different file systems are file size limitations and permissions, but aren't most modern file systems about on par with each other?

The #2 comment (and the top replies going two levels down) are:

#2: Why does an application care what the filesystem is?

#2: Shouldn't that be abstracted as far as "normal apps" are concerned by the OS?

Reply: It's a leaky abstraction. I'm willing to bet each different FS has its own bugs and its own FS specific fixes in the dropbox codebase. More FS's means more testing to make sure everything works right . . .

2nd level reply: What are you talking about? This is a dropbox, what the hell does it need from the FS? There are dozenz of fssync tools, data transfer tools, distributed storage software, and everything works fine with inotify. What the hell does not work for dropbox exactly?

another 2nd level reply: Sure, but any bugs resulting from should be fixed in the respective abstraction layer, not by re-implementing the whole stack yourself. You shouldn't re-implement unless you don't get the data you need from the abstraction. . . . DropBox implementing FS-specific workarounds and quirks is way overkill. That's like vim providing keyboard-specific workarounds to avoid faulty keypresses. All abstractions are leaky - but if no one those abstractions, nothing will ever get done (and we'd have billions of "operating systems").

In this talk, we're going to look at how file systems differ from each other and other issues we might encounter when writing to files. We're going to look at the file "stack" starting at the top with the file API, which we'll see is nearly impossible to use correctly and that supporting multiple filesystems without corrupting data is much harder than supporting a single filesystem; move down to the filesystem, which we'll see has serious bugs that cause data loss and data corruption; and then we'll look at disks and see that disks can easily corrupt data at a rate five million times greater than claimed in vendor datasheets.

File API

Writing one file

Let's say we want to write a file safely, so that we don't get data corruption. For the purposes of this talk, this means we'd like our write to be "atomic" -- our write should either fully complete, or we should be able to undo the write and end up back where we started. Let's look at an example from Pillai et al., OSDI’14.

We have a file that contains the text a foo and we want to overwrite foo with bar so we end up with a bar. We're going to make a number of simplifications. For example, you should probably think of each character we're writing as a sector on disk (or, if you prefer, you can imagine we're using a hypothetical advanced NVM drive). Don't worry if you don't know what that means, I'm just pointing this out to note that this talk is going to contain many simplifications, which I'm not going to call out because we only have twenty-five minutes and the unsimplified version of this talk would probably take about three hours.

To write, we might use the pwrite syscall. This is a function provided by the operating system to let us interact with the filesystem. Our invocation of this syscall looks like:

pwrite(
  [file], 
  “bar”, // data to write
  3,     // write 3 bytes
  2)     // at offset 2

pwrite takes the file we're going to write, the data we want to write, bar, the number of bytes we want to write, 3, and the offset where we're going to start writing, 2. If you're used to using a high-level language, like Python, you might be used to an interface that looks different, but underneath the hood, when you write to a file, it's eventually going to result in a syscall like this one, which is what will actually write the data into a file.
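
For instance, in Python, a minimal sketch of the same write through the os module, which exposes thin wrappers over these syscalls (the filename here is made up):

import os

# Hypothetical file that already contains "a foo"
fd = os.open("orig", os.O_RDWR)
os.pwrite(fd, b"bar", 2)  # write 3 bytes, b"bar", starting at offset 2
os.close(fd)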

If we just call pwrite like this, we might succeed and get a bar in the output, or we might end up doing nothing and getting a foo, or we might end up with something in between, like a boo, a bor, etc.

What's happening here is that we might crash or lose power while we write. Since pwrite isn't guaranteed to be atomic, if we crash, we can end up with some fraction of the write completing, causing data corruption. One way to avoid this problem is to store an "undo log" that will let us restore corrupted data. Before we modify the file, we'll make a copy of the data that's going to be modified (into the undo log), then we'll modify the file as normal, and if nothing goes wrong, we'll delete the undo log.

If we crash while we're writing the undo log, that's fine -- we'll see that the undo log isn't complete and we know that we won't have to restore because we won't have started modifying the file yet. If we crash while we're modifying the file, that's also ok. When we try to restore from the crash, we'll see that the undo log is complete and we can use it to recover from data corruption:

creat(/d/log) // Create undo log
write(/d/log, "2,3,foo", 7) // To undo, at offset 2, write 3 bytes, "foo"
pwrite(/d/orig, “bar”, 3, 2) // Modify original file as before
unlink(/d/log) // Delete log file

If we're using ext3 or ext4, widely used Linux filesystems, and we're using the mode data=journal (we'll talk about what these modes mean later), here are some possible outcomes we could get:

d/log: "2,3,f"
d/orig: "a foo"

d/log: ""
d/orig: "a foo"

It's possible we'll crash while the log file write is in progress and we'll have an incomplete log file. In the first case above, we know that the log file isn't complete because the file says we should start at offset 2 and write 3 bytes, but only one byte, f, is specified, so the log file must be incomplete. In the second case above, we can tell the log file is incomplete because the undo log format should start with an offset and a length, but we have neither. Either way, since we know that the log file isn't complete, we know that we don't need to restore.

Another possible outcome is something like:

d/log: "2,3,foo"
d/orig: "a boo"

d/log: "2,3,foo"
d/orig: "a bar"

In the first case, the log file is complete but we crashed while writing the original file. This is fine, since the log file tells us how to restore to a known good state. In the second case, the write completed, but since the log file hasn't been deleted yet, we'll restore from the log file.
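
To make the restore side concrete, here's a minimal sketch, in Python, of what crash recovery could look like for the toy log format above (the function name and parsing here are made up for illustration): if the log parses as complete, write the saved bytes back at the saved offset; otherwise, assume the original file was never touched.

import os

def recover(log_path, orig_path):
    """Undo a partial write using a log like b"2,3,foo" (offset, length, saved bytes)."""
    try:
        with open(log_path, "rb") as f:
            log = f.read()
    except FileNotFoundError:
        return  # no log, so no write was in progress
    parts = log.split(b",", 2)
    if len(parts) != 3 or not parts[0].isdigit() or not parts[1].isdigit():
        return  # log is incomplete, so the original file hasn't been modified yet
    offset, length, saved = int(parts[0]), int(parts[1]), parts[2]
    if len(saved) != length:
        return  # saved bytes are truncated, so the log is incomplete
    fd = os.open(orig_path, os.O_RDWR)
    try:
        os.pwrite(fd, saved, offset)  # put the old bytes back
        os.fsync(fd)
    finally:
        os.close(fd)
    os.unlink(log_path)  # recovery done; discard the undo log

As we'll see in a moment, a "does the log look complete?" check like this isn't sufficient in every filesystem mode, which is why a checksum gets added to the log below.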

If we're using ext3 or ext4 with data=ordered, we might see something like:

d/log: "2,3,fo"
d/orig: "a boo"

d/log: ""
d/orig: "a bor"

With data=ordered, there's no guarantee that the write to the log file and the pwrite that modifies the original file will execute in program order. Instead, we could get

creat(/d/log) // Create undo log
pwrite(/d/orig, “bar”, 3, 2) // Modify file before writing undo log!
write(/d/log, "2,3,foo", 7) // Write undo log
unlink(/d/log) // Delete log file

To prevent this re-ordering, we can use another syscall, fsync. fsync is a barrier (prevents re-ordering) and it flushes caches (which we'll talk about later).

creat(/d/log)
write(/d/log, “2,3,foo”, 7)
fsync(/d/log) // Add fsync to prevent re-ordering
pwrite(/d/orig, “bar”, 3, 2)
fsync(/d/orig) // Add fsync to prevent re-ordering
unlink(/d/log)

This works with ext3 or ext4, data=ordered, but if we use data=writeback, we might see something like:

d/log: "2,3,WAT"
d/orig: "a boo"

Unfortunately, with data=writeback, the write to the log file isn't guaranteed to be atomic and the filesystem metadata that tracks the file length can get updated before we've finished writing the log file, which will make it look like the log file contains whatever bits happened to be on disk where the log file was created. Since the log file exists, when we try to restore after a crash, we may end up "restoring" random garbage into the original file. To prevent this, we can add a checksum (a way of making sure the file is actually valid) to the log file.

creat(/d/log)
write(/d/log,“…[✓∑],foo”,7) // Add checksum to log file to detect incomplete log file
fsync(/d/log)
pwrite(/d/orig, “bar”, 3, 2)
fsync(/d/orig)
unlink(/d/log)

This should work with data=writeback, but we could still see the following:

d/orig: "a boo"

There's no log file, even though we created the file, wrote to it, and then fsync'd it! Unfortunately, there's no guarantee that the directory will actually store the location of the file if we crash. In order to make sure we can easily find the file when we restore from a crash, we need to fsync the parent directory of the newly created log.

creat(/d/log)
write(/d/log,“…[✓∑],foo”,7)
fsync(/d/log)
fsync(/d) // fsync parent directory
pwrite(/d/orig, “bar”, 3, 2)
fsync(/d/orig)
unlink(/d/log)

There are a couple more things we should do. We should also fsync after we're done (not shown), and we need to check for errors. These syscalls can return errors and those errors need to be handled appropriately. There's at least one filesystem issue that makes this very difficult, but since that's not an API usage problem per se, we'll look at it again in the Filesystems section.
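
To give a sense of how much ceremony that adds up to, here's a sketch of the whole sequence in Python (hypothetical paths and a made-up log format; in Python, failed syscalls surface as exceptions, so "checking for errors" here mostly means not swallowing them). This illustrates the steps above; it isn't a guarantee of safety on every filesystem and mode:

import os, zlib

def safe_overwrite(dir_path, name, data, offset):
    """Overwrite bytes in dir_path/name at offset, keeping an undo log."""
    orig_path = os.path.join(dir_path, name)
    log_path = os.path.join(dir_path, "log")
    fd = os.open(orig_path, os.O_RDWR)
    try:
        old = os.pread(fd, len(data), offset)             # bytes we're about to clobber
        record = b"%d,%d,%s" % (offset, len(old), old)
        record = b"%08x," % zlib.crc32(record) + record   # checksum so a garbage log is detectable
        logfd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(logfd, record)
            os.fsync(logfd)       # log contents durable before we touch the original
        finally:
            os.close(logfd)
        dirfd = os.open(dir_path, os.O_RDONLY)
        try:
            os.fsync(dirfd)       # directory entry for the log is durable
        finally:
            os.close(dirfd)
        os.pwrite(fd, data, offset)  # modify the original file
        os.fsync(fd)                 # modification durable before we drop the log
    finally:
        os.close(fd)
    os.unlink(log_path)
    dirfd = os.open(dir_path, os.O_RDONLY)
    try:
        os.fsync(dirfd)           # make the unlink of the log durable as well
    finally:
        os.close(dirfd)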

We've now seen what we have to do to write a file safely. It might be more complicated than we'd like, but it seems doable -- if someone asks you to write a file in a self-contained way, like an interview question, and you know the appropriate rules, you can probably do it correctly. But what happens if we have to do this as a day-to-day part of our job, where we'd like to write to files safely every time we write to files in a large codebase?

API in practice

Pillai et al., OSDI’14 looked at a bunch of software that writes to files, including things we'd hope write to files safely, like databases and version control systems: Leveldb, LMDB, GDBM, HSQLDB, Sqlite, PostgreSQL, Git, Mercurial, HDFS, Zookeeper. They then wrote a static analysis tool that can find incorrect usage of the file API, things like incorrectly assuming that operations that aren't atomic are actually atomic, incorrectly assuming that operations that can be re-ordered will execute in program order, etc.

When they did this, they found that every single piece of software they tested except for SQLite in one particular mode had at least one bug. This isn't a knock on the developers of this software or on the software itself -- the programmers who work on things like Leveldb, LMDB, etc., know more about filesystems than the vast majority of programmers and the software has more rigorous tests than most software. But they still can't use files safely every time! A natural follow-up question is: why is the file API so hard to use that even experts make mistakes?

Concurrent programming is hard

There are a number of reasons for this. If you ask people "what are hard problems in programming?", you'll get answers like distributed systems, concurrent programming, security, aligning things with CSS, dates, etc.

And if we look at what mistakes cause bugs when people do concurrent programming, we see bugs come from things like "incorrectly assuming operations are atomic" and "incorrectly assuming operations will execute in program order". These things that make concurrent programming hard also make writing files safely hard -- we saw examples of both of these kinds of bugs in our first example. More generally, many of the same things that make concurrent programming hard are the same things that make writing to files safely hard, so of course we should expect that writing to files is hard!

Another property writing to files safely shares with concurrent programming is that it's easy to write code that has infrequent, non-deterministic failures. With respect to files, people will sometimes say this makes things easier ("I've never noticed data corruption", "your data is still mostly there most of the time", etc.), but if you want to write files safely because you're working on software that shouldn't corrupt data, this makes things more difficult by making it harder to tell whether your code is really correct.

API inconsistent

As we saw in our first example, even when using one filesystem, different modes may have significantly different behavior. Large parts of the file API look like this, where behavior varies across filesystems or across different modes of the same filesystem. For example, if we look at mainstream filesystems, appends are atomic, except when using ext3 or ext4 with data=writeback, or ext2 in any mode; and directory operations can't be re-ordered with respect to any other operations, except on btrfs. In theory, we should all read the POSIX spec carefully and make sure all our code is valid according to POSIX but, if they check filesystem behavior at all, people tend to code to what their filesystem does and not to some abstract spec.

If we look at one particular mode of one filesystem (ext4 with data=journal), that seems relatively possible to handle safely, but when writing for a variety of filesystems, especially when handling filesystems that are very different from ext3 and ext4, like btrfs, it becomes very difficult for people to write correct code.

Docs unclear

In our first example, we saw that we can get different behavior from using different data= modes. If we look at the manpage (manual) on what these modes mean in ext3 or ext4, we get:

journal: All data is committed into the journal prior to being written into the main filesystem.

ordered: This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal.

writeback: Data ordering is not preserved – data may be written into the main filesystem after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal filesystem integrity, however it can allow old data to appear in files after a crash and journal recovery.

If you want to know how to use your filesystem safely, and you don't already know what a journaling filesystem is, this definitely isn't going to help you. If you know what a journaling filesystem is, this will give you some hints but it's still not sufficient. It's theoretically possible to figure everything out from reading the source code, but this is pretty impractical for most people who don't already know how the filesystem works.

For English-language documentation, there's lwn.net and the Linux kernel mailing list (LKML). LWN is great, but they can't keep up with everything, so LKML is the place to go if you want something comprehensive. Here's an example of an exchange on LKML about filesystems:

Dev 1: Personally, I care about metadata consistency, and ext3 documentation suggests that journal protects its integrity. Except that it does not on broken storage devices, and you still need to run fsck there.
Dev 2: as the ext3 authors have stated many times over the years, you still need to run fsck periodically anyway.
Dev 1: Where is that documented?
Dev 2: linux-kernel mailing list archives.
FS dev: Probably from some 6-8 years ago, in e-mail postings that I made.

While the filesystem developers tend to be helpful and they write up informative responses, most people probably don't keep up with the past 6-8 years of LKML.

Performance / correctness conflict

Another issue is that the file API has an inherent conflict between performance and correctness. We noted before that fsync is a barrier (which we can use to enforce ordering) and that it flushes caches. If you've ever worked on the design of a high-performance cache, like a microprocessor cache, you'll probably find the bundling of these two things into a single primitive to be unusual. A reason this is unusual is that flushing caches has a significant performance cost and there are many cases where we want to enforce ordering without paying this performance cost. Bundling these two things into a single primitive forces us to pay the cache flush cost when we only care about ordering.

Chidambaram et al., SOSP’13 looked at the performance cost of this by modifying ext4 to add a barrier mechanism that doesn't flush caches and they found that, if they modified software appropriately and used their barrier operation where a full fsync wasn't necessary, they were able to achieve performance roughly equivalent to ext4 with cache flushing entirely disabled (which is unsafe and can lead to data corruption) without sacrificing safety. However, making your own filesystem and getting it adopted is impractical for most people writing user-level software. Some databases will bypass the filesystem entirely or almost entirely, but this is also impractical for most software.

That's the file API. Now that we've seen that it's extraordinarily difficult to use, let's look at filesystems.

Filesystem

If we want to make sure that filesystems work, one of the most basic tests we could do is to inject errors at the layer below the filesystem to see if the filesystem handles them properly. For example, on a write, we could have the disk fail to write the data and return the appropriate error. If the filesystem drops this error or doesn't handle it properly, that means we have data loss or data corruption. This is analogous to the kinds of distributed systems faults Kyle Kingsbury talked about in his distributed systems testing talk yesterday (although these kinds of errors are much more straightforward to test).

Prabhakaran et al., SOSP’05 did this and found that, for most filesystems tested, almost all write errors were dropped. The major exception to this was on ReiserFS, which did a pretty good job with all types of errors tested, but ReiserFS isn't really used today for reasons beyond the scope of this talk.

We (Wesley Aptekar-Cassels and I) looked at this again in 2017 and found that things had improved significantly. Most filesystems (other than JFS) could pass these very basic tests on error handling.

Another way to look for errors is to look at filesystem code to see if it handles internal errors correctly. Gunawi et al., FAST’08 did this and found that internal errors were dropped a significant percentage of the time. The technique they used made it difficult to tell whether functions that could return many different errors were correctly handling each one, so they also looked at calls to functions that can only return a single error. In those cases, depending on the function, errors were dropped roughly 2/3 to 3/4 of the time.

Wesley and I also looked at this again in 2017 and found significant improvement -- errors for the same functions Gunawi et al. looked at were "only" ignored 1/3 to 2/3 of the time, depending on the function.

Gunawi et al. also looked at comments near these dropped errors and found comments like "Just ignore errors at this point. There is nothing we can do except to try to keep going." (XFS) and "Error, skip block and hope for the best." (ext3).

Now we've seen that, while filesystems used to drop even the most basic errors, they now handle them correctly, but there are still code paths where errors can get dropped. For a concrete example of a case where this happens, let's look back at our first example. If we get an error on fsync, unless we have a pretty recent Linux kernel (Q2 2018-ish), there's a pretty good chance that the error will be dropped and it may even get reported to the wrong process!

On recent Linux kernels, there's a good chance the error will be reported (to the correct process, even). Wilcox, PGCon’18 notes that an error on fsync is basically unrecoverable. The details differ depending on the filesystem -- on XFS and btrfs, modified data that's still in the filesystem's cache will get thrown away and there's no way to recover. On ext4, the data isn't thrown away, but it's marked as unmodified, so the filesystem won't try to write it back to disk later, and if there's memory pressure, the data can be thrown out at any time. If you're feeling adventurous, you can try to recover the data before it gets thrown out with various tricks (e.g., by forcing the filesystem to mark it as modified again, or by writing it out to another device, which will force the filesystem to write the data out even though it's marked as unmodified), but there's no guarantee you'll be able to recover the data before it's thrown out. On Linux ZFS, it appears that there's a code path designed to do the right thing, but CPU usage spikes and the system may hang or become unusable.

In general, there isn't a good way to recover from this on Linux. Postgres, MySQL, and MongoDB (widely used databases) will crash themselves and the user is expected to restore from the last checkpoint. Most software will probably just silently lose or corrupt data. And fsync is a relatively good case -- for example, syncfs simply doesn't return errors on Linux at all, leading to silent data loss and data corruption.
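
Given all that, about the best a program can do portably is what those databases do: treat an fsync failure as fatal instead of retrying. A sketch of the general shape (this is not Postgres's actual code):

import os, sys

def fsync_or_die(fd, path):
    try:
        os.fsync(fd)
    except OSError as e:
        # Don't retry: on Linux, a later fsync can report success even though the
        # dirty data was already thrown away or marked clean. Crash here and
        # recover from the last known-good checkpoint instead.
        sys.stderr.write("fsync(%s) failed: %s; aborting\n" % (path, e))
        os._exit(1)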

BTW, when Craig Ringer first proposed that Postgres should crash on fsync error, the first response on the Postgres dev mailing list was:

Surely you jest . . . If [current behavior of fsync] is actually the case, we need to push back on this kernel brain damage

But after talking through the details, everyone agreed that crashing was the only good option. One of the many unfortunate things is that most disk errors are transient. Since the filesystem discards critical information that's necessary to proceed without data corruption on any error, transient errors that could be retried instead force software to take drastic measures.

And while we've talked about Linux, this isn't unique to Linux. Fsync error handling (and error handling in general) is broken on many different operating systems. At the time Postgres "discovered" the behavior of fsync on Linux, FreeBSD had arguably correct behavior, but OpenBSD and NetBSD behaved the same as Linux (true error status dropped, retrying causes success response, data lost). This has been fixed on OpenBSD and probably some other BSDs, but Linux still basically has the same behavior and you don't have good guarantees that this will work on any random UNIX-like OS.

Now that we've seen that, for many years, filesystems failed to handle errors in some of the most straightforward and simple cases and that there are cases that still aren't handled correctly today, let's look at disks.

Disk

Flushing

We've seen that it's easy to not realize we have to call fsync when we have to call fsync, and that even if we call fsync appropriately, bugs may prevent fsync from actually working. Rajimwale et al., DSN’11 looked into whether or not disks actually flush when you ask them to flush, assuming everything above the disk works correctly (their paper is actually mostly about something else; they just discuss this briefly at the beginning). Someone from Microsoft anonymously told them "[Some disks] do not allow the file system to force writes to disk properly" and someone from Seagate, a disk manufacturer, told them "[Some disks (though none from us)] do not allow the file system to force writes to disk properly". Bairavasundaram et al., FAST’07 also found the same thing when they looked into disk reliability.

Error rates

We've seen that filesystems sometimes don't handle disk errors correctly. If we want to know how serious this issue is, we should look at the rate at which disks emit errors. Disk datasheets will usually claim an uncorrectable bit error rate of 1e-14 for consumer HDDs (often called spinning metal or spinning rust disks), 1e-15 for enterprise HDDs, 1e-15 for consumer SSDs, and 1e-16 for enterprise SSDs. This means that, on average, we expect to see one unrecoverable data error for every 1e14 bits we read on a consumer HDD.

To get an intuition for what this means in practice, 1TB is now a pretty normal disk size. If we read a full drive once, that's 1e12 bytes, or almost 1e13 bits (technically 8e12 bits), which means we should expect to see, on average, one unrecoverable error if we buy a 1TB consumer HDD and read the entire disk ten-ish times. Nowadays, we can buy 10TB HDDs, in which case we'd expect to see an error (technically, 8/10ths of an error) on every full read of a consumer HDD.
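
The arithmetic, in code form (using the datasheet rates above; these are expectations, not what any particular drive will do):

DATASHEET_BER = {              # claimed uncorrectable bit error rate, in errors per bit read
    "consumer HDD": 1e-14,
    "enterprise HDD": 1e-15,
    "consumer SSD": 1e-15,
    "enterprise SSD": 1e-16,
}

def expected_errors(disk_bytes, bit_error_rate):
    return disk_bytes * 8 * bit_error_rate   # bits read in one full pass * errors per bit

print(expected_errors(1e12, DATASHEET_BER["consumer HDD"]))    # 1 TB, one full read: 0.08
print(expected_errors(10e12, DATASHEET_BER["consumer HDD"]))   # 10 TB, one full read: 0.8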

In practice, observed error rates are significantly higher. Narayanan et al., SYSTOR’16 (Microsoft) observed SSD error rates from 1e-11 to 6e-14, depending on the drive model. Meza et al., SIGMETRICS’15 (FB) observed even worse SSD error rates, 2e-9 to 6e-11, depending on the model of drive. A rate of 2e-9 is one error per 500 megabits, or roughly one error per 60 MB read -- somewhere between 500 thousand and 5 million times worse than stated on datasheets, depending on the class of drive.

Bit error rate is arguably a bad metric for disk drives, but this is the metric disk vendors claim, so that's what we have to compare against if we want an apples-to-apples comparison. See Bairavasundaram et al., SIGMETRICS'07, Schroeder et al., FAST'16, and others for other kinds of error rates.

One thing to note is that it's often claimed that SSDs don't have problems with corruption because they use error correcting codes (ECC), which can fix data corruption issues. "Flash banishes the specter of the unrecoverable data error", etc. What this misses is that modern high-density flash is very unreliable and needs ECC to be usable at all. Grupp et al., FAST’12 looked at error rates of the kind of flash that underlies SSDs and found error rates from 1e-1 to 1e-8. 1e-1 is one error every ten bits; 1e-8 is one error every 100 megabits.

Power loss

Another claim you'll hear is that SSDs are safe against power loss and some types of crashes because they now have "power loss protection" -- there's some mechanism in the SSDs that can hold power for long enough during an outage that the internal SSD cache can be written out safely.

Luke Leighton tested this by buying 6 SSDs that claim to have power loss protection and found that four out of the six models of drive he tested failed (every drive that wasn't an Intel drive). If we look at the details of the tests, when drives fail, it appears to be because they were used in a way that the implementor of power loss protection didn't expect (writing "too fast", although well under the rate at which the drive is capable of writing, or writing "too many" files in parallel). When a drive advertises that it has power loss protection, this appears to mean that someone spent some amount of effort implementing something that will, under some circumstances, prevent data loss or data corruption under power loss. But, as we saw in Kyle's talk yesterday on distributed systems, if you want to make sure that the mechanism actually works, you can't rely on the vendor to do rigorous or perhaps even any semi-serious testing and you have to test it yourself.

Retention

If we look at SSD datasheets, a young-ish drive (one with 90% of its write cycles remaining) will usually be specced to hold data for about ten years after a write. If we look at a worn out drive, one very close to end-of-life, it's specced to retain data for one year to three months, depending on the class of drive. I think people are often surprised to find that it's within spec for a drive to lose data three months after the data is written.

These numbers all come from datasheets and specs and, as we've seen, datasheets can be a bit optimistic. On many early SSDs, using up most or all of a drive's write cycles would cause the drive to brick itself, so you wouldn't even get the spec'd three months of data retention.

Corollaries

Now that we've seen that there are significant problems at every level of the file stack, let's look at a couple things that follow from this.

What to do?

What we should do about this is a big topic; in the time we have left, one thing we can do instead of writing to files directly is to use a database. If you want something lightweight and simple that you can use in most places you'd use a file, SQLite is pretty good. I'm not saying you should never use files -- there is a tradeoff here. But if you have an application where you'd like to reduce the rate of data corruption, consider using a database to store data instead of using files.
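
For example, with Python's built-in sqlite3 module, an update runs inside a transaction, so after a crash you should see either the old value or the new one rather than a torn write (a minimal sketch with a made-up schema and filename; SQLite's durability still ultimately depends on the filesystem and disk behaving):

import sqlite3

conn = sqlite3.connect("app.db")  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT)")
# Used as a context manager, the connection commits on success and rolls back on
# an exception, so SQLite's journaling handles the old-value-or-new-value property
# we spent the whole first section trying to get from raw files.
with conn:
    conn.execute("INSERT OR REPLACE INTO config (key, value) VALUES (?, ?)",
                 ("greeting", "a bar"))
conn.close()

The appeal is that the fsync choreography from the first section becomes someone else's well-tested problem rather than something every application reinvents.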

FS support

At the start of this talk, we looked at this Dropbox example, where most people thought that there was no reason to remove support for most Linux filesystems because filesystems are all the same. I believe their hand was forced by the way they want to store/use data, which they can only do with ext given how they're doing things (which is arguably a mis-feature), but even if that wasn't the case, perhaps you can see why software that's attempting to sync data to disk reliably and with decent performance might not want to support every single filesystem in the universe for an OS that, for their product, is relatively niche. Maybe it's worth supporting every filesystem for PR reasons and then going through the contortions necessary to avoid data corruption on a per-filesystem basis (you can try coding straight to your reading of the POSIX spec, but as we've seen, that won't save you on Linux), but the PR problem is caused by a misunderstanding.

The other comment we looked at on reddit, and also a common sentiment, is that it's not a program's job to work around bugs in libraries or the OS. But user data gets corrupted regardless of whose "fault" the bug is and, as we've seen, bugs can persist in the filesystem layer for many years. In the case of Linux, most filesystems other than ZFS seem to have decided it's correct behavior to throw away data on fsync error and also not report that the data can't be written (as opposed to FreeBSD or OpenBSD, where most filesystems will at least report an error on subsequent fsyncs if the error isn't resolved). This is arguably a bug and also arguably correct behavior, but either way, if your software doesn't take this into account, you're going to lose or corrupt data. If you want to take the stance that it's not your fault that the filesystem is corrupting data, your users are going to pay the cost for that.

FAQ

While putting this talk together, I read a bunch of different online discussions about how to write to files safely. For discussions outside of specialized communities (e.g., LKML, the Postgres mailing list, etc.), many people will drop by to say something like "why is everyone making this so complicated? You can do this very easily and completely safely with this one weird trick". Let's look at the most common "one weird trick"s from two thousand internet comments on how to write to disk safely.

Rename

The most frequently mentioned trick is to rename instead of overwriting. If you remember our single-file write example, we made a copy of the data that we wanted to overwrite before modifying the file. The trick here is to do the opposite:

  1. Make a copy of the entire file
  2. Modify the copy
  3. Rename the copy on top of the original file

This trick doesn't work. People seem to think that this is safe because the POSIX spec says that rename is atomic, but that only means rename is atomic with respect to normal operation; it doesn't mean rename is atomic on crash. This isn't just a theoretical problem: if we look at mainstream Linux filesystems, most have at least one mode where rename isn't atomic on crash. Rename also isn't guaranteed to execute in program order, as people sometimes expect.
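
For reference, here's roughly what the suggested pattern looks like when written as carefully as possible (a Python sketch; the versions people actually post usually skip the fsyncs entirely, and even with them, whether the rename is atomic on crash still depends on the filesystem and mode, which is the point):

import os

def replace_file(path, new_contents):
    """The "rename trick": write a full copy, then rename it over the original."""
    tmp = path + ".tmp"   # hypothetical temp-file naming scheme
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, new_contents)
        os.fsync(fd)      # temp file contents durable before the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)  # atomic with respect to normal operation per POSIX...
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)   # ...but atomicity/durability on crash varies by filesystem
    finally:
        os.close(dirfd)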

The most mainstream exception where rename is atomic on crash is probably btrfs, but even there, it's a bit subtle -- as noted in Bornholt et al., ASPLOS’16, rename is only atomic on crash when renaming to replace an existing file, not when renaming to create a new file. Also, Mohan et al., OSDI’18 found numerous rename atomicity bugs on btrfs, some quite old and some introduced the same year as the paper, so you may not want to rely on this without extensive testing, even if you're writing btrfs-specific code.

And even if this worked, the performance of this technique is quite poor.

Append

The second most frequently mentioned trick is to only ever append (instead of sometimes overwriting). This also doesn't work. As noted in Pillai et al., OSDI’14 and Bornholt et al., ASPLOS’16, appends don't guarantee ordering or atomicity and believing that appends are safe is the cause of some bugs.

One weird tricks

We've seen that the most commonly cited simple tricks don't work. Something I find interesting is that, in these discussions, people will drop into a discussion where it's already been explained, often in great detail, why writing to files is harder than someone might naively think, ignore all warnings and explanations and still proceed with their explanation for why it's, in fact, really easy. Even when warned that files are harder than people think, people still think they're easy!

Conclusion

In conclusion, computers don't work (but you probably already know this if you're here at Gary-conf). This talk happened to be about files, but there are many areas we could've looked into where we would've seen similar things.

One thing I'd like to note before we finish is that, IMO, the underlying problem isn't technical. If you look at what huge tech companies do (companies like FB, Amazon, MS, Google, etc.), they often handle writes to disk pretty safely. They'll make sure that they have disks where power loss protection actually works, they'll have patches to the OS and/or other instrumentation to make sure that errors get reported correctly, there will be large distributed storage groups to make sure data is replicated safely, etc. We know how to make this stuff pretty reliable. It's hard, and it takes a lot of time and effort, i.e., a lot of money, but it can be done.

If you ask someone who works on that kind of thing why they spend mind boggling sums of money to ensure (or really, increase the probability of) correctness, you'll often get an answer like "we have a zillion machines and if you do the math on the rate of data corruption, if we didn't do all of this, we'd have data corruption every minute of every day. It would be totally untenable". A huge tech company might have, what, order of ten million machines? The funny thing is, if you do the math for how many consumer machines there are out there and how much consumer software runs on unreliable disks, the math is similar. There are many more consumer machines; they're typically operated at much lighter load, but there are enough of them that, if you own a widely used piece of desktop/laptop/workstation software, the math on data corruption is pretty similar. Without "extreme" protections, we should expect to see data corruption all the time.

But if we look at how consumer software works, it's usually quite unsafe with respect to handling data. IMO, the key difference here is that when a huge tech company loses data, whether that's data on who's likely to click on which ads or user emails, the company pays the cost, directly or indirectly and the cost is large enough that it's obviously correct to spend a lot of effort to avoid data loss. But when consumers have data corruption on their own machines, they're mostly not sophisticated enough to know who's at fault, so the company can avoid taking the brunt of the blame. If we have a global optimization function, the math is the same -- of course we should put more effort into protecting data on consumer machines. But if we're a company that's locally optimizing for our own benefit, the math works out differently and maybe it's not worth it to spend a lot of effort on avoiding data corruption.

Yesterday, Ramsey Nasser gave a talk where he made a very compelling case that something was a serious problem, which was followed up by a comment that his proposed solution will have a hard time getting adoption. I agree with both parts -- he discussed an important problem, and it's not clear how solving that problem will make anyone a lot of money, so the problem is likely to go unsolved.

With GDPR, we've seen that regulation can force tech companies to protect people's privacy in a way they're not naturally inclined to do, but regulation is a very big hammer and the unintended consequences can often negate or more than negate the benefits. When we look at the history of regulations that are designed to force companies to do the right thing, we can see that it's often many years, sometimes decades, before the full impact of the regulation is understood. Designing good regulations is hard, much harder than any of the technical problems we've discussed today.

Acknowledgements

Thanks to Leah Hanson, Gary Bernhardt, Kamal Marhubi, Rebecca Isaacs, Jesse Luehrs, Tom Crayford, Wesley Aptekar-Cassels, Rose Ames, [email protected], and Benjamin Gilbert for their help with this talk!

Sorry we went so fast. If there's anything you missed you can catch it in the pseudo-transcript at danluu.com/deconstruct-files.

This "transcript" is pretty rough since I wrote it up very quickly this morning before the talk. I'll try to clean it within a few weeks, which will include adding material that was missed, inserting links, fixing typos, adding references that were missed, etc.

Thanks to Anatole Shaw, Jernej Simoncic, @junh1024, Yuri Vishnevsky, and Josh Duff for comments/corrections/discussion on this transcript.

Randomized trial on gender in Overwatch

2019-02-19 08:00:00

A recurring discussion in Overwatch (as well as other online games) is whether or not women are treated differently from men. If you do a quick search, you can find hundreds of discussions about this, some of which have well over a thousand comments. These discussions tend to go the same way and involve the same debate every time, with the same points being made on both sides. Just for example, three threads on reddit that spun out of a single post have a total of 10.4k comments. On one side, you have people saying "sure, women get trash talked, but I'm a dude and I get trash talked, everyone gets trash talked, there's no difference", "I've never seen this, it can't be real", etc., and on the other side you have people saying things like "when I play with my boyfriend, I get accused of being carried by him all the time but the reverse never happens", "people regularly tell me I should play mercy[, a character that's a female healer]", and so on and so forth. In less time than has been spent on a single large discussion, we could just run the experiment, so here it is.

This is the result of playing 339 games in the two main game modes, quick play (QP) and competitive (comp), where roughly half the games were played with a masculine name (where the username was a generic term for a man) and half were played with a feminine name (where the username was a woman's name). I recorded all of the comments made in each of the games and then classified the comments by type. Classes of comments were "sexual/gendered comments", "being told how to play", "insults", and "compliments".

In each game that's included, I decided to include the game (or not) in the experiment before the character selection screen loaded. In games that were included, I used the same character selection algorithm, I wouldn't mute anyone for spamming chat or being a jerk, I didn't speak on voice chat (although I had it enabled), I never sent friend requests, and I was playing outside of a group in order to get matched with 5 random players. When playing normally, I might choose a character I don't know how to use well and I'll mute people who pollute chat with bad comments. There are a lot of games that weren't included in the experiment because I wasn't in a mood to listen to someone rage at their team for fifteen minutes and the procedure I used involved pre-committing to not muting people who do that.

Sexual or sexually charged comments

I thought I'd see more sexual comments when using the feminine name as opposed to the masculine name, but that turned out to not be the case. There was some mention of sex, genitals, etc., in both cases and the rate wasn't obviously different and was actually higher in the masculine condition.

Zero games featured comments directed specifically at me in the masculine condition, while two games (out of 184) in the feminine condition featured comments that were directed at me. Most comments were either directed at other players or were general comments to team or game chat.

Examples of typical undirected comments that would occur in either condition include "my girlfriend keeps sexting me how do I get her to stop?", "going in balls deep", "what a surprise. *strokes dick* [during the post-game highlight]", and "support your local boobies".

The two games that featured sexual comments directed at me had the following comments:

During games not included in the experiment (I generally didn't pay attention to which username I was on when not in the experiment), I also got comments like "send nudes". Anecdotally, there appears to be a difference in the rate of these kinds of comments directed at the player, but the rate observed in the experiment is so low that uncertainty intervals around any estimates of the true rate will be similar in both conditions unless we use a strong prior.

The fact that this difference couldn't be observed in 339 games was surprising to me, although it's not inconsistent with McDaniel's thesis, a survey of women who play video games. 339 games probably sounds like a small number to serious gamers, but the only other randomized experiment I know of on this topic (besides this experiment) is Kasumovic et al., which notes that "[w]e stopped at 163 [games] as this is a substantial time effort".

All of the analysis uses the number of games in which a type of comment occurred rather than comment tone, to avoid having to code comments as having a certain tone, which could inject bias into the process. Sentiment analysis models, even state-of-the-art ones, often return nonsensical results, so this basically has to be done by hand, at least today. With much more data, some kind of sentiment analysis, done with liberal spot checking and re-training of the model, could work, but the total number of comments is so small in this case that it would amount to coding each comment by hand.

Coding comments manually in an unbiased fashion can also be done with a level of blinding, but doing that would probably require getting more people involved (since I see and hear comments while I'm playing) and relying on unpaid or poorly paid labor.

Being told how to play

The most striking, easy-to-quantify difference was the rate at which I played games in which people told me how I should play. Since it's unclear how much confidence we should have in the difference if we just look at the raw rates, we'll use a simple statistical model to get an uncertainty interval around the estimates. Since I'm not sure what my prior belief about this should be, the model uses an uninformative prior, so the estimate is close to the raw observed rate. Anyway, here are the uncertainty intervals a simple model puts on the percent of games where at least one person told me I was playing wrong, that I should change how I'm playing, or that I should switch characters:

Cond Est P25 P75
F comp 19 13 25
M comp 6 2 10
F QP 4 3 6
M QP 1 0 2

The experimental conditions in this table are masculine vs. feminine name (M/F) and competitive mode vs quick play (comp/QP). The numbers are percents. Est is the estimate, P25 is the 25%-ile estimate, and P75 is the 75%-ile estimate. Competitive mode and using a feminine name are both correlated with being told how to play. See this post by Andrew Gelman for why you might want to look at the 50% interval instead of the 95% interval.
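
For anyone curious what "simple statistical model" can look like here, numbers of this shape can come from a binomial model with a flat prior. I'm not claiming this is the exact model used for the table; it's one standard way to get an estimate plus a 50% interval (the counts below are placeholders, not the real data):

from scipy.stats import beta

def rate_interval(games_with_comment, total_games):
    # A binomial likelihood with a uniform Beta(1, 1) prior gives a
    # Beta(k + 1, n - k + 1) posterior on the underlying per-game rate.
    post = beta(games_with_comment + 1, total_games - games_with_comment + 1)
    return post.mean(), post.ppf(0.25), post.ppf(0.75)

est, p25, p75 = rate_interval(10, 52)   # placeholder counts
print("est %.0f%%, 50%% interval [%.0f%%, %.0f%%]" % (est * 100, p25 * 100, p75 * 100))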

For people not familiar with overwatch, in competitive mode, you're explicitly told what your ELO-like rating is and you get a badge that reflects your rating. In quick play, you have a rating that's tracked, but it's never directly surfaced to the user and you don't get a badge.

It's generally believed that people are more on edge during competitive play and are more likely to lash out (and, for example, tell you how you should play). The data is consistent with this common belief.

Per above, I didn't want to code tone of messages to avoid bias, so this table only indicates the rate at which people told me I was playing incorrectly or asked that I switch to a different character. The qualitative difference in experience is understated by this table. For example, the one time someone asked me to switch characters in the masculine condition, the request was a one sentence, polite, request ("hey, we're dying too quickly, could we switch [from the standard one primary healer / one off healer setup] to double primary healer or switch our tank to [a tank that can block more damage]?"). When using the feminine name, a typical case would involve 1-4 people calling me human garbage for most of the game and consoling themselves with the idea that the entire reason our team is losing is that I won't change characters.

The simple model we're using indicates that there's probably a difference between both competitive and QP and playing with a masculine vs. a feminine name. However, most published results are pretty bogus, so let's look at reasons this result might be bogus and then you can decide for yourself.

Threats to validity

The biggest issue is that this wasn't a pre-registered trial. I'm obviously not going to go and officially register a trial like this, but I also didn't informally "register" this by having this comparison in mind when I started the experiment. A problem with non-pre-registered trials is that there are a lot of degrees of freedom, both in terms of what we could look at, and in terms of the methodology we used to look at things, so it's unclear if the result is "real" or an artifact of fishing for something that looks interesting. A standard example of this is that, if you look for 100 possible effects, you're likely to find 1 that appears to be statistically significant with p = 0.01.

There are standard techniques to correct for this problem (e.g., Bonferroni correction), but I don't find these convincing because they usually don't capture all of the degrees of freedom that go into a statistical model. An example is that it's common to take a variable and discretize it into a few buckets. There are many ways to do this and you generally won't see papers talk about the impact of this or correct for it in any way, although changing how these buckets are arranged can drastically change the results of a study. Another common knob people can use to manipulate results is fitting an inappropriate curve (often a 2nd or 3rd degree polynomial when a scatterplot shows that's clearly incorrect). Another way to handle this would be to use a more complex model, but I wanted to keep this as simple as possible.
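
For concreteness, a Bonferroni correction just divides the significance threshold by the number of comparisons, which is exactly why it can't account for degrees of freedom that were never enumerated as comparisons (like how a variable was bucketed):

alpha = 0.05                      # nominal significance threshold
comparisons = 100                 # number of effects we looked at
threshold = alpha / comparisons   # 0.0005
# A p = 0.01 result clears the naive 0.05 threshold but not the corrected one
# once it's just one of 100 things we checked.
print(0.01 < alpha, 0.01 < threshold)   # True False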

If I wanted to really be convinced on this, I'd want to, at a minimum, re-run this experiment with this exact comparison in mind. As a result, this experiment would need to be replicated to provide more than a preliminary result that is, at best, weak evidence.

One other large class of problem with randomized controlled trials (RCTs) is that, despite randomization, the two arms of the experiment might be different in some way that wasn't randomized. Since Overwatch doesn't allow you to keep changing your name, this experiment was done with two different accounts and these accounts had different ratings in competitive mode. On average, the masculine account had a higher rating due to starting with a higher rating, which meant that I was playing against stronger players and having worse games on the masculine account. In the long run, this will even out, but since most games in this experiment were in QP, this didn't have time to even out in comp. As a result, I had a higher win rate as well as just generally much better games with the feminine account in comp.

With no other information, we might expect that people who are playing worse get told how to play more frequently and people who are playing better should get told how to play less frequently, which would mean that the table above understates the actual difference.

However Kasumovic et al., in a gender-based randomized trial of Halo 3, found that players who were playing poorly were more negative towards women, especially women who were playing well (there's enough statistical manipulation of the data that a statement this concise can only be roughly correct, see study for details). If that result holds, it's possible that I would've gotten fewer people telling me that I'm human garbage and need to switch characters if I was average instead of dominating most of my games in the feminine condition.

If that result generalizes to OW, that would explain something which I thought was odd, which was that a lot of demands to switch and general vitriol came during my best performances with the feminine account. A typical example of this would be a game where we have a 2-2-2 team composition (2 players playing each of the three roles in the game) where my counterpart in the same role ran into the enemy team and died at the beginning of the fight in almost every engagement. I happened to be having a good day and dominated the other team (37-2 in a ten minute comp game, while focusing on protecting our team's healers) while only dying twice, once on purpose as a sacrifice and second time after a stupid blunder. Immediately after I died, someone asked me to switch roles so they could take over for me, but at no point did someone ask the other player in my role to switch despite their total uselessness all game (for OW players this was a Rein who immediately charged into the middle of the enemy team at every opportunity, from a range where our team could not possibly support them; this was Hanamura 2CP, where it's very easy for Rein to set up situations where their team cannot help them). This kind of performance was typical of games where my team jumped on me for playing incorrectly. This isn't to say I didn't have bad games; I had plenty of bad games, but a disproportionate number of the most toxic experiences came when I was having a great game.

I tracked how well I did in games, but this sample doesn't have enough ranty games to do a meaningful statistical analysis of my performance vs. probability of getting thrown under the bus.

Games at different ratings are probably also generally different environments and get different comments, but it's not clear if there are more negative comments at 2000 than 2500 or vice versa. There are a lot of online debates about this; for any rating level other than the very lowest or the very highest ratings, you can find a lot of people who say that the rating band they're in has the highest volume of toxic comments.

Other differences

Here are some things that happened while playing with the feminine name that didn't happen with the masculine name during this experiment or in any game outside of this experiment:

  • unsolicited "friend" requests from people I had no textual or verbal interaction with (happened 7 times total, didn't track which cases were in the experiment and which weren't)
  • someone on the other team deciding that my team wasn't doing a good enough job of protecting me while I was playing healer, berating my team, and then throwing the game so that we won (happened once during the experiment)
  • someone on my team flirting with me and then flipping out when I didn't respond, then spending the rest of the game calling me autistic or toxic (this happened once during the experiment, and once while playing in a game not included in the experiment)

The rate of all these was low enough that I'd have to play many more games to observe something without a huge uncertainty interval.

I didn't accept any friend requests from people I had no interaction with. Anecdotally, some people report that others will send sexual comments or berate them after an unsolicited friend request. It's possible that the effect shown in the table would be larger if I had accepted these friend requests; it couldn't be smaller.

I didn't attempt to classify comments as flirty or not because, unlike the kinds of comments I did classify, this is often somewhat subtle and you could make a good case that any particular comment is or isn't flirting. Without responding (which I didn't do), many of these kinds of comments are ambiguous.

Another difference was in the tone of the compliments. The rate of games where I was complimented wasn't too different, but compliments under the masculine condition tended to be short and factual (e.g., someone from the other team saying "no answer for [name of character I was playing]" after a dominant game) and compliments under the feminine condition tended to be more effusive and multiple people would sometimes chime in about how great I was.

Non differences

The rate of compliments and the rate of insults in games that didn't include explanations of how I'm playing wrong or how I need to switch characters were similar in both conditions.

Other factors

Some other factors that would be interesting to look at would be time of day, server, playing solo or in a group, specific character choice, being more or less communicative, etc., but it would take a lot more data to get good estimates when adding more variables. Blizzard should have the data necessary to do analyses like this in aggregate, but they're notoriously private with their data, so someone at Blizzard would have to do the work and then publish it publicly, and they're not really in the habit of doing that kind of thing. If you work at Blizzard and are interested in letting a third party do some analysis on an anonymized data set, let me know and I'd be happy to dig in.

Experimental minutiae

Under both conditions, I avoided ever using voice chat and would call things out in text chat when time permitted. Also under both conditions, I mostly filled in with whatever character class the team needed most, although I'd sometimes pick DPS even when it wasn't needed (in general, DPS is heavily oversubscribed, so if you only pick DPS when the team needs one, you'll rarely play DPS).

For quickplay, backfill games weren't counted (backfill games are games where you join after the game started to fill in for a player who left; comp doesn't allow backfills). 6% of QP games were backfills.

These games are from before the "endorsements" patch; most games were played around May 2018. All games were played in "solo q" (with 5 random teammates). To avoid correlations between games that depend on how long a playing session was, I quit between games and waited long enough that I was unlikely to end up in a game with some or many of the same players as before.

The model used probability of a comment happening in a game to avoid the problem that Kasumovic et al. ran into, where a person who's ranting can skew the total number of comments. Kasumovic et al. addressed this by removing outliers, but I really don't like manually reaching in and removing data to adjust results. This could also be addressed by using a more sophisticated model, but a more sophisticated model means more knobs which means more ways for bias to sneak in. Using the number of players who made comments instead would be one way to mitigate this problem, but I think this still isn't ideal because these aren't independent -- when one player starts being negative, this greatly increases the odds that another player in that game will be negative, but just using the number of players makes four games with one negative person the same as one game with four negative people. This can also be accounted for with a slightly more sophisticated model, but that also involves adding more knobs to the model.

UPDATE: 98%-ile

One of the more common comments I got when I wrote this post is that it's only valid at "low" ratings, like Plat, which is 50%-ile. If someone is going to concede that a game's community is toxic at 50%-ile and you have to be significantly better than that to avoid toxic players, that seems to be conceding that the game's community is toxic.

However, to see if that's accurate, I played a bit more, in games as high as 98%-ile, to see if things improved. While there was a minor improvement, things aren't fundamentally different at 98%-ile, so people who say that things are much better at higher ranks either have very different experiences than I did or are referring to 99%-ile or above. If it's the latter, then I'd say the previous comment about conceding that the game has a toxic community holds. If it's the former, perhaps I just got unlucky, but based on other people's comments about their experiences with the game, I don't think I got particularly unlucky.

Appendix: comments / advice to overwatch players

A common complaint, perhaps the most common complaint, from people below 2000 SR (roughly 30%-ile) or perhaps 1500 SR (roughly 10%-ile), is that they're in "ELO hell" and are kept down because their teammates are too bad. Based on my experience, I find this to be extremely unlikely.

People often split skill up into "mechanics" and "gamesense". My mechanics are pretty much as bad as it's possible to get. The last game I played seriously was a 90s video game that's basically online asteroids, and the last game before that I put any time into was the original SNES Super Mario Kart. As you'd expect from someone who hasn't put significant time into a post-90s video game or any kind of FPS game, my aim and dodging are both atrocious. On top of that, I'm an old dude with slow reflexes, and I was still able to get to 2500 SR (roughly 60%-ile among players who play "competitive", likely higher among all players) by avoiding a few basic fallacies and blunders despite having approximately zero mechanical skill. If you're also an old dude with basically no FPS experience, you can do the same thing; if you have good reflexes or enough FPS experience to actually aim or dodge, you basically can't be worse mechanically than I am and you can do much better by avoiding a few basic mistakes.

The most common fallacy I see repeated is that you have to play DPS to move out of bronze or gold. The evidence people give for this is that, when a GM streamer plays flex, tank, or healer, they sometimes lose in bronze. I guess the idea is that, because the only way to ensure a 99.9% win rate in bronze is to be a GM level DPS player and play DPS, the best way to maintain a 55% or a 60% win rate is to play DPS, but this doesn't follow.

Healers and tanks are both very powerful at low ranks. Because low ranks feature both poor coordination and relatively poor aim (players with good coordination or aim tend to move up quickly), time-to-kill is very slow compared to higher ranks. As a result, an off healer can tilt the result of a 1v1 (and sometimes even a 2v1) matchup and a primary healer can often determine the result of a 2v1 matchup. Because coordination is poor, most matchups end up being 2v1 or 1v1. The flip side of the lack of coordination is that you'll almost never get help from teammates. It's common to see an enemy player walk into the middle of my team, attack someone, and then walk out while literally no one else notices. If the person being attacked is you, the other healer typically won't notice and will continue healing someone at full health, and none of the classic "peel" characters will help or even notice what's happening. That means it's on you to pay attention to your surroundings and watch flank routes to avoid getting murdered.

If you can avoid getting murdered constantly and actually try to heal (as opposed to many healers at low ranks, who will try to kill people or stick to a single character and keep healing them even if they're at full health), you'll outheal a primary healer half the time when playing an off healer and, as a primary healer, you'll usually be able to get 10k-12k healing per 10 min compared to 6k to 8k for most people in Silver (sometimes less if they're playing DPS Moira). That's like having an extra half a healer on your team, which basically makes the game 6.5v6 instead of 6v6. You can still lose a 6.5v6 game, and you'll lose plenty of games, but if you're consistently healing 50% more than a normal healer at your rank, you'll tend to move up even if you get a lot of major things wrong (heal order, healing when that only feeds the other team, etc.).

A corollary to having to watch out for yourself 95% of the time when playing a healer is that, as a character who can peel, you can actually watch out for your teammates and put your team at a significant advantage in 95% of games. As Zarya or Hog, if you just boringly play towards the front of your team, you can basically always save at least one teammate from death in a team fight, and you can often do this 2 or 3 times. Meanwhile, your counterpart on the other team is walking around looking for 1v1 matchups. If they find a good one, they'll probably kill someone, and if they don't (if they run into someone with a mobility skill or a counter like Brig or Reaper), they won't. Even in the case where they kill someone and you don't do a lot, you still provide as much value as they do and, on average, you'll provide more value. A similar thing is true of many DPS characters, although it depends on the character (e.g., McCree is effective as a peeler, at least at the low ranks I've played in). If you play a non-sniper DPS that isn't suited for peeling, you can find a DPS on your team who's looking for 1v1 fights and turn those fights into 2v1 fights (at low ranks, there's no shortage of these folks on both teams, so there are plenty of 1v1 fights you can control by making them 2v1).

All of these things I've mentioned amount to actually trying to help your team instead of going for flashy PotG setups or trying to dominate the entire team by yourself. If you say this in the abstract, it seems obvious, but most people think they're better than their rating. It doesn't help that OW is designed to make people think they're doing well when they're not and the best way to get "medals" or "play of the game" is to play in a way that severely reduces your odds of actually winning each game.

Outside of obvious gameplay mistakes, the other big thing that loses games is when someone tilts and either starts playing terribly or flips out and says something that enrages someone else on the team, who then starts playing terribly. I don't think you can do much about this directly, but you can make sure you never do it yourself, which means only 5/6ths of your team will do this at some base rate, whereas 6/6ths of the other team will. Like all of the above, this won't cause you to win all of your games, but everything you do that increases your win rate makes a difference.

Poker players have the right attitude when they talk about leaks. The goal isn't to win every hand, it's to increase your EV by avoiding bad blunders (at high levels, it's about more than avoiding bad blunders, but we're talking about getting out of below median ranks, not becoming GM here). You're going to have terrible games where you get 5 people instalocking DPS. Your odds of winning a game are low, say 10%. If you get mad and pick DPS and reduce your odds even further (say this is to 2%), all that does is create a leak in your win rate during games when your teammates are being silly.

If you gain/lose 25 rating per game for a win or a loss, your average rating change from a game is 25 * (W_rate - L_rate) = 25 * (2 * W_rate - 1). Let's say 1/40 games are these silly games where your team decides to go all DPS. The per-game SR difference of trying to win these vs. soft throwing is maybe something like 1/40 * 25 * (2 * 0.08) = 0.1. That doesn't sound like much, and these numbers are just guesses, but everyone outside of very high-level games is full of leaks like these, and they add up. And if you look at a 60% win rate, which is pretty good considering that your influence is limited because you're only one person on a 6 person team, that only translates to an average of 5 SR per game, so it doesn't actually take that many small leaks to really move your average SR gain or loss.
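To make that arithmetic concrete, here's a small R sketch; the win rates and the 1-in-40 frequency are the made-up guesses from above, not measured values:

# Average SR change per game for a given win rate, assuming +/- 25 SR per game
expected_sr <- function(win_rate, sr_per_game = 25) {
  sr_per_game * (2 * win_rate - 1)
}

p_silly <- 1 / 40  # guessed fraction of games where the team instalocks 5 DPS

# Guessed win rates: still trying to win (10%) vs. tilting and soft throwing (2%)
leak_per_game <- p_silly * (expected_sr(0.10) - expected_sr(0.02))
leak_per_game      # ~0.1 SR per game, averaged over all games

# For scale, a 60% win rate only averages out to this much SR per game
expected_sr(0.60)  # 5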

Appendix: general comments on online gaming, 20 years ago vs. today

Since I'm unlikely to write another blog post on gaming any time soon, here are some other random thoughts that won't fit with any other post. My last serious experience with online games was with a game from the 90s. Even though I'd heard that things had gotten a lot worse, I was still surprised by how much worse they were. IRL, the only time I encounter the same level and rate of pointless nastiness in a recreational activity is down at the bridge club (casual bridge games tend to be very nice). When I say pointless nastiness, I mean things like getting angry and then making nasty comments to a teammate mid-game. Even if your "criticism" is correct (and, if you review OW games or bridge hands, you'll see that these kinds of angry comments are almost never correct), it has virtually no chance of getting your partner to change their behavior and a pretty good chance of tilting them and making them play worse. If you're trying to win, there's no reason to do this and good reason to avoid it.

If you look at the online commentary for this, it's common to see people blaming kids, but this doesn't match my experience at all. For one thing, when I was playing video games in the 90s, a huge fraction of the online gaming population was made up of kids, and online game communities were nicer than they are today. Saying that "kids nowadays" are worse than kids used to be is a pastime that goes back thousands of years, but it's generally not true and there doesn't seem to be any reason to think that it's true here.

Additionally, this simply doesn't match what I saw. If I just look at comments over audio chat, there were a couple of times when some kids were nasty, but almost all of the comments were from people who sounded like adults. Moreover, if I look at when I played games that were bad, a disproportionately large number of those games were late (after 2am eastern time, on the central/east server), when the relative population of adults is larger.

And if we look at bridge, the median age of an ACBL member is in the 70s, with an increase in age of a whopping 0.4 years per year.

Sure, maybe people tend to get more mature as they age, but in any particular activity, that effect seems to be dominated by other factors. I don't have enough data at hand to make a good guess as to what happened, but I'm entertained by the idea that this might have something to do with it:

I’ve said this before, but one of the single biggest culture shocks I’ve ever received was when I was talking to someone about five years younger than I was, and she said “Wait, you play video games? I’m surprised. You seem like way too much of a nerd to play video games. Isn’t that like a fratboy jock thing?”

Appendix: FAQ

Here are some responses to the most common online comments.

Plat? You suck at Overwatch

Yep. But I sucked roughly equally on both accounts (actually somewhat more on the masculine account because it was rated higher and I was playing a bit out of my depth). Also, that's not a question.

This is just a blog post, it's not an academic study, the results are crap.

There's nothing magic about academic papers. I have my name on a few publications, including one that won best paper award at the top conference in its field. My median blog post is more rigorous than my median paper or, for that matter, the median paper that I read.

When I write a paper, I have to deal with co-authors who push for putting in false or misleading material that makes the paper look good, and my ability to push back against this has been fairly limited. On my blog, I don't have to deal with that and I can write up results that are accurate (to the best of my ability) even if that makes the result look less interesting or less likely to win an award.

Gamers have always been toxic, that's just nostalgia talking.

If I pull game logs for subspace, this seems to be false. YMMV depending on what games you played, I suppose. FWIW, airmash seems to be the modern version of subspace, and (until the game died) it was much more toxic than subspace even on a per-game basis, despite having much smaller games (25 people for a good-sized game in airmash vs. 95 for subspace).

This is totally invalid because you didn't talk on voice chat.

At the ranks I played, not talking on voice was the norm. It would be nice to have talking or not talking on voice chat as an independent variable, but that would require playing even more games to get data for another set of conditions, and if I wasn't going to do that, choosing the condition that's most common doesn't make the entire experiment invalid, IMO.

Some people report that, post "endorsements" patch, talking on voice chat is much more common. I tested this out by playing 20 (non-comp) games just after the "Paris" patch. Three had comments on voice chat. One was someone playing random music clips, one had someone screaming at someone else for playing incorrectly, and one had useful callouts on voice chat. It's possible I'd see something different with more games or in comp, but I don't think it's obvious that voice chat is common for most people after the "endorsements" patch.

Appendix: code and data

If you want to play with this data and model yourself, experiment with different priors, run a posterior predictive check, etc., here's a snippet of R code that embeds the data:

library(brms)
library(modelr)
library(tidybayes)
library(tidyverse)

d <- tribble(
  ~game_type, ~gender, ~xplain, ~games,
  "comp", "female", 7, 35,
  "comp", "male", 1, 23,
  "qp", "female", 6, 149,
  "qp", "male", 2, 132
)

d <- d %>% mutate(female = ifelse(gender == "female", 1, 0), comp = ifelse(game_type == "comp", 1, 0))


result <-
  brm(data = d, family = binomial,
      xplain | trials(games) ~ female + comp,
      prior = c(set_prior("normal(0,10)", class = "b")),
      iter = 25000, warmup = 500, cores = 4, chains = 4)

The model here is simple enough that I wouldn't expect the version of software used to significantly affect results, but in case you're curious, this was done with brms 2.7.0, rstan 2.18.2, on R 3.5.1.
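If you do want to poke at the fit, here's a minimal sketch of the kind of post-fit inspection mentioned above, using standard brms/tidybayes helpers (exact output and argument names may vary a bit across versions):

# Posterior summaries for the coefficients (on the log-odds scale)
summary(result)

# Point estimates on the odds-ratio scale
exp(fixef(result)[, "Estimate"])

# Posterior predictive check: simulated xplain counts vs. observed counts
pp_check(result)

# Per-draw posteriors via tidybayes, e.g., P(feminine-name effect > 0)
draws <- spread_draws(result, b_female, b_comp)
mean(draws$b_female > 0)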

Thanks to Leah Hanson, Sean Talts and Sean's math/stats reading group, Annie Cherkaev, Robert Schuessler, Wesley Aptekar-Cassels, Julia Evans, Paul Gowder, Jonathan Dahan, Bradley Boccuzzi, Akiva Leffert, and one or more anonymous commenters for comments/corrections/discussion.

Fsyncgate: errors on fsync are unrecoverable

2018-03-28 08:00:00

This is an archive of the original "fsyncgate" email thread. It's posted here because I wanted a link that would fit on a slide for a talk on file safety, in a mobile-friendly, non-bloated format.

From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Subject:Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date:2018-03-28 02:23:46

Hi all

Some time ago I ran into an issue where a user encountered data corruption after a storage error. PostgreSQL played a part in that corruption by allowing a checkpoint to complete after what should've been a fatal error.

TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means "all writes since the last fsync have hit disk" but we assume it means "all writes since the last SUCCESSFUL fsync have hit disk".

Pg wrote some blocks, which went to OS dirty buffers for writeback. Writeback failed due to an underlying storage error. The block I/O layer and XFS marked the writeback page as failed (AS_EIO), but had no way to tell the app about the failure. When Pg called fsync() on the FD during the next checkpoint, fsync() returned EIO because of the flagged page, to tell Pg that a previous async write failed. Pg treated the checkpoint as failed and didn't advance the redo start position in the control file.

All good so far.

But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() cleared the AS_EIO bad page flag.

The write never made it to disk, but we completed the checkpoint, and merrily carried on our way. Whoops, data loss.

The clear-error-and-continue behaviour of fsync is not documented as far as I can tell. Nor is fsync() returning EIO unless you have a very new linux man-pages with the patch I wrote to add it. But from what I can see in the POSIX standard we are not given any guarantees about what happens on fsync() failure at all, so we're probably wrong to assume that retrying fsync() is safe.

If the server had been using ext3 or ext4 with errors=remount-ro, the problem wouldn't have occurred because the first I/O error would've remounted the FS and stopped Pg from continuing. But XFS doesn't have that option. There may be other situations where this can occur too, involving LVM and/or multipath, but I haven't comprehensively dug out the details yet.

It proved possible to recover the system by faking up a backup label from before the first incorrectly-successful checkpoint, forcing redo to repeat and write the lost blocks. But ... what a mess.

I posted about the underlying fsync issue here some time ago:

https://stackoverflow.com/q/42434872/398670

but haven't had a chance to follow up about the Pg specifics.

I've been looking at the problem on and off and haven't come up with a good answer. I think we should just PANIC and let redo sort it out by repeating the failed write when it repeats work since the last checkpoint.

The API offered by async buffered writes and fsync offers us no way to find out which page failed, so we can't just selectively redo that write. I think we do know the relfilenode associated with the fd that failed to fsync, but not much more. So the alternative seems to be some sort of potentially complex online-redo scheme where we replay WAL only for the relation on which we had the fsync() error, while otherwise servicing queries normally. That's likely to be extremely error-prone and hard to test, and it's trying to solve a case where on other filesystems the whole DB would grind to a halt anyway.

I looked into whether we can solve it with use of the AIO API instead, but the mess is even worse there - from what I can tell you can't even reliably guarantee fsync at all on all Linux kernel versions.

We already PANIC on fsync() failure for WAL segments. We just need to do the same for data forks at least for EIO. This isn't as bad as it seems because AFAICS fsync only returns EIO in cases where we should be stopping the world anyway, and many FSes will do that for us.

There are rather a lot of pg_fsync() callers. While we could handle this case-by-case for each one, I'm tempted to just make pg_fsync() itself intercept EIO and PANIC. Thoughts?


From:Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date:2018-03-28 03:53:08

Craig Ringer writes:

TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

Retrying fsync() is not OK at least on Linux. When fsync() returns success it means "all writes since the last fsync have hit disk" but we assume it means "all writes since the last SUCCESSFUL fsync have hit disk".

If that's actually the case, we need to push back on this kernel brain damage, because as you're describing it fsync would be completely useless.

Moreover, POSIX is entirely clear that successful fsync means all preceding writes for the file have been completed, full stop, doesn't matter when they were issued.


From:Michael Paquier <michael(at)paquier(dot)xyz>
Date:2018-03-29 02:30:59

On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:

Craig Ringer writes:

TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-03-29 02:48:27

On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier wrote:

On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:

Craig Ringer writes:

TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.

Craig, is the phenomenon you described the same as the second issue "Reporting writeback errors" discussed in this article?

https://lwn.net/Articles/724307/

"Current kernels might report a writeback error on an fsync() call, but there are a number of ways in which that can fail to happen."

That's... I'm speechless.


From:Justin Pryzby <pryzby(at)telsasoft(dot)com>
Date:2018-03-29 05:00:31

On Thu, Mar 29, 2018 at 11:30:59AM +0900, Michael Paquier wrote:

On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:

Craig Ringer writes:

TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.

The retries are the source of the problem ; the first fsync() can return EIO, and also clears the error causing a 2nd fsync (of the same data) to return success.

(Note, I can see that it might be useful to PANIC on EIO but retry for ENOSPC).

On Thu, Mar 29, 2018 at 03:48:27PM +1300, Thomas Munro wrote:

Craig, is the phenomenon you described the same as the second issue "Reporting writeback errors" discussed in this article? https://lwn.net/Articles/724307/

Worse, the article acknowledges the behavior without apparently suggesting to change it:

"Storing that value in the file structure has an important benefit: it makes it possible to report a writeback error EXACTLY ONCE TO EVERY PROCESS THAT CALLS FSYNC() .... In current kernels, ONLY THE FIRST CALLER AFTER AN ERROR OCCURS HAS A CHANCE OF SEEING THAT ERROR INFORMATION."

I believe I reproduced the problem behavior using dmsetup "error" target, see attached.

strace looks like this:

kernel is Linux 4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

1open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
2write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
3write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
4write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
5write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
6write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
7write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
8write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 2560
9write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = -1 ENOSPC (No space left on device)
10dup(2)                                  = 4
11fcntl(4, F_GETFL)                       = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE)
12brk(NULL)                               = 0x1299000
13brk(0x12ba000)                          = 0x12ba000
14fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
15write(4, "write(1): No space left on devic"..., 34write(1): No space left on device
16) = 34
17close(4)                                = 0
18fsync(3)                                = -1 EIO (Input/output error)
19dup(2)                                  = 4
20fcntl(4, F_GETFL)                       = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE)
21fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
22write(4, "fsync(1): Input/output error\n", 29fsync(1): Input/output error
23) = 29
24close(4)                                = 0
25close(3)                                = 0
26open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
27fsync(3)                                = 0
28write(3, "\0", 1)                       = 1
29fsync(3)                                = 0
30exit_group(0)                           = ?

2: EIO isn't seen initially due to writeback page cache;

9: ENOSPC due to small device

18: original IO error reported by fsync, good

25: the original FD is closed

26: ..and file reopened

27: fsync on file with still-dirty data+EIO returns success BAD

10, 19: I'm not sure why there's dup(2), I guess glibc thinks that perror should write to a separate FD (?)

Also note, close() ALSO returned success..which you might think exonerates the 2nd fsync(), but I think may itself be problematic, no? In any case, the 2nd byte certainly never got written to DM error, and the failure status was lost following fsync().

I get the exact same behavior if I break after one write() loop, such as to avoid ENOSPC.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-03-29 05:06:22

On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby wrote:

The retries are the source of the problem ; the first fsync() can return EIO, and also clears the error causing a 2nd fsync (of the same data) to return success.

What I'm failing to grok here is how that error flag even matters, whether it's a single bit or a counter as described in that patch. If write back failed, the page is still dirty. So all future calls to fsync() need to try to try to flush it again, and (presumably) fail again (unless it happens to succeed this time around).


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-03-29 05:25:51

On 29 March 2018 at 13:06, Thomas Munro wrote:

On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby wrote:

The retries are the source of the problem ; the first fsync() can return EIO, and also clears the error causing a 2nd fsync (of the same data) to return success.

What I'm failing to grok here is how that error flag even matters, whether it's a single bit or a counter as described in that patch. If write back failed, the page is still dirty. So all future calls to fsync() need to try to try to flush it again, and (presumably) fail again (unless it happens to succeed this time around).

You'd think so. But it doesn't appear to work that way. You can see yourself with the error device-mapper destination mapped over part of a volume.

I wrote a test case here.

https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear.c

I don't pretend the kernel behaviour is sane. And it's possible I've made an error in my analysis. But since I've observed this in the wild and seen it in a test case, I strongly suspect that what I've described is just what's happening, brain-dead or no.

Presumably the kernel marks the page clean when it dispatches it to the I/O subsystem and doesn't dirty it again on I/O error? I haven't dug that deep on the kernel side. See the stackoverflow post for details on what I found in kernel code analysis.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-03-29 05:32:43

On 29 March 2018 at 10:48, Thomas Munro wrote:

On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier wrote:

On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:

Craig Ringer writes:

TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.

Craig, is the phenomenon you described the same as the second issue "Reporting writeback errors" discussed in this article?

https://lwn.net/Articles/724307/

A variant of it, by the looks.

The problem in our case is that the kernel only tells us about the error once. It then forgets about it. So yes, that seems like a variant of the statement:

"Current kernels might report a writeback error on an fsync() call, but there are a number of ways in which that can fail to happen."

That's... I'm speechless.

Yeah.

It's a bit nuts.

I was astonished when I saw the behaviour, and that it appears undocumented.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-03-29 05:35:47

On 29 March 2018 at 10:30, Michael Paquier wrote:

On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:

Craig Ringer writes:

TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.

I covered this in my original post.

Yes, we check the return value. But what do we do about it? For fsyncs of heap files, we ERROR, aborting the checkpoint. We'll retry the checkpoint later, which will retry the fsync(). Which will now appear to succeed because the kernel forgot that it lost our writes after telling us the first time. So we do check the error code, which returns success, and we complete the checkpoint and move on.

But we only retried the fsync, not the writes before the fsync.

So we lost data. Or rather, failed to detect that the kernel did so, so our checkpoint was bad and could not be completed.

The problem is that we keep retrying checkpoints without repeating the writes leading up to the checkpoint, and retrying fsync.

I don't pretend the kernel behaviour is sane, but we'd better deal with it anyway.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-03-29 05:58:45

On 28 March 2018 at 11:53, Tom Lane wrote:

Craig Ringer writes:

TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as well to avoid similar lost-page-write issues.

It's not necessary on ext3/ext4 with errors=remount-ro, but that's only because the FS stops us dead in our tracks.

I don't pretend it's sane. The kernel behaviour is IMO crazy. If it's going to lose a write, it should at minimum mark the FD as broken so no further fsync() or anything else can succeed on the FD, and an app that cares about durability must repeat the whole set of work since the prior successful fsync(). Just reporting it once and forgetting it is madness.

But even if we convince the kernel folks of that, how do other platforms behave? And how long before these kernels are out of use? We'd better deal with it, crazy or no.

Please see my StackOverflow post for the kernel-level explanation. Note also the test case link there. https://stackoverflow.com/a/42436054/398670

Retrying fsync() is not OK at least on Linux. When fsync() returns success it means "all writes since the last fsync have hit disk" but we assume it means "all writes since the last SUCCESSFUL fsync have hit disk".

If that's actually the case, we need to push back on this kernel brain damage, because as you're describing it fsync would be completely useless.

It's not useless, it's just telling us something other than what we think it means. The promise it seems to give us is that if it reports an error once, everything after that is useless, so we should throw our toys, close and reopen everything, and redo from the last known-good state.

Though as Tomas posted below, it provides rather weaker guarantees than I thought in some other areas too. See that lwn.net article he linked.

Moreover, POSIX is entirely clear that successful fsync means all preceding writes for the file have been completed, full stop, doesn't matter when they were issued.

I can't find anything that says so to me. Please quote relevant spec.

I'm working from http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html which states that

"The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. The fsync() function shall not return until the system has completed that action or until an error is detected."

My reading is that POSIX does not specify what happens AFTER an error is detected. It doesn't say that error has to be persistent and that subsequent calls must also report the error. It also says:

"If the fsync() function fails, outstanding I/O operations are not guaranteed to have been completed."

but that doesn't clarify matters much either, because it can be read to mean that once there's been an error reported for some IO operations there's no guarantee those operations are ever completed even after a subsequent fsync returns success.

I'm not seeking to defend what the kernel seems to be doing. Rather, saying that we might see similar behaviour on other platforms, crazy or not. I haven't looked past linux yet, though.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-03-29 12:07:56

On Thu, Mar 29, 2018 at 6:58 PM, Craig Ringer wrote:

On 28 March 2018 at 11:53, Tom Lane wrote:

Craig Ringer writes:

TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as well to avoid similar lost-page-write issues.

I found your discussion with kernel hacker Jeff Layton at https://lwn.net/Articles/718734/ in which he said: "The stackoverflow writeup seems to want a scheme where pages stay dirty after a writeback failure so that we can try to fsync them again. Note that that has never been the case in Linux after hard writeback failures, AFAIK, so programs should definitely not assume that behavior."

The article above that says the same thing a couple of different ways, ie that writeback failure leaves you with pages that are neither written to disk successfully nor marked dirty.

If I'm reading various articles correctly, the situation was even worse before his errseq_t stuff landed. That fixed cases of completely unreported writeback failures due to sharing of PG_error for both writeback and read errors with certain filesystems, but it doesn't address the clean pages problem.

Yeah, I see why you want to PANIC.

Moreover, POSIX is entirely clear that successful fsync means all preceding writes for the file have been completed, full stop, doesn't matter when they were issued.

I can't find anything that says so to me. Please quote relevant spec.

I'm working from http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html which states that

"The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. The fsync() function shall not return until the system has completed that action or until an error is detected."

My reading is that POSIX does not specify what happens AFTER an error is detected. It doesn't say that error has to be persistent and that subsequent calls must also report the error. It also says:

FWIW my reading is the same as Tom's. It says "all data for the open file descriptor" without qualification or special treatment after errors. Not "some".

I'm not seeking to defend what the kernel seems to be doing. Rather, saying that we might see similar behaviour on other platforms, crazy or not. I haven't looked past linux yet, though.

I see no reason to think that any other operating system would behave that way without strong evidence... This is openly acknowledged to be "a mess" and "a surprise" in the Filesystem Summit article. I am not really qualified to comment, but from a cursory glance at FreeBSD's vfs_bio.c I think it's doing what you'd hope for... see the code near the comment "Failed write, redirty."


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-03-29 13:15:10

On 29 March 2018 at 20:07, Thomas Munro wrote:

On Thu, Mar 29, 2018 at 6:58 PM, Craig Ringer wrote:

On 28 March 2018 at 11:53, Tom Lane wrote:

Craig Ringer writes:

TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as well to avoid similar lost-page-write issues.

I found your discussion with kernel hacker Jeff Layton at https://lwn.net/Articles/718734/ in which he said: "The stackoverflow writeup seems to want a scheme where pages stay dirty after a writeback failure so that we can try to fsync them again. Note that that has never been the case in Linux after hard writeback failures, AFAIK, so programs should definitely not assume that behavior."

The article above that says the same thing a couple of different ways, ie that writeback failure leaves you with pages that are neither written to disk successfully nor marked dirty.

If I'm reading various articles correctly, the situation was even worse before his errseq_t stuff landed. That fixed cases of completely unreported writeback failures due to sharing of PG_error for both writeback and read errors with certain filesystems, but it doesn't address the clean pages problem.

Yeah, I see why you want to PANIC.

In more ways than one ;)

I'm not seeking to defend what the kernel seems to be doing. Rather, saying

that we might see similar behaviour on other platforms, crazy or not. I haven't looked past linux yet, though.

I see no reason to think that any other operating system would behave that way without strong evidence... This is openly acknowledged to be "a mess" and "a surprise" in the Filesystem Summit article. I am not really qualified to comment, but from a cursory glance at FreeBSD's vfs_bio.c I think it's doing what you'd hope for... see the code near the comment "Failed write, redirty."

Ok, that's reassuring, but doesn't help us on the platform the great majority of users deploy on :(

"If on Linux, PANIC"

Hrm.


From:Catalin Iacob <iacobcatalin(at)gmail(dot)com>
Date:2018-03-29 16:20:00

On Thu, Mar 29, 2018 at 2:07 PM, Thomas Munro wrote:

I found your discussion with kernel hacker Jeff Layton at https://lwn.net/Articles/718734/ in which he said: "The stackoverflow writeup seems to want a scheme where pages stay dirty after a writeback failure so that we can try to fsync them again. Note that that has never been the case in Linux after hard writeback failures, AFAIK, so programs should definitely not assume that behavior."

And a bit below in the same comments, to this question about PG: "So, what are the options at this point? The assumption was that we can repeat the fsync (which as you point out is not the case), or shut down the database and perform recovery from WAL", the same Jeff Layton seems to agree PANIC is the appropriate response: "Replaying the WAL synchronously sounds like the simplest approach when you get an error on fsync. These are uncommon occurrences for the most part, so having to fall back to slow, synchronous error recovery modes when this occurs is probably what you want to do.". And right after, he confirms the errseq_t patches are about always detecting this, not more: "The main thing I working on is to better guarantee is that you actually get an error when this occurs rather than silently corrupting your data. The circumstances where that can occur require some corner-cases, but I think we need to make sure that it doesn't occur."

Jeff's comments in the pull request that merged errseq_t are worth reading as well: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750

The article above that says the same thing a couple of different ways, ie that writeback failure leaves you with pages that are neither written to disk successfully nor marked dirty.

If I'm reading various articles correctly, the situation was even worse before his errseq_t stuff landed. That fixed cases of completely unreported writeback failures due to sharing of PG_error for both writeback and read errors with certain filesystems, but it doesn't address the clean pages problem.

Indeed, that's exactly how I read it as well (opinion formed independently before reading your sentence above). The errseq_t patches landed in v4.13 by the way, so very recently.

Yeah, I see why you want to PANIC.

Indeed. Even doing that leaves question marks about all the kernel versions before v4.13, which at this point is pretty much everything out there, not even detecting this reliably. This is messy.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-03-29 21:18:14

On Fri, Mar 30, 2018 at 5:20 AM, Catalin Iacob wrote:

Jeff's comments in the pull request that merged errseq_t are worth reading as well: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750

Wow. It looks like there may be a separate question of when each filesystem adopted this new infrastructure?

Yeah, I see why you want to PANIC.

Indeed. Even doing that leaves question marks about all the kernel versions before v4.13, which at this point is pretty much everything out there, not even detecting this reliably. This is messy.

The pre-errseq_t problems are beyond our control. There's nothing we can do about that in userspace (except perhaps abandon OS-buffered IO, a big project). We just need to be aware that this problem exists in certain kernel versions and be grateful to Layton for fixing it.

The dropped dirty flag problem is something we can and in my view should do something about, whatever we might think about that design choice. As Andrew Gierth pointed out to me in an off-list chat about this, by the time you've reached this state, both PostgreSQL's buffer and the kernel's buffer are clean and might be reused for another block at any time, so your data might be gone from the known universe -- we don't even have the option to rewrite our buffers in general. Recovery is the only option.

Thank you to Craig for chasing this down and +1 for his proposal, on Linux only.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-03-31 13:24:28

On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:

Yeah, I see why you want to PANIC.

Indeed. Even doing that leaves question marks about all the kernel versions before v4.13, which at this point is pretty much everything out there, not even detecting this reliably. This is messy.

There may still be a way to reliably detect this on older kernel versions from userspace, but it will be messy whatsoever. On EIO errors, the kernel will not restore the dirty page flags, but it will flip the error flags on the failed pages. One could mmap() the file in question, obtain the PFNs (via /proc/pid/pagemap) and enumerate those to match the ones with the error flag switched on (via /proc/kpageflags). This could serve at least as a detection mechanism, but one could also further use this info to logically map the pages that failed IO back to the original file offsets, and potentially retry IO just for those file ranges that cover the failed pages. Just an idea, not tested.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-03-31 16:13:09

On 31 March 2018 at 21:24, Anthony Iliopoulos wrote:

On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:

Yeah, I see why you want to PANIC.

Indeed. Even doing that leaves question marks about all the kernel versions before v4.13, which at this point is pretty much everything out there, not even detecting this reliably. This is messy.

There may still be a way to reliably detect this on older kernel versions from userspace, but it will be messy whatsoever. On EIO errors, the kernel will not restore the dirty page flags, but it will flip the error flags on the failed pages. One could mmap() the file in question, obtain the PFNs (via /proc/pid/pagemap) and enumerate those to match the ones with the error flag switched on (via /proc/kpageflags). This could serve at least as a detection mechanism, but one could also further use this info to logically map the pages that failed IO back to the original file offsets, and potentially retry IO just for those file ranges that cover the failed pages. Just an idea, not tested.

That sounds like a huge amount of complexity, with uncertainty as to how it'll behave kernel-to-kernel, for negligible benefit.

I was exploring the idea of doing selective recovery of one relfilenode, based on the assumption that we know the filenode related to the fd that failed to fsync(). We could redo only WAL on that relation. But it fails the same test: it's too complex for a niche case that shouldn't happen in the first place, so it'll probably have bugs, or grow bugs in bitrot over time.

Remember, if you're on ext4 with errors=remount-ro, you get shut down even harder than a PANIC. So we should just use the big hammer here.

I'll send a patch this week.


From:Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date:2018-03-31 16:38:12

Craig Ringer writes:

So we should just use the big hammer here.

And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.


From:Michael Paquier <michael(at)paquier(dot)xyz>
Date:2018-04-01 00:20:38

On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:

Craig Ringer writes:

So we should just use the big hammer here.

And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.

That won't fix anything released already, so as per the information gathered something has to be done anyway. The discussion of this thread is spreading quite a lot actually.

Handling things at a low-level looks like a better plan for the backend. Tools like pg_basebackup and pg_dump also issue fsync's on the data created, we should do an equivalent for them, with some exit() calls in file_utils.c. As of now failures are logged to stderr but not considered fatal.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-01 00:58:22

On Sun, Apr 01, 2018 at 12:13:09AM +0800, Craig Ringer wrote:

On 31 March 2018 at 21:24, Anthony Iliopoulos <[1]ailiop(at)altatus(dot)com> wrote:

 On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:

 > >> Yeah, I see why you want to PANIC.
 > >
 > > Indeed. Even doing that leaves question marks about all the kernel
 > > versions before v4.13, which at this point is pretty much everything
 > > out there, not even detecting this reliably. This is messy.
 There may still be a way to reliably detect this on older kernel
 versions from userspace, but it will be messy whatsoever. On EIO
 errors, the kernel will not restore the dirty page flags, but it
 will flip the error flags on the failed pages. One could mmap()
 the file in question, obtain the PFNs (via /proc/pid/pagemap)
 and enumerate those to match the ones with the error flag switched
 on (via /proc/kpageflags). This could serve at least as a detection
 mechanism, but one could also further use this info to logically
 map the pages that failed IO back to the original file offsets,
 and potentially retry IO just for those file ranges that cover
 the failed pages. Just an idea, not tested.

That sounds like a huge amount of complexity, with uncertainty as to how it'll behave kernel-to-kernel, for negligible benefit.

Those interfaces have been around since the kernel 2.6 times and are rather stable, but I was merely responding to your original post comment regarding having a way of finding out which page(s) failed. I assume that indeed there would be no benefit, especially since those errors are usually not transient (typically they come from hard medium faults), and although a filesystem could theoretically mask the error by allocating a different logical block, I am not aware of any implementation that currently does that.

I was exploring the idea of doing selective recovery of one relfilenode, based on the assumption that we know the filenode related to the fd that failed to fsync(). We could redo only WAL on that relation. But it fails the same test: it's too complex for a niche case that shouldn't happen in the first place, so it'll probably have bugs, or grow bugs in bitrot over time.

Fully agree, those cases should be sufficiently rare that a complex and possibly non-maintainable solution is not really warranted.

Remember, if you're on ext4 with errors=remount-ro, you get shut down even harder than a PANIC. So we should just use the big hammer here.

I am not entirely sure what you mean here, does Pg really treat write() errors as fatal? Also, the kind of errors that ext4 detects with this option is at the superblock level and govern metadata rather than actual data writes (recall that those are buffered anyway, no actual device IO has to take place at the time of write()).


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-01 01:14:46

On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:

Craig Ringer writes:

So we should just use the big hammer here.

And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.

It is not likely to be fixed (beyond what has been done already with the manpage patches and errseq_t fixes on the reporting level). The issue is, the kernel needs to deal with hard IO errors at that level somehow, and since those errors typically persist, re-dirtying the pages would not really solve the problem (unless some filesystem remaps the request to a different block, assuming the device is alive). Keeping around dirty pages that cannot possibly be written out is essentially a memory leak, as those pages would stay around even after the application has exited.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-01 18:24:51

On Fri, Mar 30, 2018 at 10:18 AM, Thomas Munro wrote:

... on Linux only.

Apparently I was too optimistic. I had looked only at FreeBSD, which keeps the page around and dirties it so we can retry, but the other BSDs apparently don't (FreeBSD changed that in 1999). From what I can tell from the sources below, we have:

Linux, OpenBSD, NetBSD: retrying fsync() after EIO lies
FreeBSD, Illumos: retrying fsync() after EIO tells the truth

Maybe my drive-by assessment of those kernel routines is wrong and someone will correct me, but I'm starting to think you might be better to assume the worst on all systems. Perhaps a GUC that defaults to panicking, so that users on those rare OSes could turn that off? Even then I'm not sure if the failure mode will be that great anyway or if it's worth having two behaviours. Thoughts?

http://mail-index.netbsd.org/netbsd-users/2018/03/30/msg020576.html https://github.com/NetBSD/src/blob/trunk/sys/kern/vfs_bio.c#L1059 https://github.com/openbsd/src/blob/master/sys/kern/vfs_bio.c#L867 https://github.com/freebsd/freebsd/blob/master/sys/kern/vfs_bio.c#L2631 https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266 https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/os/bio.c#L441


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-02 15:03:42

On 2 April 2018 at 02:24, Thomas Munro wrote:

Maybe my drive-by assessment of those kernel routines is wrong and someone will correct me, but I'm starting to think you might be better to assume the worst on all systems. Perhaps a GUC that defaults to panicking, so that users on those rare OSes could turn that off? Even then I'm not sure if the failure mode will be that great anyway or if it's worth having two behaviours. Thoughts?

I see little benefit to not just PANICing unconditionally on EIO, really. It shouldn't happen, and if it does, we want to be pretty conservative and adopt a data-protective approach.

I'm rather more worried by doing it on ENOSPC. Which looks like it might be necessary from what I recall finding in my test case + kernel code reading. I really don't want to respond to a possibly-transient ENOSPC by PANICing the whole server unnecessarily.

BTW, the support team at 2ndQ is presently working on two separate issues where ENOSPC resulted in DB corruption, though neither of them involve logs of lost page writes. I'm planning on taking some time tomorrow to write a torture tester for Pg's ENOSPC handling and to verify ENOSPC handling in the test case I linked to in my original StackOverflow post.

If this is just an EIO issue then I see no point doing anything other than PANICing unconditionally.

If it's a concern for ENOSPC too, we should try harder to fail more nicely whenever we possibly can.
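
As a rough sketch of what the "big hammer" amounts to, in plain C rather than PostgreSQL source (the function name is made up): never retry a failed fsync(), because a later retry may falsely report success; crash instead and rely on WAL replay.

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Hedged sketch of the "big hammer": never retry a failed fsync(), since
     * on Linux a retry can report success even though the dirty pages were
     * dropped. Crash instead and recover from the WAL. */
    void fsync_or_panic(int fd, const char *path)
    {
        if (fsync(fd) != 0)
        {
            fprintf(stderr, "PANIC: could not fsync \"%s\": %s\n",
                    path, strerror(errno));
            abort();    /* stand-in for PostgreSQL's PANIC plus crash recovery */
        }
    }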


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-02 18:13:46

Hi,

On 2018-04-01 03:14:46 +0200, Anthony Iliopoulos wrote:

On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:

Craig Ringer writes:

So we should just use the big hammer here.

And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.

It is not likely to be fixed (beyond what has been done already with the manpage patches and errseq_t fixes on the reporting level). The issue is, the kernel needs to deal with hard IO errors at that level somehow, and since those errors typically persist, re-dirtying the pages would not really solve the problem (unless some filesystem remaps the request to a different block, assuming the device is alive).

Throwing away the dirty pages and persisting the error seems a lot more reasonable. Then provide an fcntl (or whatever) extension that can clear the error status in the few cases where an application wants to gracefully deal with the case.

Keeping around dirty pages that cannot possibly be written out is essentially a memory leak, as those pages would stay around even after the application has exited.

Why do dirty pages need to be kept around in the case of persistent errors? I don't think the lack of automatic recovery in that case is what anybody is complaining about. It's that the error goes away and there's no reasonable way to separate out such an error from some potential transient errors.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-02 18:53:20

On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:

Hi,

On 2018-04-01 03:14:46 +0200, Anthony Iliopoulos wrote:

On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:

Craig Ringer writes:

So we should just use the big hammer here.

And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.

It is not likely to be fixed (beyond what has been done already with the manpage patches and errseq_t fixes on the reporting level). The issue is, the kernel needs to deal with hard IO errors at that level somehow, and since those errors typically persist, re-dirtying the pages would not really solve the problem (unless some filesystem remaps the request to a different block, assuming the device is alive).

Throwing away the dirty pages and persisting the error seems a lot more reasonable. Then provide an fcntl (or whatever) extension that can clear the error status in the few cases where an application wants to gracefully deal with the case.

Given precisely that the dirty pages which cannot be written out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).

The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that. Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.
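
A small harness that makes this check-and-clear behaviour observable (file name is illustrative; on a healthy disk both calls simply succeed, and the interesting case needs a device that fails writeback):

    /* Sketch of the "report once, then clear" behaviour described above. On a
     * device that fails writeback, Linux >= 4.13 reports EIO on the first
     * fsync() and then typically returns success on the second, even though
     * the data never reached stable storage. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "data.tmp";
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        (void) write(fd, "x", 1);

        int r1 = fsync(fd);
        printf("first  fsync: %d (%s)\n", r1, r1 ? strerror(errno) : "ok");

        int r2 = fsync(fd);    /* no new writes in between */
        printf("second fsync: %d (%s)\n", r2, r2 ? strerror(errno) : "ok");

        close(fd);
        return 0;
    }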


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-02 19:32:45

On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:

On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:

Throwing away the dirty pages and persisting the error seems a lot more reasonable. Then provide an fcntl (or whatever) extension that can clear the error status in the few cases where an application wants to gracefully deal with the case.

Given precisely that the dirty pages which cannot be written out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).

Meh^2.

"no reason" - except that there's absolutely no way to know what state the data is in. And that your application needs explicit handling of such failures. And that one FD might be used in a lots of different parts of the application, that fsyncs in one part of the application might be an ok failure, and in another not. Requiring explicit actions to acknowledge "we've thrown away your data for unknown reason" seems entirely reasonable.

The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that.

Which isn't what I've suggested.

Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.

Meh.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-02 20:38:06

On Mon, Apr 02, 2018 at 12:32:45PM -0700, Andres Freund wrote:

On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:

On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:

Throwing away the dirty pages and persisting the error seems a lot more reasonable. Then provide an fcntl (or whatever) extension that can clear the error status in the few cases where an application wants to gracefully deal with the case.

Given precisely that the dirty pages which cannot be written out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).

Meh^2.

"no reason" - except that there's absolutely no way to know what state the data is in. And that your application needs explicit handling of such failures. And that one FD might be used in a lots of different parts of the application, that fsyncs in one part of the application might be an ok failure, and in another not. Requiring explicit actions to acknowledge "we've thrown away your data for unknown reason" seems entirely reasonable.

As long as fsync() indicates error on first invocation, the application is fully aware that between this point of time and the last call to fsync() data has been lost. Persisting this error any further does not change this or add any new info - on the contrary it adds confusion as subsequent write()s and fsync()s on other pages can succeed, but will be reported as failures.

The application will need to deal with that first error irrespective of subsequent return codes from fsync(). Conceptually every fsync() invocation demarcates an epoch for which it reports potential errors, so the caller needs to take responsibility for that particular epoch.

Callers that are not affected by the potential outcome of fsync() and do not react on errors, have no reason for calling it in the first place (and thus masking failure from subsequent callers that may indeed care).


From:Stephen Frost <sfrost(at)snowman(dot)net>
Date:2018-04-02 20:58:08

Greetings,

Anthony Iliopoulos (ailiop(at)altatus(dot)com) wrote:

On Mon, Apr 02, 2018 at 12:32:45PM -0700, Andres Freund wrote:

On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:

On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:

Throwing away the dirty pages and persisting the error seems a lot more reasonable. Then provide an fcntl (or whatever) extension that can clear the error status in the few cases where an application wants to gracefully deal with the case.

Given precisely that the dirty pages which cannot be written out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).

Meh^2.

"no reason" - except that there's absolutely no way to know what state the data is in. And that your application needs explicit handling of such failures. And that one FD might be used in a lots of different parts of the application, that fsyncs in one part of the application might be an ok failure, and in another not. Requiring explicit actions to acknowledge "we've thrown away your data for unknown reason" seems entirely reasonable.

As long as fsync() indicates error on first invocation, the application is fully aware that between this point of time and the last call to fsync() data has been lost. Persisting this error any further does not change this or add any new info - on the contrary it adds confusion as subsequent write()s and fsync()s on other pages can succeed, but will be reported as failures.

fsync() doesn't reflect the status of given pages, however, it reflects the status of the file descriptor, and as such the file, on which it's called. This notion that fsync() is actually only responsible for the changes which were made to a file since the last fsync() call is pure foolishness. If we were able to pass a list of pages or data ranges to fsync() for it to verify they're on disk then perhaps things would be different, but we can't, all we can do is ask to "please flush all the dirty pages associated with this file descriptor, which represents this file we opened, to disk, and let us know if you were successful."

Give us a way to ask "are these specific pages written out to persistent storage?" and we would certainly be happy to use it, and to repeatedly try to flush out pages which weren't synced to disk due to some transient error, and to track those cases and make sure that we don't incorrectly assume that they've been transferred to persistent storage.

The application will need to deal with that first error irrespective of subsequent return codes from fsync(). Conceptually every fsync() invocation demarcates an epoch for which it reports potential errors, so the caller needs to take responsibility for that particular epoch.

We do deal with that error- by realizing that it failed and later retrying the fsync(), which is when we get back an "all good! everything with this file descriptor you've opened is sync'd!" and happily expect that to be truth, when, in reality, it's an unfortunate lie and there are still pages associated with that file descriptor which are, in reality, dirty and not sync'd to disk.

Consider two independent programs where the first one writes to a file and then calls the second one whose job it is to go out and fsync(), perhaps async from the first, those files. Is the second program supposed to go write to each page that the first one wrote to, in order to ensure that all the dirty bits are set so that the fsync() will actually return if all the dirty pages are written?

Callers that are not affected by the potential outcome of fsync() and do not react on errors, have no reason for calling it in the first place (and thus masking failure from subsequent callers that may indeed care).

Reacting on an error from an fsync() call could, based on how it's documented and actually implemented in other OS's, mean "run another fsync() to see if the error has resolved itself." Requiring that to mean "you have to go dirty all of the pages you previously dirtied to actually get a subsequent fsync() to do anything" is really just not reasonable- a given program may have no idea what was written to previously nor any particular reason to need to know, on the expectation that the fsync() call will flush any dirty pages, as it's documented to do.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-02 23:05:44

Hi Stephen,

On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote:

fsync() doesn't reflect the status of given pages, however, it reflects the status of the file descriptor, and as such the file, on which it's called. This notion that fsync() is actually only responsible for the changes which were made to a file since the last fsync() call is pure foolishness. If we were able to pass a list of pages or data ranges to fsync() for it to verify they're on disk then perhaps things would be different, but we can't, all we can do is ask to "please flush all the dirty pages associated with this file descriptor, which represents this file we opened, to disk, and let us know if you were successful."

Give us a way to ask "are these specific pages written out to persistent storage?" and we would certainly be happy to use it, and to repeatedly try to flush out pages which weren't synced to disk due to some transient error, and to track those cases and make sure that we don't incorrectly assume that they've been transferred to persistent storage.

Indeed fsync() is simply a rather blunt instrument and a narrow legacy interface but further changing its established semantics (no matter how unreasonable they may be) is probably not the way to go.

Would using sync_file_range() be helpful? Potential errors would only apply to pages that cover the requested file ranges. There are a few caveats though:

(a) it still messes with the top-level error reporting so mixing it with callers that use fsync() and do care about errors will produce the same issue (clearing the error status).

(b) the error-reporting granularity is coarse (failure reporting applies to the entire requested range so you still don't know which particular pages/file sub-ranges failed writeback)

(c) the same "report and forget" semantics apply to repeated invocations of the sync_file_range() call, so again action will need to be taken upon first error encountered for the particular ranges.

The application will need to deal with that first error irrespective of subsequent return codes from fsync(). Conceptually every fsync() invocation demarcates an epoch for which it reports potential errors, so the caller needs to take responsibility for that particular epoch.

We do deal with that error- by realizing that it failed and later retrying the fsync(), which is when we get back an "all good! everything with this file descriptor you've opened is sync'd!" and happily expect that to be truth, when, in reality, it's an unfortunate lie and there are still pages associated with that file descriptor which are, in reality, dirty and not sync'd to disk.

It really turns out that this is not how the fsync() semantics work though, exactly because of the nature of the errors: even if the kernel retained the dirty bits on the failed pages, retrying persisting them on the same disk location would simply fail. Instead the kernel opts for marking those pages clean (since there is no other recovery strategy), and reporting once to the caller who can potentially deal with it in some manner. It is sadly a bad and undocumented convention.

Consider two independent programs where the first one writes to a file and then calls the second one whose job it is to go out and fsync(), perhaps async from the first, those files. Is the second program supposed to go write to each page that the first one wrote to, in order to ensure that all the dirty bits are set so that the fsync() will actually return if all the dirty pages are written?

I think what you have in mind are the semantics of sync() rather than fsync(), but as long as an application needs to ensure data are persisted to storage, it needs to retain those data in its heap until fsync() is successful instead of discarding them and relying on the kernel after write(). The pattern should be roughly like: write() -> fsync() -> free(), rather than write() -> free() -> fsync(). For example, if a partition gets full upon fsync(), then the application has a chance to persist the data in a different location, while the kernel cannot possibly make this decision and recover.
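
A minimal sketch of that write() -> fsync() -> free() ordering, assuming a made-up helper name; on failure the caller still owns the buffer and can try persisting it elsewhere:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Write buf to path and free it only after fsync() succeeds; on failure
     * the caller still owns the data and can try another location. Real code
     * would also handle short writes and fsync the containing directory. */
    int persist_or_keep(const char *path, char *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
        {
            int saved = errno;
            close(fd);
            errno = saved;
            return -1;      /* data NOT freed: caller may retry elsewhere */
        }
        close(fd);
        free(buf);          /* write() -> fsync() -> free() is now complete */
        return 0;
    }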

Callers that are not affected by the potential outcome of fsync() and do not react on errors, have no reason for calling it in the first place (and thus masking failure from subsequent callers that may indeed care).

Reacting on an error from an fsync() call could, based on how it's documented and actually implemented in other OS's, mean "run another fsync() to see if the error has resolved itself." Requiring that to mean "you have to go dirty all of the pages you previously dirtied to actually get a subsequent fsync() to do anything" is really just not reasonable- a given program may have no idea what was written to previously nor any particular reason to need to know, on the expectation that the fsync() call will flush any dirty pages, as it's documented to do.

I think we are conflating a few issues here: having the OS kernel being responsible for error recovery (so that subsequent fsync() would fix the problems) is one. This clearly is a design which most kernels have not really adopted for reasons outlined above (although having the FS layer recovering from hard errors transparently is open for discussion from what it seems [1]). Now, there is the issue of granularity of error reporting: userspace could benefit from a fine-grained indication of failed pages (or file ranges). Another issue is that of reporting semantics (report and clear), which is also a design choice made to avoid having higher-resolution error tracking and the corresponding memory overheads [1].

[1] https://lwn.net/Articles/718734/


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-02 23:23:24

On 2018-04-03 01:05:44 +0200, Anthony Iliopoulos wrote:

Would using sync_file_range() be helpful? Potential errors would only apply to pages that cover the requested file ranges. There are a few caveats though:

To quote sync_file_range(2):

   Warning
       This system call is extremely dangerous and should not be used in portable programs. None of these operations writes out the
       file's metadata. Therefore, unless the application is strictly performing overwrites of already-instantiated disk blocks, there
       are no guarantees that the data will be available after a crash. There is no user interface to know if a write is purely an
       overwrite. On filesystems using copy-on-write semantics (e.g., btrfs) an overwrite of existing allocated blocks is impossible.
       When writing into preallocated space, many filesystems also require calls into the block allocator, which this system call does
       not sync out to disk. This system call does not flush disk write caches and thus does not provide any data integrity on systems
       with volatile disk write caches.

Given the lack of metadata safety that seems entirely a no go. We use sfr(2), but only to force the kernel's hand around writing back earlier without throwing away cache contents.
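
For reference, the writeback-hint usage being described looks roughly like this on Linux (illustrative wrapper, not PostgreSQL source); the flag is the real SYNC_FILE_RANGE_WRITE, and a later fsync() is still needed for durability:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>

    /* Ask the kernel to start writing back a range now, without waiting and
     * without treating this as a durability guarantee; a later fsync() is
     * still needed for metadata and disk write caches. */
    void hint_writeback(int fd, off_t offset, off_t nbytes)
    {
        if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) != 0)
            perror("sync_file_range");    /* non-fatal: it is only a hint here */
    }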

The application will need to deal with that first error irrespective of subsequent return codes from fsync(). Conceptually every fsync() invocation demarcates an epoch for which it reports potential errors, so the caller needs to take responsibility for that particular epoch.

We do deal with that error- by realizing that it failed and later retrying the fsync(), which is when we get back an "all good! everything with this file descriptor you've opened is sync'd!" and happily expect that to be truth, when, in reality, it's an unfortunate lie and there are still pages associated with that file descriptor which are, in reality, dirty and not sync'd to disk.

It really turns out that this is not how the fsync() semantics work though

Except on freebsd and solaris, and perhaps others.

, exactly because of the nature of the errors: even if the kernel retained the dirty bits on the failed pages, retrying persisting them on the same disk location would simply fail.

That's not guaranteed at all, think NFS.

Instead the kernel opts for marking those pages clean (since there is no other recovery strategy), and reporting once to the caller who can potentially deal with it in some manner. It is sadly a bad and undocumented convention.

It's broken behaviour justified post facto with the only rationale that was available, which explains why it's so unconvincing. You could just say "this ship has sailed, and it's too onerous to change because xxx" and this'd be a done deal. But claiming this is reasonable behaviour is ridiculous.

Again, you could just continue to error for this fd and still throw away the data.

Consider two independent programs where the first one writes to a file and then calls the second one whose job it is to go out and fsync(), perhaps async from the first, those files. Is the second program supposed to go write to each page that the first one wrote to, in order to ensure that all the dirty bits are set so that the fsync() will actually return if all the dirty pages are written?

I think what you have in mind are the semantics of sync() rather than fsync()

If you open the same file with two fds, and write with one, and fsync with another that's definitely supposed to work. And sync() isn't a realistic replacement in any sort of way because it's obviously systemwide, and thus entirely and completely unsuitable. Nor does it have any sort of better error reporting behaviour, does it?
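
The two-descriptor case in miniature (illustrative file name): dirty pages are created through one fd, and the flush, along with any resulting error, is requested through another fd on the same file.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd_writer = open("shared.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        int fd_syncer = open("shared.dat", O_WRONLY);
        if (fd_writer < 0 || fd_syncer < 0) { perror("open"); return 1; }

        (void) write(fd_writer, "payload", 7);    /* dirty pages via one fd */

        if (fsync(fd_syncer) != 0)                /* flush via a different fd; an
                                                     error for those writes should
                                                     be reported here */
            perror("fsync");

        close(fd_writer);
        close(fd_syncer);
        return 0;
    }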


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-02 23:27:35

On 3 April 2018 at 07:05, Anthony Iliopoulos wrote:

Hi Stephen,

On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote:

fsync() doesn't reflect the status of given pages, however, it reflects the status of the file descriptor, and as such the file, on which it's called. This notion that fsync() is actually only responsible for the changes which were made to a file since the last fsync() call is pure foolishness. If we were able to pass a list of pages or data ranges to fsync() for it to verify they're on disk then perhaps things would be different, but we can't, all we can do is ask to "please flush all the dirty pages associated with this file descriptor, which represents this file we opened, to disk, and let us know if you were successful."

Give us a way to ask "are these specific pages written out to persistent storage?" and we would certainly be happy to use it, and to repeatedly try to flush out pages which weren't synced to disk due to some transient error, and to track those cases and make sure that we don't incorrectly assume that they've been transferred to persistent storage.

Indeed fsync() is simply a rather blunt instrument and a narrow legacy interface but further changing its established semantics (no matter how unreasonable they may be) is probably not the way to go.

They're undocumented and extremely surprising semantics that are arguably a violation of the POSIX spec for fsync(), or at least a surprising interpretation of it.

So I don't buy this argument.

It really turns out that this is not how the fsync() semantics work though, exactly because of the nature of the errors: even if the kernel retained the dirty bits on the failed pages, retrying persisting them on the same disk location would simply fail.

might simply fail.

It depends on why the error occurred.

I originally identified this behaviour on a multipath system. Multipath defaults to "throw the writes away, nobody really cares anyway" on error. It seems to figure a higher level will retry, or the application will receive the error and retry.

(See no_path_retry in multipath config. AFAICS the default is insanely dangerous and only suitable for specialist apps that understand the quirks; you should use no_path_retry=queue).
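
For reference, the recommended setting is a one-line change in multipath.conf; treat this as a sketch, since the exact section and defaults vary across multipath-tools versions:

    defaults {
        no_path_retry    queue
    }

Here queue tells multipathd to hold I/O until a path comes back rather than failing writes upward.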

Instead the kernel opts for marking those pages clean (since there is no other recovery strategy),

and reporting once to the caller who can potentially deal with it in some manner. It is sadly a bad and undocumented convention.

It could mark the FD.

It's not just undocumented, it's a slightly creative interpretation of the POSIX spec for fsync.

Consider two independent programs where the first one writes to a file and then calls the second one whose job it is to go out and fsync(), perhaps async from the first, those files. Is the second program supposed to go write to each page that the first one wrote to, in order to ensure that all the dirty bits are set so that the fsync() will actually return if all the dirty pages are written?

I think what you have in mind are the semantics of sync() rather than fsync(), but as long as an application needs to ensure data are persisted to storage, it needs to retain those data in its heap until fsync() is successful instead of discarding them and relying on the kernel after write().

This is almost exactly what we tell application authors using PostgreSQL: the data isn't written until you receive a successful commit confirmation, so you'd better not forget it.

We provide applications with clear boundaries so they can know exactly what was, and was not, written. I guess the argument from the kernel is that the same is true: whatever was written since the last successful fsync is potentially lost and must be redone.

But the fsync behaviour is utterly undocumented and dubiously standard.

I think we are conflating a few issues here: having the OS kernel being responsible for error recovery (so that subsequent fsync() would fix the problems) is one. This clearly is a design which most kernels have not really adopted for reasons outlined above

[citation needed]

What do other major platforms do here? The post above suggests it's a bit of a mix of behaviours.

Now, there is the issue of granularity of error reporting: userspace could benefit from a fine-grained indication of failed pages (or file ranges).

Yep. I looked at AIO in the hopes that, if we used AIO, we'd be able to map a sync failure back to an individual AIO write.

But it seems AIO just adds more problems and fixes none. Flush behaviour with AIO from what I can tell is inconsistent version to version and generally unhelpful. The kernel should really report such sync failures back to the app on its AIO write mapping, but it seems nothing of the sort happens.


From:Christophe Pettus <xof(at)thebuild(dot)com>
Date:2018-04-03 00:03:39

On Apr 2, 2018, at 16:27, Craig Ringer wrote:

They're undocumented and extremely surprising semantics that are arguably a violation of the POSIX spec for fsync(), or at least a surprising interpretation of it.

Even accepting that (I personally go with surprising over violation, as if my vote counted), it is highly unlikely that we will convince every kernel team to declare "What fools we've been!" and push a change... and even if they did, PostgreSQL can look forward to many years of running on kernels with the broken semantics. Given that, I think the PANIC option is the soundest one, as unappetizing as it is.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-03 00:05:09

On April 2, 2018 5:03:39 PM PDT, Christophe Pettus wrote:

On Apr 2, 2018, at 16:27, Craig Ringer wrote:

They're undocumented and extremely surprising semantics that are arguably a violation of the POSIX spec for fsync(), or at least a surprising interpretation of it.

Even accepting that (I personally go with surprising over violation, as if my vote counted), it is highly unlikely that we will convince every kernel team to declare "What fools we've been!" and push a change... and even if they did, PostgreSQL can look forward to many years of running on kernels with the broken semantics. Given that, I think the PANIC option is the soundest one, as unappetizing as it is.

Don't we pretty much already have agreement in that? And Craig is the main proponent of it?


From:Christophe Pettus <xof(at)thebuild(dot)com>
Date:2018-04-03 00:07:41

On Apr 2, 2018, at 17:05, Andres Freund wrote:

Don't we pretty much already have agreement in that? And Craig is the main proponent of it?

For sure on the second sentence; the first was not clear to me.


From:Peter Geoghegan <pg(at)bowt(dot)ie>
Date:2018-04-03 00:48:00

On Mon, Apr 2, 2018 at 5:05 PM, Andres Freund wrote:

Even accepting that (I personally go with surprising over violation, as if my vote counted), it is highly unlikely that we will convince every kernel team to declare "What fools we've been!" and push a change... and even if they did, PostgreSQL can look forward to many years of running on kernels with the broken semantics. Given that, I think the PANIC option is the soundest one, as unappetizing as it is.

Don't we pretty much already have agreement in that? And Craig is the main proponent of it?

I wonder how bad it will be in practice if we PANIC. Craig said "This isn't as bad as it seems because AFAICS fsync only returns EIO in cases where we should be stopping the world anyway, and many FSes will do that for us". It would be nice to get more information on that.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-03 01:29:28

On Tue, Apr 3, 2018 at 3:03 AM, Craig Ringer wrote:

I see little benefit to not just PANICing unconditionally on EIO, really. It shouldn't happen, and if it does, we want to be pretty conservative and adopt a data-protective approach.

I'm rather more worried by doing it on ENOSPC. Which looks like it might be necessary from what I recall finding in my test case + kernel code reading. I really don't want to respond to a possibly-transient ENOSPC by PANICing the whole server unnecessarily.

Yeah, it'd be nice to give an administrator the chance to free up some disk space after ENOSPC is reported, and stay up. Running out of space really shouldn't take down the database without warning! The question is whether the data remains in cache and marked dirty, so that retrying is a safe option (since it's potentially gone from our own buffers, so if the OS doesn't have it the only place your committed data can definitely still be found is the WAL... recovery time). Who can tell us? Do we need a per-filesystem answer? Delayed allocation is a somewhat filesystem-specific thing, so maybe. Interestingly, there don't seem to be many operating systems that can report ENOSPC from fsync(), based on a quick scan through some documentation:

POSIX, AIX, HP-UX, FreeBSD, OpenBSD, NetBSD: no
Illumos/Solaris, Linux, macOS: yes

I don't know if macOS really means it or not; it just tells you to see the errors for read(2) and write(2). By the way, speaking of macOS, I was curious to see if the common BSD heritage would show here. Yeah, somewhat. It doesn't appear to keep buffers on writeback error, if this is the right code [1].

[1] https://github.com/apple/darwin-xnu/blob/master/bsd/vfs/vfs_bio.c#L2695


From:Robert Haas <robertmhaas(at)gmail(dot)com>
Date:2018-04-03 02:54:26

On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos wrote:

Given precisely that the dirty pages which cannot be written out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).

Like other people here, I think this is 100% unreasonable, starting with "the dirty pages which cannot be written out are practically thrown away". Who decided that was OK, and on the basis of what wording in what specification? I think it's always unreasonable to throw away the user's data. If the writes are going to fail, then let them keep on failing every time. That wouldn't cause any data loss, because we'd never be able to checkpoint, and eventually the user would have to kill the server uncleanly, and that would trigger recovery.

Also, this really does make it impossible to write reliable programs. Imagine that, while the server is running, somebody runs a program which opens a file in the data directory, calls fsync() on it, and closes it. If the fsync() fails, postgres is now borked and has no way of being aware of the problem. If we knew, we could PANIC, but we'll never find out, because the unrelated process ate the error. This is exactly the sort of ill-considered behavior that makes fcntl() locking nearly useless.
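
The "unrelated process eats the error" hazard, in miniature (hypothetical helper program): depending on kernel version and which descriptors were open when writeback failed, this innocuous-looking program can be the one that receives, and thereby clears, the error.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* An "error eater": open a data file, fsync it, close it. Depending on
     * kernel version and which descriptors were open when writeback failed,
     * this process may be the one that receives (and thereby clears) the
     * error, so the database that wrote the data never finds out. */
    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 2; }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        if (fsync(fd) != 0)
            perror("fsync");    /* the error is reported here, and only here */
        close(fd);
        return 0;
    }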

Even leaving that aside, a PANIC means a prolonged outage on a prolonged system - it could easily take tens of minutes or longer to run recovery. So saying "oh, just do that" is not really an answer. Sure, we can do it, but it's like trying to lose weight by intentionally eating a tapeworm. Now, it's possible to shorten the checkpoint_timeout so that recovery runs faster, but then performance drops because data has to be fsync()'d more often instead of getting buffered in the OS cache for the maximum possible time. We could also dodge this issue in another way: suppose that when we write a page out, we don't consider it really written until fsync() succeeds. Then we wouldn't need to PANIC if an fsync() fails; we could just re-write the page. Unfortunately, this would also be terrible for performance, for pretty much the same reasons: letting the OS cache absorb lots of dirty blocks and do write-combining is necessary for good performance.

The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that. Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.

I might accept this argument if I accepted that it was OK to decide that an fsync() failure means you can forget that the write() ever happened in the first place, but it's hard to imagine an application that wants that behavior. If the application didn't care about whether the bytes really got to disk or not, it would not have called fsync() in the first place. If it does care, reporting the error only once is never an improvement.


From:Peter Geoghegan <pg(at)bowt(dot)ie>
Date:2018-04-03 03:45:30

On Mon, Apr 2, 2018 at 7:54 PM, Robert Haas wrote:

Also, this really does make it impossible to write reliable programs. Imagine that, while the server is running, somebody runs a program which opens a file in the data directory, calls fsync() on it, and closes it. If the fsync() fails, postgres is now borked and has no way of being aware of the problem. If we knew, we could PANIC, but we'll never find out, because the unrelated process ate the error. This is exactly the sort of ill-considered behavior that makes fcntl() locking nearly useless.

I fear that the conventional wisdom from the Kernel people is now "you should be using O_DIRECT for granular control". The LWN article Thomas linked (https://lwn.net/Articles/718734) cites Ted Ts'o:

"Monakhov asked why a counter was needed; Layton said it was to handle multiple overlapping writebacks. Effectively, the counter would record whether a writeback had failed since the file was opened or since the last fsync(). Ts'o said that should be fine; applications that want more information should use O_DIRECT. For most applications, knowledge that an error occurred somewhere in the file is all that is necessary; applications that require better granularity already use O_DIRECT."


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-03 10:35:39

Hi Robert,

On Mon, Apr 02, 2018 at 10:54:26PM -0400, Robert Haas wrote:

On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos wrote:

Given precisely that the dirty pages which cannot be written out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).

Like other people here, I think this is 100% unreasonable, starting with "the dirty pages which cannot be written out are practically thrown away". Who decided that was OK, and on the basis of what wording in what specification? I think it's always unreasonable to

If you insist on strict conformance to POSIX, indeed the linux glibc configuration and associated manpage are probably wrong in stating that _POSIX_SYNCHRONIZED_IO is supported. The implementation matches that of the flexibility allowed by not supporting SIO. There's a long history of brokenness between linux and posix, and I think there was never an intention of conforming to the standard.

throw away the user's data. If the writes are going to fail, then let them keep on failing every time. That wouldn't cause any data loss, because we'd never be able to checkpoint, and eventually the user would have to kill the server uncleanly, and that would trigger recovery.

I believe (as tried to explain earlier) there is a certain assumption being made that the writer and original owner of data is responsible for dealing with potential errors in order to avoid data loss (which should be only of interest to the original writer anyway). It would be very questionable for the interface to persist the error while subsequent writes and fsyncs to different offsets may as well go through. Another process may need to write into the file and fsync; being unaware of those newly introduced semantics, it is now faced with EIO because some unrelated previous process failed some earlier writes and did not bother to clear the error for those writes. In a similar scenario where the second process is aware of the new semantics, it would naturally go ahead and clear the global error in order to proceed with its own write()+fsync(), which would essentially amount to the same problematic semantics you have now.

Also, this really does make it impossible to write reliable programs. Imagine that, while the server is running, somebody runs a program which opens a file in the data directory, calls fsync() on it, and closes it. If the fsync() fails, postgres is now borked and has no way of being aware of the problem. If we knew, we could PANIC, but we'll never find out, because the unrelated process ate the error. This is exactly the sort of ill-considered behavior that makes fcntl() locking nearly useless.

Fully agree, and the errseq_t fixes have dealt exactly with the issue of making sure that the error is reported to all file descriptors that happen to be open at the time of error. But I think one would have a hard time defending a modification to the kernel where this is further extended to cover cases where:

process A does write() on some file offset which fails writeback, fsync() gets EIO and exit()s.

process B does write() on some other offset which succeeds writeback, but fsync() gets EIO due to (uncleared) failures of earlier process.

This would be a highly user-visible change of semantics from edge-triggered to level-triggered behavior.
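
The A/B scenario above as a runnable shape (illustrative file name; on a healthy disk both fsync() calls simply succeed, and the interesting question only arises on a device that fails writeback of A's pages):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        if (fork() == 0)                          /* process A */
        {
            int fd = open("shared.dat", O_WRONLY | O_CREAT, 0644);
            if (fd >= 0)
            {
                (void) pwrite(fd, "AAAA", 4, 0);  /* assume this writeback fails */
                if (fsync(fd) != 0)
                    perror("A: fsync");
                close(fd);
            }
            _exit(0);
        }
        wait(NULL);

        int fd = open("shared.dat", O_WRONLY);    /* process B (the parent) */
        if (fd < 0) { perror("open"); return 1; }
        (void) pwrite(fd, "BBBB", 4, 1 << 20);    /* a disjoint offset */
        if (fsync(fd) != 0)
            perror("B: fsync");    /* should B also see A's earlier failure? */
        close(fd);
        return 0;
    }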

dodge this issue in another way: suppose that when we write a page out, we don't consider it really written until fsync() succeeds. Then

That's the only way to think about fsync() guarantees unless you are on a kernel that keeps retrying to persist dirty pages. Assuming such a model, after repeated and unrecoverable hard failures the process would have to explicitly inform the kernel to drop the dirty pages. All the process could do at that point is read back to userspace the dirty/failed pages and attempt to rewrite them at a different place (which is currently possible too). Most applications would not bother though to inform the kernel and drop the permanently failed pages; and thus someone eventually would hit the case that a large number of failed writeback pages are running his server out of memory, at which point people will complain that those semantics are completely unreasonable.

we wouldn't need to PANIC if an fsync() fails; we could just re-write the page. Unfortunately, this would also be terrible for performance, for pretty much the same reasons: letting the OS cache absorb lots of dirty blocks and do write-combining is necessary for good performance.

Not sure I understand this case. The application may indeed re-write a bunch of pages that have failed and proceed with fsync(). The kernel will deal with combining the writeback of all the re-written pages. But further the necessity of combining for performance really depends on the exact storage medium. At the point you start caring about write-combining, the kernel community will naturally redirect you to use DIRECT_IO.
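
For comparison, the direct I/O route being pointed to looks roughly like this (illustrative file name and block size; real code must match the device's alignment requirements):

    #define _GNU_SOURCE             /* for O_DIRECT on Linux */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("direct.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT requires aligned buffers, offsets, and lengths. */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }
        memset(buf, 'x', 4096);

        if (write(fd, buf, 4096) != 4096)         /* an I/O error surfaces here */
            fprintf(stderr, "direct write failed: %s\n", strerror(errno));
        else if (fsync(fd) != 0)                  /* still needed for metadata */
            perror("fsync");

        free(buf);
        close(fd);
        return 0;
    }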

The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that. Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.

I might accept this argument if I accepted that it was OK to decide that an fsync() failure means you can forget that the write() ever happened in the first place, but it's hard to imagine an application that wants that behavior. If the application didn't care about whether the bytes really got to disk or not, it would not have called fsync() in the first place. If it does care, reporting the error only once is never an improvement.

Again, conflating two separate issues, that of buffering and retrying failed pages and that of error reporting. Yes it would be convenient for applications not to have to care at all about recovery of failed write-backs, but at some point they would have to face this issue one way or another (I am assuming we are always talking about hard failures, other kinds of failures are probably already being dealt with transparently at the kernel level).

As for the reporting, it is also unreasonable to effectively signal and persist an error on a file-wide granularity while it pertains to subsets of that file and other writes can go through, but I am repeating myself.

I suppose that if the check-and-clear semantics are problematic for Pg, one could suggest a kernel patch that opts-in to a level-triggered reporting of fsync() on a per-descriptor basis, which seems to be non-intrusive and probably sufficient to cover your expected use-case.


From:Greg Stark <stark(at)mit(dot)edu>
Date:2018-04-03 11:26:05

On 3 April 2018 at 11:35, Anthony Iliopoulos wrote:

Hi Robert,

Fully agree, and the errseq_t fixes have dealt exactly with the issue of making sure that the error is reported to all file descriptors that happen to be open at the time of error. But I think one would have a hard time defending a modification to the kernel where this is further extended to cover cases where:

process A does write() on some file offset which fails writeback, fsync() gets EIO and exit()s.

process B does write() on some other offset which succeeds writeback, but fsync() gets EIO due to (uncleared) failures of earlier process.

Surely that's exactly what process B would want? If it calls fsync and gets a success and later finds out that the file is corrupt and didn't match what was in memory it's not going to be happy.

This seems like an attempt to co-opt fsync for a new and different purpose for which it's poorly designed. It's not an async error reporting mechanism for writes. It would be useless for that, as any process could come along and open your file and eat the errors for writes you performed. An async error reporting mechanism would have to document which writes it was giving errors for and give you ways to control that.

The semantics described here are useless for everyone. For a program needing to know the error status of the writes it executed, it doesn't know which writes are included in which fsync call. For a program using fsync for its original intended purpose of guaranteeing that all writes are synced to disk, it no longer has any guarantee at all.

This would be a highly user-visible change of semantics from edge-triggered to level-triggered behavior.

It was always documented as level-triggered. This edge-triggered concept is a complete surprise to application writers.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-03 13:36:47

On Tue, Apr 03, 2018 at 12:26:05PM +0100, Greg Stark wrote:

On 3 April 2018 at 11:35, Anthony Iliopoulos wrote:

Hi Robert,

Fully agree, and the errseq_t fixes have dealt exactly with the issue of making sure that the error is reported to all file descriptors that happen to be open at the time of error. But I think one would have a hard time defending a modification to the kernel where this is further extended to cover cases where:

process A does write() on some file offset which fails writeback, fsync() gets EIO and exit()s.

process B does write() on some other offset which succeeds writeback, but fsync() gets EIO due to (uncleared) failures of earlier process.

Surely that's exactly what process B would want? If it calls fsync and gets a success and later finds out that the file is corrupt and didn't match what was in memory it's not going to be happy.

You can't possibly make this assumption. Process B may be reading and writing to completely disjoint regions from those of process A, and as such not really caring about earlier failures, only wanting to ensure its own writes go all the way through. But even if it did care, the file interfaces make no transactional guarantees. Even without fsync() there is nothing preventing process B from reading dirty pages from process A and, based on their content, proceeding to do its own business and writing/persisting new data, while process A further modifies the not-yet-flushed pages in-memory before flushing. In this case you'd need explicit synchronization/locking between the processes anyway, so why would fsync() be an exception?

This seems like an attempt to co-opt fsync for a new and different purpose for which it's poorly designed. It's not an async error reporting mechanism for writes. It would be useless for that, as any process could come along and open your file and eat the errors for writes you performed. An async error reporting mechanism would have to document which writes it was giving errors for and give you ways to control that.

The errseq_t fixes deal with that; errors will be reported to any process that has an open fd, irrespective to who is the actual caller of the fsync() that may have induced errors. This is anyway required as the kernel may evict dirty pages on its own by doing writeback and as such there needs to be a way to report errors on all open fds.

The semantics described here are useless for everyone. For a program needing to know the error status of the writes it executed, it doesn't know which writes are included in which fsync call. For a program

If EIO persists between invocations until explicitly cleared, a process cannot possibly make any decision as to if it should clear the error and proceed or some other process will need to leverage that without coordination, or which writes actually failed for that matter. We would be back to the case of requiring explicit synchronization between processes that care about this, in which case the processes may as well synchronize over calling fsync() in the first place.

Having an opt-in persisting EIO per-fd would practically be a form of "contract" between "cooperating" processes anyway.

But instead of deconstructing and debating the semantics of the current mechanism, why not come up with the ideal desired form of error reporting/tracking granularity etc., and see how this may be fitted into kernels as a new interface.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-03 14:29:10

On 3 April 2018 at 10:54, Robert Haas wrote:

I think it's always unreasonable to throw away the user's data.

Well, we do that. If a txn aborts, all writes in the txn are discarded.

I think that's perfectly reasonable. Though we also promise an all or nothing effect, we make exceptions even there.

The FS doesn't offer transactional semantics, but the fsync behaviour can be interpreted kind of similarly.

I don't agree with it, but I don't think it's as wholly unreasonable as all that. I think leaving it undocumented is absolutely gobsmacking, and it's dubious at best, but it's not totally insane.

If the writes are going to fail, then let them keep on failing every time.

Like we do, where we require an explicit rollback.

But POSIX may pose issues there, it doesn't really define any interface for that AFAIK. Unless you expect the app to close() and re-open() the file. Replacing one nonstandard issue with another may not be a win.

That wouldn't cause any data loss, because we'd never be able to checkpoint, and eventually the user would have to kill the server uncleanly, and that would trigger recovery.

Yep. That's what I expected to happen on unrecoverable I/O errors. Because, y'know, unrecoverable.

I was stunned to learn it's not so. And I'm even more amazed to learn that ext4's errors=remount-ro apparently doesn't concern itself with mere user data, and may exhibit the same behaviour - I need to rerun my test case on it tomorrow.

Also, this really does make it impossible to write reliable programs.

In the presence of multiple apps interacting on the same file, yes. I think that's a little bit of a stretch though.

For a single app, you can recover by remembering and redoing all the writes you did.

Sucks if your app wants to have multiple processes working together on a file without some kind of journal or WAL, relying on fsync() alone, mind you. But at least we have WAL.

Hrm. I wonder how this interacts with wal_level=minimal.

Even leaving that aside, a PANIC means a prolonged outage on a prolonged system - it could easily take tens of minutes or longer to run recovery. So saying "oh, just do that" is not really an answer. Sure, we can do it, but it's like trying to lose weight by intentionally eating a tapeworm. Now, it's possible to shorten the checkpoint_timeout so that recovery runs faster, but then performance drops because data has to be fsync()'d more often instead of getting buffered in the OS cache for the maximum possible time.

It's also spikier. Users have more issues with latency with short, frequent checkpoints.

We could also dodge this issue in another way: suppose that when we write a page out, we don't consider it really written until fsync() succeeds. Then we wouldn't need to PANIC if an fsync() fails; we could just re-write the page. Unfortunately, this would also be terrible for performance, for pretty much the same reasons: letting the OS cache absorb lots of dirty blocks and do write-combining is necessary for good performance.

Our double-caching is already plenty bad enough anyway, as well.

(Ideally I want to be able to swap buffers between shared_buffers and the OS buffer-cache. Almost like a 2nd level of buffer pinning. When we write out a block, we transfer ownership to the OS. Yeah, I'm dreaming. But we'd sure need to be able to trust the OS not to just forget the block then!)

The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that. Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.

I might accept this argument if I accepted that it was OK to decide that an fsync() failure means you can forget that the write() ever happened in the first place, but it's hard to imagine an application that wants that behavior. If the application didn't care about whether the bytes really got to disk or not, it would not have called fsync() in the first place. If it does care, reporting the error only once is never an improvement.

Many RDBMSes do just that. It's hardly behaviour unique to the kernel. They report an ERROR on a statement in a txn then go on with life, merrily forgetting that anything was ever wrong.

I agree with PostgreSQL's stance that this is wrong. We require an explicit rollback (or ROLLBACK TO SAVEPOINT) to restore the session to a usable state. This is good.

But we're the odd one out there. Almost everyone else does much like what fsync() does on Linux, report the error and forget it.

In any case, we're not going to get anyone to backpatch a fix for this into all kernels, so we're stuck working around it.

I'll do some testing with ENOSPC tomorrow, propose a patch, report back.


From:Greg Stark <stark(at)mit(dot)edu>
Date:2018-04-03 14:37:30

On 3 April 2018 at 14:36, Anthony Iliopoulos wrote:

If EIO persists between invocations until explicitly cleared, a process cannot possibly make any decision as to if it should clear the error

I still don't understand what "clear the error" means here. The writes still haven't been written out. We don't care about tracking errors, we just care whether all the writes to the file have been flushed to disk. By "clear the error" you mean throw away the dirty pages and revert part of the file to some old data? Why would anyone ever want that?

But instead of deconstructing and debating the semantics of the current mechanism, why not come up with the ideal desired form of error reporting/tracking granularity etc., and see how this may be fitted into kernels as a new interface.

Because Postgres is portable software that won't be able to use some Linux-specific interface. And doesn't really need any granular error reporting system anyways. It just needs to know when all writes have been synced to disk.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-03 16:52:07

On Tue, Apr 03, 2018 at 03:37:30PM +0100, Greg Stark wrote:

On 3 April 2018 at 14:36, Anthony Iliopoulos wrote:

If EIO persists between invocations until explicitly cleared, a process cannot possibly make any decision as to if it should clear the error

I still don't understand what "clear the error" means here. The writes still haven't been written out. We don't care about tracking errors, we just care whether all the writes to the file have been flushed to disk. By "clear the error" you mean throw away the dirty pages and revert part of the file to some old data? Why would anyone ever want that?

It means that the responsibility of recovering the data is passed back to the application. The writes may never be able to be written out. How would a kernel deal with that? Either discard the data (and have the writer acknowledge) or buffer the data until reboot and simply risk going OOM. It's not what someone would want, but rather something they need to deal with, one way or the other. At least at the application level there's a fighting chance of restoring to a consistent state. The kernel does not have that opportunity.

But instead of deconstructing and debating the semantics of the current mechanism, why not come up with the ideal desired form of error reporting/tracking granularity etc., and see how this may be fitted into kernels as a new interface.

Because Postgres is portable software that won't be able to use some Linux-specific interface. And doesn't really need any granular error

I don't really follow this argument; Pg is admittedly using non-portable interfaces (e.g. sync_file_range()). While it's nice to avoid platform-specific hacks, expecting that the POSIX semantics will be consistent across systems is simply a 90's pipe dream. While it would be lovely to have really consistent interfaces for application writers, this is simply not going to happen any time soon.

And since those problematic semantics of fsync() appear to be prevalent in other systems as well that are not likely to be changed, you cannot rely on preconception that once buffers are handed over to kernel you have a guarantee that they will be eventually persisted no matter what. (Why even bother having fsync() in that case? The kernel would eventually evict and writeback dirty pages anyway. The point of reporting the error back to the application is to give it a chance to recover - the kernel could repeat "fsync()" itself internally if this would solve anything).

reporting system anyways. It just needs to know when all writes have been synced to disk.

Well, it does know when some writes have not been synced to disk, exactly because the responsibility is passed back to the application. I do realize this puts more burden back on the application, but what would a viable alternative be? Would you rather have a kernel that risks periodically going OOM due to this design decision?


From:Robert Haas <robertmhaas(at)gmail(dot)com>
Date:2018-04-03 21:47:01

On Tue, Apr 3, 2018 at 6:35 AM, Anthony Iliopoulos wrote:

Like other people here, I think this is 100% unreasonable, starting with "the dirty pages which cannot been written out are practically thrown away". Who decided that was OK, and on the basis of what wording in what specification? I think it's always unreasonable to throw away the user's data.

If you insist on strict conformance to POSIX, indeed the linux glibc configuration and associated manpage are probably wrong in stating that _POSIX_SYNCHRONIZED_IO is supported. The implementation matches that of the flexibility allowed by not supporting SIO. There's a long history of brokenness between linux and posix, and I think there was never an intention of conforming to the standard.

Well, then the man page probably shouldn't say CONFORMING TO 4.3BSD, POSIX.1-2001, which on the first system I tested, it did. Also, the summary should be changed from the current "fsync, fdatasync - synchronize a file's in-core state with storage device" by adding ", possibly by randomly undoing some of the changes you think you made to the file".

I believe (as I tried to explain earlier) there is a certain assumption being made that the writer and original owner of data is responsible for dealing with potential errors in order to avoid data loss (which should be only of interest to the original writer anyway). It would be very questionable for the interface to persist the error while subsequent writes and fsyncs to different offsets may as well go through.

No, that's not questionable at all. fsync() doesn't take any argument saying which part of the file you care about, so the kernel is entirely not entitled to assume it knows to which writes a given fsync() call was intended to apply.

Another process may need to write into the file and fsync; being unaware of those newly introduced semantics, it is now faced with EIO because some unrelated previous process failed some earlier writes and did not bother to clear the error for those writes. In a similar scenario where the second process is aware of the new semantics, it would naturally go ahead and clear the global error in order to proceed with its own write()+fsync(), which would essentially amount to the same problematic semantics you have now.

I don't deny that it's possible that somebody could have an application which is utterly indifferent to the fact that earlier modifications to a file failed due to I/O errors, but is A-OK with that as long as later modifications can be flushed to disk, but I don't think that's a normal thing to want.

Also, this really does make it impossible to write reliable programs. Imagine that, while the server is running, somebody runs a program which opens a file in the data directory, calls fsync() on it, and closes it. If the fsync() fails, postgres is now borked and has no way of being aware of the problem. If we knew, we could PANIC, but we'll never find out, because the unrelated process ate the error. This is exactly the sort of ill-considered behavior that makes fcntl() locking nearly useless.

Fully agree, and the errseq_t fixes have dealt exactly with the issue of making sure that the error is reported to all file descriptors that happen to be open at the time of error.

Well, in PostgreSQL, we have a background process called the checkpointer which is the process that normally does all of the fsync() calls but only a subset of the write() calls. The checkpointer does not, however, necessarily have every file open all the time, so these fixes aren't sufficient to make sure that the checkpointer ever sees an fsync() failure. What you have (or someone has) basically done here is made an undocumented assumption about which file descriptors might care about a particular error, but it just so happens that PostgreSQL has never conformed to that assumption. You can keep on saying the problem is with our assumptions, but it doesn't seem like a very good guess to me to suppose that we're the only program that has ever made them. The documentation for fsync() gives zero indication that it's edge-triggered, and so complaining that people wouldn't like it if it became level-triggered seems like an ex post facto justification for a poorly-chosen behavior: they probably think (as we did prior to a week ago) that it already is.

Not sure I understand this case. The application may indeed re-write a bunch of pages that have failed and proceed with fsync(). The kernel will deal with combining the writeback of all the re-written pages. But further the necessity of combining for performance really depends on the exact storage medium. At the point you start caring about write-combining, the kernel community will naturally redirect you to use DIRECT_IO.

Well, the way PostgreSQL works today, we typically run with say 8GB of shared_buffers even if the system memory is, say, 200GB. As pages are evicted from our relatively small cache to the operating system, we track which files need to be fsync()'d at checkpoint time, but we don't hold onto the blocks. Until checkpoint time, the operating system is left to decide whether it's better to keep caching the dirty blocks (thus leaving less memory for other things, but possibly allowing write-combining if the blocks are written again) or whether it should clean them to make room for other things. This means that only a small portion of the operating system memory is directly managed by PostgreSQL, while allowing the effective size of our cache to balloon to some very large number if the system isn't under heavy memory pressure.

Now, I hear the DIRECT_IO thing and I assume we're eventually going to have to go that way: Linux kernel developers seem to think that "real men use O_DIRECT" and so if other forms of I/O don't provide useful guarantees, well that's our fault for not using O_DIRECT. That's a political reason, not a technical reason, but it's a reason all the same.

Unfortunately, that is going to add a huge amount of complexity, because if we ran with shared_buffers set to a large percentage of system memory, we couldn't allocate large chunks of memory for sorts and hash tables from the operating system any more. We'd have to allocate it from our own shared_buffers because that's basically all the memory there is and using substantially more might run the system out entirely. So it's a huge, huge architectural change. And even once it's done it is in some ways inferior to what we are doing today -- true, it gives us superior control over writeback timing, but it also makes PostgreSQL play less nicely with other things running on the same machine, because now PostgreSQL has a dedicated chunk of whatever size it has, rather than using some portion of the OS buffer cache that can grow and shrink according to memory needs both of other parts of PostgreSQL and other applications on the system.

I suppose that if the check-and-clear semantics are problematic for Pg, one could suggest a kernel patch that opts-in to a level-triggered reporting of fsync() on a per-descriptor basis, which seems to be non-intrusive and probably sufficient to cover your expected use-case.

That would certainly be better than nothing.
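To make the access pattern Robert describes concrete: one process write()s and closes a file, and a different process later opens it and fsync()s it. The following is a minimal standalone sketch of that pattern, not PostgreSQL code; the file name is made up, and whether the second process's fsync() reports an intervening writeback error is exactly the kernel behaviour being debated in this thread.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "some_relation_segment";   /* hypothetical file name */

        /* "Backend": dirty the file through its own descriptor, then close it. */
        int wfd = open(path, O_WRONLY | O_CREAT, 0644);
        if (wfd < 0) { perror("open"); return 1; }
        if (write(wfd, "hello\n", 6) != 6)
            perror("write");
        close(wfd);

        /* Writeback may happen (and fail) here, while nobody has the file open. */

        /* "Checkpointer": a different process opens the file later and fsync()s it. */
        pid_t pid = fork();
        if (pid == 0) {
            int cfd = open(path, O_WRONLY);
            if (cfd < 0) { perror("open"); _exit(1); }
            if (fsync(cfd) != 0)
                perror("fsync");                /* the report we are counting on */
            else
                puts("fsync reported success"); /* may happen even if writeback failed */
            close(cfd);
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        return 0;
    }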


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-03 23:59:27

On Tue, Apr 3, 2018 at 1:29 PM, Thomas Munro wrote:

Interestingly, there don't seem to be many operating systems that can report ENOSPC from fsync(), based on a quick scan through some documentation:

POSIX, AIX, HP-UX, FreeBSD, OpenBSD, NetBSD: no
Illumos/Solaris, Linux, macOS: yes

Oops, reading comprehension fail. POSIX yes (since issue 5), via the note that read() and write()'s error conditions can also be returned.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-04 00:56:37

On Tue, Apr 3, 2018 at 05:47:01PM -0400, Robert Haas wrote:

Well, in PostgreSQL, we have a background process called the checkpointer which is the process that normally does all of the fsync() calls but only a subset of the write() calls. The checkpointer does not, however, necessarily have every file open all the time, so these fixes aren't sufficient to make sure that the checkpointer ever sees an fsync() failure.

There has been a lot of focus in this thread on the workflow:

write() -> blocks remain in kernel memory -> fsync() -> panic?

But what happens in this workflow:

write() -> kernel syncs blocks to storage -> fsync()

Is fsync() going to see a "kernel syncs blocks to storage" failure?

There was already discussion that if the fsync() causes the "syncs blocks to storage", fsync() will only report the failure once, but will it see any failure in the second workflow? There is indication that a failed write to storage reports back an error once and clears the dirty flag, but do we know it keeps things around long enough to report an error to a future fsync()?

You would think it does, but I have to ask since our fsync() assumptions have been wrong for so long.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-04 01:54:50

On Wed, Apr 4, 2018 at 12:56 PM, Bruce Momjian wrote:

There has been a lot of focus in this thread on the workflow:

    write() -> blocks remain in kernel memory -> fsync() -> panic?

But what happens in this workflow:

    write() -> kernel syncs blocks to storage -> fsync()

Is fsync() going to see a "kernel syncs blocks to storage" failure?

There was already discussion that if the fsync() causes the "syncs blocks to storage", fsync() will only report the failure once, but will it see any failure in the second workflow? There is indication that a failed write to storage reports back an error once and clears the dirty flag, but do we know it keeps things around long enough to report an error to a future fsync()?

You would think it does, but I have to ask since our fsync() assumptions have been wrong for so long.

I believe there were some problems of that nature (with various twists, based on other concurrent activity and possibly different fds), and those problems were fixed by the errseq_t system developed by Jeff Layton in Linux 4.13. Call that "bug #1".

The second issue is that the pages are marked clean after the error is reported, so further attempts to fsync() the data (in our case for a new attempt to checkpoint) will be futile but appear successful. Call that "bug #2", with the proviso that some people apparently think it's reasonable behaviour and not a bug. At least there is a plausible workaround for that: namely the nuclear option proposed by Craig.
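For concreteness, here is a hedged sketch of the retry pattern that "bug #2" defeats. It illustrates the semantics described above and is not anything PostgreSQL actually does.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Naive "retry the fsync until it works" loop.  It looks safe, but on the
     * kernels discussed here a failed fsync() marks the dirty pages clean, so a
     * later attempt can return 0 without the data ever reaching storage. */
    int sync_with_retry(int fd)
    {
        for (int attempt = 1; attempt <= 3; attempt++) {
            if (fsync(fd) == 0)
                return 0;   /* possibly "successful" only because the pages were dropped */
            fprintf(stderr, "fsync attempt %d failed: %s\n", attempt, strerror(errno));
        }
        return -1;
    }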


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-04 02:05:19

On Wed, Apr 4, 2018 at 01:54:50PM +1200, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 12:56 PM, Bruce Momjian wrote:

There has been a lot of focus in this thread on the workflow:

    write() -> blocks remain in kernel memory -> fsync() -> panic?

But what happens in this workflow:

    write() -> kernel syncs blocks to storage -> fsync()

Is fsync() going to see a "kernel syncs blocks to storage" failure?

There was already discussion that if the fsync() causes the "syncs blocks to storage", fsync() will only report the failure once, but will it see any failure in the second workflow? There is indication that a failed write to storage reports back an error once and clears the dirty flag, but do we know it keeps things around long enough to report an error to a future fsync()?

You would think it does, but I have to ask since our fsync() assumptions have been wrong for so long.

I believe there were some problems of that nature (with various twists, based on other concurrent activity and possibly different fds), and those problems were fixed by the errseq_t system developed by Jeff Layton in Linux 4.13. Call that "bug #1".

So all our non-cutting-edge Linux systems are vulnerable and there is no workaround Postgres can implement? Wow.

The second issue is that the pages are marked clean after the error is reported, so further attempts to fsync() the data (in our case for a new attempt to checkpoint) will be futile but appear successful. Call that "bug #2", with the proviso that some people apparently think it's reasonable behaviour and not a bug. At least there is a plausible workaround for that: namely the nuclear option proposed by Craig.

Yes, that one I understood.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-04 02:14:28

On Tue, Apr 3, 2018 at 10:05:19PM -0400, Bruce Momjian wrote:

On Wed, Apr 4, 2018 at 01:54:50PM +1200, Thomas Munro wrote:

I believe there were some problems of that nature (with various twists, based on other concurrent activity and possibly different fds), and those problems were fixed by the errseq_t system developed by Jeff Layton in Linux 4.13. Call that "bug #1".

So all our non-cutting-edge Linux systems are vulnerable and there is no workaround Postgres can implement? Wow.

Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file after the failure and try to fsync a write that happened before the failure.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-04 02:40:16

On 4 April 2018 at 05:47, Robert Haas wrote:

Now, I hear the DIRECT_IO thing and I assume we're eventually going to have to go that way: Linux kernel developers seem to think that "real men use O_DIRECT" and so if other forms of I/O don't provide useful guarantees, well that's our fault for not using O_DIRECT. That's a political reason, not a technical reason, but it's a reason all the same.

I looked into buffered AIO a while ago, by the way, and just ... hell no. Run, run as fast as you can.

The trouble with direct I/O is that it pushes a lot of work back on PostgreSQL regarding knowledge of the storage subsystem, I/O scheduling, etc. It's absurd to have the kernel do this, unless you want it reliable, in which case you bypass it and drive the hardware directly.

We'd need pools of writer threads to deal with all the blocking I/O. It'd be such a nightmare. Hey, why bother having a kernel at all, except for drivers?


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-04 02:44:22

On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian wrote:

On Tue, Apr 3, 2018 at 10:05:19PM -0400, Bruce Momjian wrote:

On Wed, Apr 4, 2018 at 01:54:50PM +1200, Thomas Munro wrote:

I believe there were some problems of that nature (with various twists, based on other concurrent activity and possibly different fds), and those problems were fixed by the errseq_t system developed by Jeff Layton in Linux 4.13. Call that "bug #1".

So all our non-cutting-edge Linux systems are vulnerable and there is no workaround Postgres can implement? Wow.

Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file after the failure and try to fsync a write that happened before the failure.

I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the file (presumably via any fd), which sounds like the desired behaviour:

https://github.com/torvalds/linux/blob/master/mm/filemap.c#L682

When userland calls fsync (or something like nfsd does the equivalent), we want to report any writeback errors that occurred since the last fsync (or since the file was opened if there haven't been any).

But I'm not sure what the lifetime of the passed-in "file" and more importantly "file->f_wb_err" is. Specifically, what happens to it if no one has the file open at all, between operations? It is reference counted, see fs/file_table.c. I don't know enough about it to comment.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-04 05:29:28

On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian wrote:

Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file after the failure and try to fsync a write that happened before the failure.

I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the file (presumably via any fd), which sounds like the desired behaviour:

[..]

Scratch that. Whenever you open a file descriptor you can't see any preceding errors at all, because:

/* Ensure that we skip any errors that predate opening of the file */
f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

https://github.com/torvalds/linux/blob/master/fs/open.c#L752

Our whole design is based on being able to open, close and reopen files at will from any process, and in particular to fsync() from a different process that didn't inherit the fd but instead opened it later. But it looks like that might be able to eat errors that occurred during asynchronous writeback (when there was nobody to report them to), before you opened the file?

If so I'm not sure how that can possibly be considered to be an implementation of _POSIX_SYNCHRONIZED_IO: "the fsync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state." Note "the file", not "this file descriptor + copies", and without reference to when you opened it.

But I'm not sure what the lifetime of the passed-in "file" and more importantly "file->f_wb_err" is.

It's really inode->i_mapping->wb_err's lifetime that I should have been asking about there, not file->f_wb_err, but I see now that that question is irrelevant due to the above.
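The sampling behaviour Thomas describes can be modelled in a few lines of user-space C. This is an assumption-laden simplification, not the kernel's errseq_t code, but it captures why an error recorded before the checkpointer's open() is invisible to its later fsync().

    #include <stdio.h>

    struct file_mapping { unsigned wb_err; };      /* per-inode error counter (model) */
    struct open_file    { struct file_mapping *m; unsigned sampled; };

    static void model_open(struct open_file *f, struct file_mapping *m)
    {
        f->m = m;
        f->sampled = m->wb_err;                    /* "skip any errors that predate opening" */
    }

    static int model_fsync(struct open_file *f)
    {
        if (f->m->wb_err != f->sampled) {          /* error happened after we opened */
            f->sampled = f->m->wb_err;             /* report once, then advance */
            return -1;
        }
        return 0;                                  /* older errors: silently "successful" */
    }

    int main(void)
    {
        struct file_mapping inode = {0};
        inode.wb_err++;                            /* writeback error while nobody had the file open */

        struct open_file checkpointer;
        model_open(&checkpointer, &inode);         /* checkpointer opens the file afterwards */
        printf("fsync result: %d\n", model_fsync(&checkpointer));   /* prints 0: error lost */
        return 0;
    }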


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-04 06:00:21

On 4 April 2018 at 13:29, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian wrote:

Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file after the failure and try to fsync a write that happened before the failure.

I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the file (presumably via any fd), which sounds like the desired behaviour:

[..]

Scratch that. Whenever you open a file descriptor you can't see any preceding errors at all, because:

/* Ensure that we skip any errors that predate opening of the file */
f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

https://github.com/torvalds/linux/blob/master/fs/open.c#L752

Our whole design is based on being able to open, close and reopen files at will from any process, and in particular to fsync() from a different process that didn't inherit the fd but instead opened it later. But it looks like that might be able to eat errors that occurred during asynchronous writeback (when there was nobody to report them to), before you opened the file?

Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?

I'll see if I can expand my testcase for that. I'm presently dockerizing it to make it easier for others to use, but that turns out to be a major pain when using devmapper etc. Docker in privileged mode doesn't seem to play nice with device-mapper.

Does that mean that the ONLY ways to do reliable I/O are:

  • single-process, single-file-descriptor write() then fsync(); on failure, retry all work since last successful fsync()
  • direct I/O

?
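As a rough illustration of the first option in the list above (single process, single descriptor, retry everything since the last successful fsync()), here is a hedged sketch. It assumes the caller opened the descriptor itself and that each batch fits in a fixed buffer; it is not production code and not anything from PostgreSQL.

    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define PENDING_MAX (1024 * 1024)

    static char   pending[PENDING_MAX];  /* everything written since the last good fsync() */
    static size_t pending_len;
    static off_t  pending_off;           /* file offset where the pending batch starts */

    /* Append data and remember it until it is known to be durable. */
    int durable_append(int fd, const void *buf, size_t len)
    {
        if (pending_len + len > PENDING_MAX)
            return -1;                   /* batch too large for this simple sketch */
        memcpy(pending + pending_len, buf, len);
        if (pwrite(fd, buf, len, pending_off + (off_t) pending_len) != (ssize_t) len)
            return -1;
        pending_len += len;
        return 0;
    }

    /* fsync(); on failure, re-dirty the whole batch (the kernel marked the pages
     * clean when it reported the error) and try a few more times. */
    int durable_sync(int fd)
    {
        for (int attempt = 0; attempt < 3; attempt++) {
            if (fsync(fd) == 0) {
                pending_off += (off_t) pending_len;   /* batch is durable; start a new one */
                pending_len = 0;
                return 0;
            }
            perror("fsync");
            if (pwrite(fd, pending, pending_len, pending_off) != (ssize_t) pending_len)
                return -1;
        }
        return -1;                       /* give up; caller must treat the batch as lost */
    }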


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-04 07:32:04

On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer wrote:

On 4 April 2018 at 13:29, Thomas Munro wrote:

/* Ensure that we skip any errors that predate opening of the file */
f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

[...]

Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?

Predates the opening of the file by the process that calls fsync(). Yeah, it sure looks that way based on the above code fragment. Does anyone know better?

Does that mean that the ONLY ways to do reliable I/O are:

  • single-process, single-file-descriptor write() then fsync(); on failure, retry all work since last successful fsync()

I suppose you could come up with some crazy complicated IPC scheme to make sure that the checkpointer always has an fd older than any writes to be flushed, with some fallback strategy for when it can't take any more fds.

I haven't got any good ideas right now.

  • direct I/O

As a bit of an aside, I gather that when you resize files (think truncating/extending relation files) you still need to call fsync() even if you read/write all data with O_DIRECT, to make it flush the filesystem meta-data. I have no idea if that could also be affected by eaten writeback errors.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-04 07:51:53

On 4 April 2018 at 14:00, Craig Ringer wrote:

On 4 April 2018 at 13:29, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian wrote:

Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file after the failure and try to fsync a write that happened before the failure.

I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the file (presumably via any fd), which sounds like the desired behaviour:

[..]

Scratch that. Whenever you open a file descriptor you can't see any preceding errors at all, because:

/* Ensure that we skip any errors that predate opening of the file */
f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

https://github.com/torvalds/linux/blob/master/fs/open.c#L752

Our whole design is based on being able to open, close and reopen files at will from any process, and in particular to fsync() from a different process that didn't inherit the fd but instead opened it later. But it looks like that might be able to eat errors that occurred during asynchronous writeback (when there was nobody to report them to), before you opened the file?

Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?

I'll see if I can expand my testcase for that. I'm presently dockerizing it to make it easier for others to use, but that turns out to be a major pain when using devmapper etc. Docker in privileged mode doesn't seem to play nice with device-mapper.

Done, you can find it in https://github.com/ringerc/scrapcode/tree/master/testcases/fsync-error-clear now.

Warning, this runs a Docker container in privileged mode on your system, and it uses devicemapper. Read it before you run it, and while I've tried to keep it safe, beware that it might eat your system.

For now it tests only xfs and EIO. Other FSs should be easy enough.

I haven't added coverage for multi-processing yet, but given what you found above, I should. I'll probably just system() a copy of the same proc with instructions to only fsync(). I'll do that next.

I haven't worked out a reliable way to trigger ENOSPC on fsync() yet, when mapping without the error hole. It happens sometimes, but I don't know why; it almost always happens on write() instead. I know it can happen on nfs, but I'm hoping for a saner example than that to test with. ext4 and xfs do delayed allocation but eager reservation, so it shouldn't happen to them.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-04 13:49:38

On Wed, Apr 4, 2018 at 07:32:04PM +1200, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer wrote:

On 4 April 2018 at 13:29, Thomas Munro wrote:

/* Ensure that we skip any errors that predate opening of the file */
f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

[...]

Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?

Predates the opening of the file by the process that calls fsync(). Yeah, it sure looks that way based on the above code fragment. Does anyone know better?

Uh, just to clarify, what is new here is that it is ignoring any errors that happened before the open(). It is not ignoring write()'s that happened but have not been written to storage before the open().

FYI, pg_test_fsync has always tested the ability to fsync() write()s from other processes:

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
    write, fsync, close                5360.341 ops/sec     187 usecs/op
    write, close, fsync                4785.240 ops/sec     209 usecs/op

Those two numbers should be similar. I added this as a check to make sure the behavior we were relying on was working. I never tested sync errors though.
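The property that check relies on can be sketched as a tiny standalone program (this is not the actual pg_test_fsync code, and the scratch file name is invented): data written through one descriptor should be flushed by an fsync() issued on a different descriptor for the same file.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "pg_test_fsync.tmp";          /* hypothetical scratch file */
        char block[8192] = {0};

        int wfd = open(path, O_WRONLY | O_CREAT, 0600);  /* descriptor used for write() */
        int sfd = open(path, O_WRONLY);                  /* separate descriptor used for fsync() */
        if (wfd < 0 || sfd < 0) { perror("open"); return 1; }

        if (write(wfd, block, sizeof block) != (ssize_t) sizeof block)
            perror("write");
        close(wfd);                                      /* writer is done; only sfd remains */

        if (fsync(sfd) != 0)                             /* should flush the data written via wfd */
            perror("fsync on non-write descriptor");
        close(sfd);
        unlink(path);
        return 0;
    }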

I think the fundamental issue is that we always assumed that writes to the kernel that could not be written to storage would remain in the kernel until they succeeded, and that fsync() would report their existence.

I can understand why kernel developers don't want to keep failed sync buffers in memory, and once they are gone we lose reporting of their failure. Also, if the kernel is going to not retry the syncs, how long should it keep reporting the sync failure? To the first fsync that happens after the failure? How long should it continue to record the failure? What if no fsync() ever happens, which is likely for non-Postgres workloads? I think once they decided to discard failed syncs and not retry them, the fsync behavior we are complaining about was almost required.

Our only option might be to tell administrators to closely watch for kernel write failure messages, and then restore or failover. :-(

The last time I remember being this surprised about storage was in the early Postgres years when we learned that just because the BSD file system uses 8k pages doesn't mean those are atomically written to storage. We knew the operating system wrote the data in 8k chunks to storage but:

  • the 8k pages are written as separate 512-byte sectors
  • the 8k might be contiguous logically on the drive but not physically
  • even 512-byte sectors are not written atomically

This is why pre-page images are written to WAL, which is what full_page_writes controls.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-04 13:53:01

On Wed, Apr 4, 2018 at 10:40:16AM +0800, Craig Ringer wrote:

The trouble with direct I/O is that it pushes a lot of work back on PostgreSQL regarding knowledge of the storage subsystem, I/O scheduling, etc. It's absurd to have the kernel do this, unless you want it reliable, in which case you bypass it and drive the hardware directly.

We'd need pools of writer threads to deal with all the blocking I/O. It'd be such a nightmare. Hey, why bother having a kernel at all, except for drivers?

I believe this is how Oracle views the kernel, so there is precedent for this approach, though I am not advocating it.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-04 14:00:15

On 4 April 2018 at 15:51, Craig Ringer wrote:

On 4 April 2018 at 14:00, Craig Ringer wrote:

On 4 April 2018 at 13:29, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian wrote:

Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file after the failure and try to fsync a write that happened before the failure.

I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the file (presumably via any fd), which sounds like the desired behaviour:

[..]

Scratch that. Whenever you open a file descriptor you can't see any preceding errors at all, because:

/* Ensure that we skip any errors that predate opening of the file */
f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

https://github.com/torvalds/linux/blob/master/fs/open.c#L752

Our whole design is based on being able to open, close and reopen files at will from any process, and in particular to fsync() from a different process that didn't inherit the fd but instead opened it later. But it looks like that might be able to eat errors that occurred during asynchronous writeback (when there was nobody to report them to), before you opened the file?

Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?

I'll see if I can expand my testcase for that. I'm presently dockerizing it to make it easier for others to use, but that turns out to be a major pain when using devmapper etc. Docker in privileged mode doesn't seem to play nice with device-mapper.

Done, you can find it in https://github.com/ringerc/scrapcode/tree/master/testcases/fsync-error-clear now.

Update. Now supports multiple FSes.

I've tried xfs, jfs, ext3, ext4, even vfat. All behave the same on EIO. Didn't try zfs-on-linux or other platforms yet.

Still working on getting ENOSPC on fsync() rather than write(). Kernel code reading suggests this is possible, but all the above FSes reserve space eagerly on write() even if they do delayed allocation of the actual storage, so it doesn't seem to happen at least in my simple single-process test.

I'm not overly inclined to complain about a fsync() succeeding after a write() error. That seems reasonable enough, the kernel told the app at the time of the failure. What else is it going to do? I don't personally even object hugely to the current fsync() behaviour if it were, say, DOCUMENTED and conformant to the relevant standards, though not giving us any sane way to find out the affected file ranges makes it drastically harder to recover sensibly.

But what's come out since on this thread, that we cannot even rely on fsync() giving us an EIO once when it loses our data, because:

  • all currently widely deployed kernels can fail to deliver info due to a recently fixed limitation; and
  • the kernel deliberately hides errors from us if they relate to writes that occurred before we opened the FD (?)

... that's really troubling. I thought we could at least fix this by PANICing on EIO, and was mostly worried about ENOSPC. But now it seems we can't even do that and expect reliability. So what the @#$ are we meant to do?

It's the error reporting issues around closing and reopening files with outstanding buffered I/O that's really going to hurt us here. I'll be expanding my test case to cover that shortly.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-04 14:09:09

On 4 April 2018 at 22:00, Craig Ringer wrote:

It's the error reporting issues around closing and reopening files with outstanding buffered I/O that's really going to hurt us here. I'll be expanding my test case to cover that shortly.

Also, just to be clear, this is not in any way confined to xfs and/or lvm as I originally thought it might be.

Nor is ext3/ext4's errors=remount-ro protective. data_err=abort doesn't help either (so what does it do?).

What bewilders me is that running with data=journal doesn't seem to be safe either. WTF?

[26438.846111] EXT4-fs (dm-0): mounted filesystem with journalled data
mode. Opts: errors=remount-ro,data_err=abort,data=journal
[26454.125319] EXT4-fs warning (device dm-0): ext4_end_bio:323: I/O error
10 writing to inode 12 (offset 0 size 0 starting block 59393)
[26454.125326] Buffer I/O error on device dm-0, logical block 59393
[26454.125337] Buffer I/O error on device dm-0, logical block 59394
[26454.125343] Buffer I/O error on device dm-0, logical block 59395
[26454.125350] Buffer I/O error on device dm-0, logical block 59396

and splat, there goes your data anyway.

It's possible that this is in some way related to using the device-mapper "error" target and a loopback device in testing. But I don't really see how.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-04 14:25:47

On Wed, Apr 4, 2018 at 10:09:09PM +0800, Craig Ringer wrote:

On 4 April 2018 at 22:00, Craig Ringer wrote:

It's the error reporting issues around closing and reopening files with
outstanding buffered I/O that's really going to hurt us here. I'll be
expanding my test case to cover that shortly.

Also, just to be clear, this is not in any way confined to xfs and/or lvm as I originally thought it might be.

Nor is ext3/ext4's errors=remount-ro protective. data_err=abort doesn't help either (so what does it do?).

Anthony Iliopoulos reported in this thread that errors=remount-ro is only affected by metadata writes.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-04 14:42:18

On 4 April 2018 at 22:25, Bruce Momjian wrote:

On Wed, Apr 4, 2018 at 10:09:09PM +0800, Craig Ringer wrote:

On 4 April 2018 at 22:00, Craig Ringer wrote:

It's the error reporting issues around closing and reopening files with
outstanding buffered I/O that's really going to hurt us here. I'll be
expanding my test case to cover that shortly.

Also, just to be clear, this is not in any way confined to xfs and/or lvm as I originally thought it might be.

Nor is ext3/ext4's errors=remount-ro protective. data_err=abort doesn't help either (so what does it do?).

Anthony Iliopoulos reported in this thread that errors=remount-ro is only affected by metadata writes.

Yep, I gathered. I was referring to data_err.


From:Antonis Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-04 15:23:31

On Wed, Apr 4, 2018 at 4:42 PM, Craig Ringer wrote:

On 4 April 2018 at 22:25, Bruce Momjian wrote:

On Wed, Apr 4, 2018 at 10:09:09PM +0800, Craig Ringer wrote:

On 4 April 2018 at 22:00, Craig Ringer wrote:

It's the error reporting issues around closing and reopening files with outstanding buffered I/O that's really going to hurt us here. I'll be expanding my test case to cover that shortly.

Also, just to be clear, this is not in any way confined to xfs and/or lvm as I originally thought it might be.

Nor is ext3/ext4's errors=remount-ro protective. data_err=abort doesn't help either (so what does it do?).

Anthony Iliopoulos reported in this thread that errors=remount-ro is only affected by metadata writes.

Yep, I gathered. I was referring to data_err.

As far as I recall data_err=abort pertains to the jbd2 handling of potential writeback errors. Jbd2 will internally attempt to drain the data upon txn commit (and it's even kind enough to restore the EIO at the address space level, that otherwise would get eaten).

When data_err=abort is set, then jbd2 forcibly shuts down the entire journal, with the error being propagated upwards to ext4. I am not sure at which point this would be manifested to userspace and how, but in principle any subsequent fs operations would get some filesystem error due to the journal being down (I would assume similar to remounting the fs read-only).

Since you are using data=journal, I would indeed expect to see something more than what you saw in dmesg.

I can have a look later, I plan to also respond to some of the other interesting issues that you guys raised in the thread.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-04 15:23:51

On 4 April 2018 at 21:49, Bruce Momjian wrote:

On Wed, Apr 4, 2018 at 07:32:04PM +1200, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer wrote:

On 4 April 2018 at 13:29, Thomas Munro wrote:

/* Ensure that we skip any errors that predate opening of the file */
f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

[...]

Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?

Predates the opening of the file by the process that calls fsync(). Yeah, it sure looks that way based on the above code fragment. Does anyone know better?

Uh, just to clarify, what is new here is that it is ignoring any errors that happened before the open(). It is not ignoring write()'s that happened but have not been written to storage before the open().

FYI, pg_test_fsync has always tested the ability to fsync() write()s from other processes:

    Test if fsync on non-write file descriptor is honored:
    (If the times are similar, fsync() can sync data written on a different
    descriptor.)
        write, fsync, close                5360.341 ops/sec     187 usecs/op
        write, close, fsync                4785.240 ops/sec     209 usecs/op

Those two numbers should be similar. I added this as a check to make sure the behavior we were relying on was working. I never tested sync errors though.

I think the fundamental issue is that we always assumed that writes to the kernel that could not be written to storage would remain in the kernel until they succeeded, and that fsync() would report their existence.

I can understand why kernel developers don't want to keep failed sync buffers in memory, and once they are gone we lose reporting of their failure. Also, if the kernel is going to not retry the syncs, how long should it keep reporting the sync failure?

Ideally until the app tells it not to.

But there's no standard API for that.

The obvious answer seems to be "until the FD is closed". But we just discussed how Pg relies on being able to open and close files freely. That may not be as reasonable a thing to do as we thought it was when you consider error reporting. What's the kernel meant to do? How long should it remember "I had an error while doing writeback on this file"? Should it flag the file metadata and remember across reboots? Obviously not, but where does it stop? Tell the next program that does an fsync() and forget? How could it associate a dirty buffer on a file with no open FDs with any particular program at all? And what if the app did a write then closed the file and went away, never to bother to check the file again, like most apps do?

Some I/O errors are transient (network issue, etc). Some are recoverable with some sort of action, like disk space issues, but may take a long time before an admin steps in. Some are entirely unrecoverable (disk 1 in striped array is on fire) and there's no possible recovery. Currently we kind of hope the kernel will deal with figuring out which is which and retrying. Turns out it doesn't do that so much, and I don't think the reasons for that are wholly unreasonable. We may have been asking too much.

That does leave us in a pickle when it comes to the checkpointer and opening/closing FDs. I don't know what the "right" thing for the kernel to do from our perspective even is here, but the best I can come up with is actually pretty close to what it does now. Report the fsync() error to the first process that does an fsync() since the writeback error if one has occurred, then forget about it. Ideally I'd have liked it to mark all FDs pointing to the file with a flag to report EIO on next fsync too, but it turns out that won't even help us due to our opening and closing behaviour, so we're going to have to take responsibility for handling and communicating that ourselves, preventing checkpoint completion if any backend gets an fsync error. Probably by PANICing. Some extra work may be needed to ensure reliable ordering and stop checkpoints completing if their fsync() succeeds due to a recent failed fsync() on a normal backend that hasn't PANICed or where the postmaster hasn't noticed yet.

Our only option might be to tell administrators to closely watch for kernel write failure messages, and then restore or failover. :-(

Speaking of, there's not necessarily any lost page write error in the logs AFAICS. My tests often just show "Buffer I/O error on device dm-0, logical block 59393" or the like.
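The "PANIC on fsync() error" policy Craig outlines above boils down to a wrapper like the one below. This is a minimal illustration of the idea, assuming a crash-and-recover-from-WAL architecture; it is not PostgreSQL's actual checkpointer code.

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Treat any fsync() failure as unrecoverable: crash and let WAL replay,
     * rather than a retried fsync(), re-establish durability. */
    void fsync_or_panic(int fd, const char *path)
    {
        if (fsync(fd) != 0) {
            fprintf(stderr, "PANIC: could not fsync file \"%s\": %s\n",
                    path, strerror(errno));
            abort();
        }
    }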


From:Gasper Zejn <zejn(at)owca(dot)info>
Date:2018-04-04 17:23:58

On 04. 04. 2018 15:49, Bruce Momjian wrote:

I can understand why kernel developers don't want to keep failed sync buffers in memory, and once they are gone we lose reporting of their failure. Also, if the kernel is going to not retry the syncs, how long should it keep reporting the sync failure? To the first fsync that happens after the failure? How long should it continue to record the failure? What if no fsync() ever happens, which is likely for non-Postgres workloads? I think once they decided to discard failed syncs and not retry them, the fsync behavior we are complaining about was almost required.

Ideally the kernel would keep its data for as little time as possible. With fsync, it doesn't really know which process is interested in knowing about a write error; it just assumes the caller will know how to deal with it. The most unfortunate issue is that there's no way to get information about a write error.

Thinking aloud - couldn't/shouldn't a write error also be a file system event reported by inotify? Admittedly that's only a thing on Linux, but still.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-04 17:51:03

On Wed, Apr 4, 2018 at 11:23:51PM +0800, Craig Ringer wrote:

On 4 April 2018 at 21:49, Bruce Momjian wrote:

I can understand why kernel developers don't want to keep failed sync buffers in memory, and once they are gone we lose reporting of their failure. Also, if the kernel is going to not retry the syncs, how long should it keep reporting the sync failure?

Ideally until the app tells it not to.

But there's no standard API for that.

You would almost need an API that registers before the failure that you care about sync failures, and that you plan to call fsync() to gather such information. I am not sure how you would allow more than the first fsync() to see the failure unless you added another API to clear the fsync failure, but I don't see the point since the first fsync() might call that clear function. How many applications are going to know there is another application that cares about the failure? Not many.

Currently we kind of hope the kernel will deal with figuring out which is which and retrying. Turns out it doesn't do that so much, and I don't think the reasons for that are wholly unreasonable. We may have been asking too much.

Agreed.

Our only option might be to tell administrators to closely watch for kernel write failure messages, and then restore or failover. :-(

Speaking of, there's not necessarily any lost page write error in the logs AFAICS. My tests often just show "Buffer I/O error on device dm-0, logical block 59393" or the like.

I assume that is the kernel logs. I am thinking the kernel logs have to be monitored, but how many administrators do that? The other issue I think you are pointing out is: how is the administrator going to know this is a Postgres file? I guess on any sync error to a device that contains Postgres, we have to assume Postgres is corrupted. :-(
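To make the registration-style interface Bruce floats above concrete, it might look something like the declarations below. Neither function exists in any kernel; the names and signatures are invented here purely to illustrate the idea and its awkwardness (who subscribes, when, and how many subscribers get told?).

    #include <stddef.h>

    /* Hypothetical: ask the kernel to retain writeback errors for this file
     * because this process intends to come back and fsync() it later. */
    int fsync_error_subscribe(int fd);

    /* Hypothetical: return (without clearing it for other subscribers) any
     * writeback error recorded since this process subscribed. */
    int fsync_error_collect(int fd, int *errno_out);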


see explicit treatment of retrying, though I'm not entirely sure if the retry flag is set just for async write-back), and apparently unlike every other kernel I've tried to grok so far (things descended from ancestral BSD but not descended from FreeBSD, with macOS/Darwin apparently in the first category for this purpose).

Here's a new ticket in the NetBSD bug database for this stuff:

http://gnats.netbsd.org/53152

As mentioned in that ticket and by Andres earlier in this thread, keeping the page dirty isn't the only strategy that would work and may be problematic in different ways (it tells the truth but floods your cache with unflushable stuff until eventually you force unmount it and your buffers are eventually invalidated after ENXIO errors? I don't know.). I have no qualified opinion on that. I just know that we need a way for fsync() to tell the truth about all preceding writes or our checkpoints are busted.

*We mmap() + msync() in pg_flush_data() if you don't have sync_file_range(), and I see now that that is probably not a great idea on ZFS because you'll finish up double-buffering (or is that triple-buffering?), flooding your page cache with transient data. Oops. That is off-topic and not relevant for the checkpoint correctness topic of this thread though, since pg_flush_data() is advisory only.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-04 22:14:24

On Thu, Apr 5, 2018 at 9:28 AM, Thomas Munro wrote:

On Thu, Apr 5, 2018 at 2:00 AM, Craig Ringer wrote:

I've tried xfs, jfs, ext3, ext4, even vfat. All behave the same on EIO. Didn't try zfs-on-linux or other platforms yet.

While contemplating what exactly it would do (not sure),

See manual for failmode=wait | continue | panic. Even "continue" returns EIO to all new write requests, so they apparently didn't bother to supply an 'eat-my-data-but-tell-me-everything-is-fine' mode. Figures.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-05 07:09:57

Summary to date:

It's worse than I thought originally, because:

  • Most widely deployed kernels have cases where they don't tell you about losing your writes at all; and
  • Information about loss of writes can be masked by closing and re-opening a file

So the checkpointer cannot trust that a successful fsync() means ... a successful fsync().

Also, it's been reported to me off-list that anyone on the system calling sync(2) or the sync shell command will also generally consume the write error, causing us not to see it when we fsync(). The same is true for /proc/sys/vm/drop_caches. I have not tested these yet.

There's some level of agreement that we should PANIC on fsync() errors, at least on Linux, but likely everywhere. But we also now know it's insufficient to be fully protective.

I previously thought that errors=remount-ro was a sufficient safeguard. It isn't. There doesn't seem to be anything that is, for ext3, ext4, btrfs or xfs.

It's not clear to me yet why data_err=abort isn't sufficient in data=ordered or data=writeback mode on ext3 or ext4; needs more digging. (In my test tools that's: make FSTYPE=ext4 MKFSOPTS="" MOUNTOPTS="errors=remount-ro,data_err=abort,data=journal" as of the current version d7fe802ec). AFAICS that's because data_err=abort only affects data=ordered, not data=journal. If you use data=ordered, you at least get retries of the same write failing. This post https://lkml.org/lkml/2008/10/10/80 added the option and has some explanation, but doesn't explain why it doesn't affect data=journal.

zfs is probably not affected by the issues, per Thomas Munro. I haven't run my test scripts on it yet because my kernel doesn't have zfs support and I'm prioritising the multi-process / open-and-close issues.

So far none of the FSes and options I've tried exhibit the behaviour I actually want, which is to make the fs readonly or inaccessible on I/O error.

ENOSPC doesn't seem to be a concern during normal operation of major file systems (ext3, ext4, btrfs, xfs) because they reserve space before returning from write(). But if a buffered write does manage to fail due to ENOSPC we'll definitely see the same problems. This makes ENOSPC on NFS a potentially data corrupting condition since NFS doesn't preallocate space before returning from write().

I think what we really need is a block-layer fix, where an I/O error flips the block device into read-only mode, as if blockdev --setro had been used. Though I'd settle for a kernel panic, frankly. I don't think anybody really wants this, but I'd rather either of those to silent data loss.

I'm currently tweaking my test to close and reopen the file between each write() and fsync(), and to support running with nfs.

I've also just found the device-mapper "flakey" driver, which looks fantastic for simulating unreliable I/O with intermittent faults. I've been using the "error" target in a mapping, which lets me remap some of the device to always error, but "flakey" looks very handy for actual PostgreSQL testing.

For the sake of Google, these are errors known to be associated with the problem:

ext4, and ext3 mounted with ext4 driver:

[42084.327345] EXT4-fs warning (device dm-0): ext4_end_bio:323: I/O error
10 writing to inode 12 (offset 0 size 0 starting block 59393)
[42084.327352] Buffer I/O error on device dm-0, logical block 59393

xfs:

[42193.771367] XFS (dm-0): writeback error on sector 118784
[42193.784477] XFS (dm-0): writeback error on sector 118784

jfs: (nil, silence in the kernel logs)

You should also beware of "lost page write" or "lost write" errors.

From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-05 08:46:08

On 5 April 2018 at 15:09, Craig Ringer wrote:

Also, it's been reported to me off-list that anyone on the system calling sync(2) or the sync shell command will also generally consume the write error, causing us not to see it when we fsync(). The same is true for /proc/sys/vm/drop_caches. I have not tested these yet.

I just confirmed this with a tweak to the test that

  • records the file position
  • close()s the fd
  • sync()s
  • open()s the file
  • lseek()s back to the recorded position

This causes the test to completely ignore the I/O error, which is not reported to it at any time.

Fair enough, really, when you look at it from the kernel's point of view. What else can it do? Nobody has the file open. It'd have to mark the file itself as bad somehow. But that's pretty bad for our robustness AFAICS.

There's some level of agreement that we should PANIC on fsync() errors, at least on Linux, but likely everywhere. But we also now know it's insufficient to be fully protective.

If dirty writeback fails between our close() and re-open() I see the same behaviour as with sync(). To test that I set dirty_writeback_centisecs and dirty_expire_centisecs to 1 and added a usleep(3*100*1000) between close() and open(). (It's still plenty slow). So sync() is a convenient way to simulate something other than our own fsync() writing out the dirty buffer.

If I omit the sync() then we get the error reported by fsync() once when we re open() the file and fsync() it, because the buffers weren't written out yet, so the error wasn't generated until we re-open()ed the file. But I doubt that'll happen much in practice because dirty writeback will get to it first so the error will be seen and discarded before we reopen the file in the checkpointer.

In other words, it looks like even with a new kernel with the error reporting bug fixes, if I understand how the backends and checkpointer interact when it comes to file descriptors, we're unlikely to notice I/O errors and fail a checkpoint. We may notice I/O errors if a backend does its own eager writeback for large I/O operations, or if the checkpointer fsync()s a file before the kernel's dirty writeback gets around to trying to flush the pages that will fail.

I haven't tested anything with multiple processes / multiple FDs yet, where we keep one fd open while writing on another.

But at this point I don't see any way to make Pg reliably detect I/O errors and fail a checkpoint then redo and retry. To even fix this by PANICing like I proposed originally, we need to know we have to PANIC.

AFAICS it's completely unsafe to write(), close(), open() and fsync() and expect that the fsync() makes any promises about the write(). Which if I read Pg's low level storage code right, makes it completely unable to reliably detect I/O errors.

When you put it that way, it sounds fair enough too. How long is the kernel meant to remember that there was a write error on the file, triggered by a write initiated by some seemingly unrelated process, some unbounded time ago, on a since-closed file?

But it seems to put Pg on the fast track to O_DIRECT.
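The write/close/sync/open/fsync sequence Craig walks through above can be written out as a short sketch. This is illustrative only, not his actual test harness; it assumes the file lives on a device rigged to fail writeback (e.g. via device-mapper), and on the kernels discussed here the final fsync() returns 0 even though the earlier write was lost.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "victim";                 /* file on a deliberately faulty device */

        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }
        off_t pos = lseek(fd, 0, SEEK_END);          /* record the file position */
        if (write(fd, "data\n", 5) != 5)
            perror("write");
        close(fd);                                   /* close the fd */

        sync();                                      /* writeback fails here; the error is consumed */

        fd = open(path, O_WRONLY);                   /* reopen the file */
        if (fd < 0) { perror("open"); return 1; }
        lseek(fd, pos, SEEK_SET);                    /* seek back to the recorded position */
        if (fsync(fd) != 0)
            perror("fsync");
        else
            puts("fsync reported success; the I/O error was never seen");
        close(fd);
        return 0;
    }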


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-05 19:33:14

On Thu, Apr 5, 2018 at 03:09:57PM +0800, Craig Ringer wrote:

ENOSPC doesn't seem to be a concern during normal operation of major file systems (ext3, ext4, btrfs, xfs) because they reserve space before returning from write(). But if a buffered write does manage to fail due to ENOSPC we'll definitely see the same problems. This makes ENOSPC on NFS a potentially data corrupting condition since NFS doesn't preallocate space before returning from write().

This does explain why NFS has a reputation for unreliability for Postgres.


From:Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Date:2018-04-05 23:37:42

Note: as I've brought up in another thread, it turns out that PG is not handling fsync errors correctly even when the OS does do the right thing (discovered by testing on FreeBSD).


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-06 01:27:05

On 6 April 2018 at 07:37, Andrew Gierth wrote:

Note: as I've brought up in another thread, it turns out that PG is not handling fsync errors correctly even when the OS does do the right thing (discovered by testing on FreeBSD).

Yikes. For other readers, the related thread for this is

Meanwhile, I've extended my test to run postgres on a deliberately faulty volume and confirmed my results there.

2018-04-06 01:11:40.555 UTC [58] LOG:  checkpoint starting: immediate force
wait
2018-04-06 01:11:40.567 UTC [58] ERROR:  could not fsync file
"base/12992/16386": Input/output error
2018-04-06 01:11:40.655 UTC [66] ERROR:  checkpoint request failed
2018-04-06 01:11:40.655 UTC [66] HINT:  Consult recent messages in the
server log for details.
2018-04-06 01:11:40.655 UTC [66] STATEMENT:  CHECKPOINT



Checkpoint failed with checkpoint request failed
HINT:  Consult recent messages in the server log for details.



Retrying



2018-04-06 01:11:41.568 UTC [58] LOG:  checkpoint starting: immediate force
wait
2018-04-06 01:11:41.614 UTC [58] LOG:  checkpoint complete: wrote 0 buffers
(0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s,
sync=0.000 s, total=0.046 s; sync files=3, longest=0.000 s, average=0.000
s; distance=2727 kB, estimate=2779 kB

Given your report, now I have to wonder if we even reissued the fsync() at all this time. 'perf' time. OK, with

sudo perf record -e syscalls:sys_enter_fsync,syscalls:sys_exit_fsync -a
sudo perf script

I see the failed fsync, then the same fd being fsync()d without error on the next checkpoint, which succeeds.

        postgres  9602 [003] 72380.325817: syscalls:sys_enter_fsync: fd:
0x00000005
        postgres  9602 [003] 72380.325931:  syscalls:sys_exit_fsync:
0xfffffffffffffffb
...
        postgres  9602 [000] 72381.336767: syscalls:sys_enter_fsync: fd:
0x00000005
        postgres  9602 [000] 72381.336840:  syscalls:sys_exit_fsync: 0x0

... and Pg continues merrily on its way without realising it lost data:

[72379.834872] XFS (dm-0): writeback error on sector 118752
[72380.324707] XFS (dm-0): writeback error on sector 118688

In this test I set things up so the checkpointer would see the first fsync() error. But if I make checkpoints less frequent, the bgwriter aggressive, and kernel dirty writeback aggressive, it should be possible to have the failure go completely unobserved too. I'll try that next, because we've already largely concluded that the solution to the issue above is to PANIC on fsync() error. But if we don't see the error at all we're in trouble.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-06 02:53:56

On Fri, Apr 6, 2018 at 1:27 PM, Craig Ringer wrote:

On 6 April 2018 at 07:37, Andrew Gierth wrote:

Note: as I've brought up in another thread, it turns out that PG is not handling fsync errors correctly even when the OS does do the right thing (discovered by testing on FreeBSD).

Yikes. For other readers, the related thread for this is

Yeah. That's really embarrassing, especially after beating up on various operating systems all week. It's also an independent issue -- let's keep that on the other thread and get it fixed.

I see the failed fsync, then the same fd being fsync()d without error on the next checkpoint, which succeeds.

    postgres  9602 [003] 72380.325817: syscalls:sys_enter_fsync: fd: 0x00000005
    postgres  9602 [003] 72380.325931:  syscalls:sys_exit_fsync: 0xfffffffffffffffb
    ...
    postgres  9602 [000] 72381.336767: syscalls:sys_enter_fsync: fd: 0x00000005
    postgres  9602 [000] 72381.336840:  syscalls:sys_exit_fsync: 0x0

... and Pg continues merrily on its way without realising it lost data:

[72379.834872] XFS (dm-0): writeback error on sector 118752
[72380.324707] XFS (dm-0): writeback error on sector 118688

In this test I set things up so the checkpointer would see the first fsync() error. But if I make checkpoints less frequent, the bgwriter aggressive, and kernel dirty writeback aggressive, it should be possible to have the failure go completely unobserved too. I'll try that next, because we've already largely concluded that the solution to the issue above is to PANIC on fsync() error. But if we don't see the error at all we're in trouble.

I suppose you only see errors because the file descriptors linger open in the virtual file descriptor cache, which is a matter of luck depending on how many relation segment files you touched. One thing you could try to confirm our understanding of the Linux 4.13+ policy would be to hack PostgreSQL so that it reopens the file descriptor every time in mdsync(). See attached.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-06 03:20:22

On 6 April 2018 at 10:53, Thomas Munro wrote:

On Fri, Apr 6, 2018 at 1:27 PM, Craig Ringer wrote:

On 6 April 2018 at 07:37, Andrew Gierth wrote:

Note: as I've brought up in another thread, it turns out that PG is not handling fsync errors correctly even when the OS does do the right thing (discovered by testing on FreeBSD).

Yikes. For other readers, the related thread for this is news-spur.riddles.org.uk

Yeah. That's really embarrassing, especially after beating up on various operating systems all week. It's also an independent issue -- let's keep that on the other thread and get it fixed.

I see the failed fsync, then the same fd being fsync()d without error on the next checkpoint, which succeeds.

    postgres  9602 [003] 72380.325817: syscalls:sys_enter_fsync: fd: 0x00000005
    postgres  9602 [003] 72380.325931:  syscalls:sys_exit_fsync: 0xfffffffffffffffb
    ...
    postgres  9602 [000] 72381.336767: syscalls:sys_enter_fsync: fd: 0x00000005
    postgres  9602 [000] 72381.336840:  syscalls:sys_exit_fsync: 0x0

... and Pg continues merrily on its way without realising it lost data:

[72379.834872] XFS (dm-0): writeback error on sector 118752
[72380.324707] XFS (dm-0): writeback error on sector 118688

In this test I set things up so the checkpointer would see the first fsync() error. But if I make checkpoints less frequent, the bgwriter aggressive, and kernel dirty writeback aggressive, it should be possible to have the failure go completely unobserved too. I'll try that next, because we've already largely concluded that the solution to the issue above is to PANIC on fsync() error. But if we don't see the error at all we're in trouble.

I suppose you only see errors because the file descriptors linger open in the virtual file descriptor cache, which is a matter of luck depending on how many relation segment files you touched.

In this case I think it's because the kernel didn't get around to doing the writeback before the eagerly forced checkpoint fsync()'d it. Or we didn't even queue it for writeback from our own shared_buffers until just before we fsync()'d it. After all, it's a contrived test case that tries to reproduce the issue rapidly with big writes and frequent checkpoints.

So the checkpointer had the relation open to fsync() it, and it was the checkpointer's fsync() that did writeback on the dirty page and noticed the error.

If the kernel had done the writeback before the checkpointer opened the relation to fsync() it, we might not have seen the error at all - though as you note this depends on the file descriptor cache. You can see the silent-error behaviour in my standalone test case where I confirmed the post-4.13 behaviour. (I'm on 4.14 here).

I can try to reproduce it with postgres too, but it not only requires closing and reopening the FDs, it also requires forcing writeback before opening the fd. To make it occur in a practical timeframe I have to make my kernel writeback settings insanely aggressive and/or call sync() before re-open()ing. I don't really think it's worth it, since I've confirmed the behaviour already with the simpler test in standalone/ in the test repo. To try it yourself, clone

https://github.com/ringerc/scrapcode

and in the master branch

cd testcases/fsync-error-clear
less README
make REOPEN=reopen standalone-run

See https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear/standalone/fsync-error-clear.c#L118 .

I've pushed the postgres test to that repo too; "make postgres-run".

You'll need docker, and be warned, it's using privileged docker containers and messing with dmsetup.
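
For readers who don't want to run the containerised test, here is a minimal sketch of the behaviour that standalone test exercises (this is not the scrapcode test itself; the mount point is an assumption, and it presumes the kernel writeback in the marked window fails, e.g. because the backing device returns EIO):

/* Sketch of the close-then-reopen pattern on a 4.13/4.14-era kernel: a writeback
 * error that happens while no descriptor is open is not reported to a descriptor
 * opened afterwards, so the final fsync() can return success. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[8192];
    memset(buf, 'x', sizeof buf);

    int fd = open("/mnt/faulty/data", O_WRONLY | O_CREAT, 0600);
    if (fd < 0 || write(fd, buf, sizeof buf) != (ssize_t) sizeof buf)
        perror("open/write");
    close(fd);                        /* nothing flushed yet, no error visible */

    /* ... kernel writeback of those pages fails with EIO somewhere in here ... */
    sleep(30);

    fd = open("/mnt/faulty/data", O_WRONLY);
    if (fsync(fd) == 0)               /* this fd was opened after the error */
        printf("fsync succeeded; the failed write was silently dropped\n");
    else
        perror("fsync");
    close(fd);
    return 0;
}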


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-08 02:16:07

So, what can we actually do about this new Linux behaviour?

Idea 1:

  • whenever you open a file, either tell the checkpointer so it can open it too (and wait for it to tell you that it has done so, because it's not safe to write() until then), or send it a copy of the file descriptor via IPC (since duplicated file descriptors share the same f_wb_err)
  • if the checkpointer can't take any more file descriptors (how would that limit even work in the IPC case?), then it somehow needs to tell you that so that you know that you're responsible for fsyncing that file yourself, both on close (due to fd cache recycling) and also when the checkpointer tells you to

Maybe it could be made to work, but sheesh, that seems horrible. Is there some simpler idea along these lines that could make sure that fsync() is only ever called on file descriptors that were opened before all unflushed writes, or file descriptors cloned from such file descriptors?

Idea 2:

Give up, complain that this implementation is defective and unworkable, both on POSIX-compliance grounds and on POLA grounds, and campaign to get it fixed more fundamentally (actual details left to the experts, no point in speculating here, but we've seen a few approaches that work on other operating systems including keeping buffers dirty and marking the whole filesystem broken/read-only).

Idea 3:

Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.

Any other ideas?

For a while I considered suggesting an idea which I now think doesn't work. I thought we could try asking for a new fcntl interface that spits out the wb_err counter. Call it an opaque error token or something. Then we could store it in our fsync queue and safely close the file. Check again before fsync()ing, and if we ever see a different value, PANIC because it means a writeback error happened while we weren't looking. Sadly I think it doesn't work because AIUI inodes are not pinned in kernel memory when no one has the file open and there are no dirty buffers, so I think the counters could go away and be reset. Perhaps you could keep inodes pinned by keeping the associated buffers dirty after an error (like FreeBSD), but if you did that you'd have solved the problem already and wouldn't really need the wb_err system at all. Is there some other idea along these lines that could work?
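
For concreteness, the "send it a copy of the file descriptor via IPC" part of Idea 1 would use SCM_RIGHTS over a Unix socket; the receiving process then holds a descriptor referring to the same open file description (and hence the same f_wb_err) as the writer's. A bare sketch of the sending side, with the socket plumbing omitted (none of this is existing PostgreSQL code):

/* Sketch: hand an already-open fd to another process (e.g. the checkpointer)
 * over an AF_UNIX socket. The kernel installs a duplicate descriptor in the
 * receiver that points at the same struct file. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_fd(int sock, int fd)
{
    char dummy = 0;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = { 0 };

    memset(&u, 0, sizeof u);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    /* the receiver does the mirror-image recvmsg() and keeps the fd for later fsync() */
    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

The fd-count worry in the bullet points above is exactly where this gets ugly: the checkpointer would end up holding one descriptor per segment file it has ever been told about.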


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-08 02:33:37

On Sun, Apr 8, 2018 at 02:16:07PM +1200, Thomas Munro wrote:

So, what can we actually do about this new Linux behaviour?

Idea 1:

  • whenever you open a file, either tell the checkpointer so it can open it too (and wait for it to tell you that it has done so, because it's not safe to write() until then), or send it a copy of the file descriptor via IPC (since duplicated file descriptors share the same f_wb_err)

  • if the checkpointer can't take any more file descriptors (how would that limit even work in the IPC case?), then it somehow needs to tell you that so that you know that you're responsible for fsyncing that file yourself, both on close (due to fd cache recycling) and also when the checkpointer tells you to

Maybe it could be made to work, but sheesh, that seems horrible. Is there some simpler idea along these lines that could make sure that fsync() is only ever called on file descriptors that were opened before all unflushed writes, or file descriptors cloned from such file descriptors?

Idea 2:

Give up, complain that this implementation is defective and unworkable, both on POSIX-compliance grounds and on POLA grounds, and campaign to get it fixed more fundamentally (actual details left to the experts, no point in speculating here, but we've seen a few approaches that work on other operating systems including keeping buffers dirty and marking the whole filesystem broken/read-only).

Idea 3:

Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.

Idea 4 would be for people to assume their database is corrupt if their server logs report any I/O error on the file systems Postgres uses.


From:Christophe Pettus <xof(at)thebuild(dot)com>
Date:2018-04-08 02:37:47

On Apr 7, 2018, at 19:33, Bruce Momjian wrote:

Idea 4 would be for people to assume their database is corrupt if their server logs report any I/O error on the file systems Postgres uses.

Pragmatically, that's where we are right now. The best answer in this bad situation is (a) fix the error, then (b) replay from a checkpoint before the error occurred, but it appears we can't even guarantee that a PostgreSQL process will be the one to see the error.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-08 03:27:45

On 8 April 2018 at 10:16, Thomas Munro wrote:

So, what can we actually do about this new Linux behaviour?

Yeah, I've been chewing over that myself.

More below, but here's an idea #5: decide InnoDB has the right idea, and go to using a single massive blob file, or a few giant blobs.

We have a storage abstraction that makes this way, way less painful than it should be.

We can virtualize relfilenodes into storage extents in relatively few big files. We could use sparse regions to make the addressing more convenient, but that makes copying and backup painful, so I'd rather not.

Even one file per tablespace for persistent relation heaps, another for indexes, another for each fork type.

That way we can use something like your #1 (which is what I was also thinking about then rejecting previously), but reduce the pain by reducing the FD count drastically so exhausting FDs stops being a problem.

Previously I was leaning toward what you've described here:

  • whenever you open a file, either tell the checkpointer so it can open it too (and wait for it to tell you that it has done so, because it's not safe to write() until then), or send it a copy of the file descriptor via IPC (since duplicated file descriptors share the same f_wb_err)

  • if the checkpointer can't take any more file descriptors (how would that limit even work in the IPC case?), then it somehow needs to tell you that so that you know that you're responsible for fsyncing that file yourself, both on close (due to fd cache recycling) and also when the checkpointer tells you to

Maybe it could be made to work, but sheesh, that seems horrible. Is there some simpler idea along these lines that could make sure that fsync() is only ever called on file descriptors that were opened before all unflushed writes, or file descriptors cloned from such file descriptors?

... and got stuck on "yuck, that's awful".

I was assuming we'd force early checkpoints if the checkpointer hit its fd limit, but that's even worse.

We'd need to urgently do away with segmented relations, and partitions would start to become a hindrance.

Even then it's going to be an unworkable nightmare with heavily partitioned systems, systems that use schema-sharding, etc. And it'll mean we need to play with process limits and, often, system wide limits on FDs. I imagine the performance implications won't be pretty.

Idea 2:

Give up, complain that this implementation is defective and unworkable, both on POSIX-compliance grounds and on POLA grounds, and campaign to get it fixed more fundamentally (actual details left to the experts, no point in speculating here, but we've seen a few approaches that work on other operating systems including keeping buffers dirty and marking the whole filesystem broken/read-only).

This appears to be what SQLite does AFAICS.

https://www.sqlite.org/atomiccommit.html

though it has the huge luxury of a single writer, so it's probably only subject to the original issue not the multiprocess / checkpointer issues we face.

Idea 3:

Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.

That seems to be what the kernel folks will expect. But that's going to KILL performance. We'll need writer threads to have any hope of it not totally sucking, because otherwise simple things like updating a heap tuple and two related indexes will incur enormous disk latencies.

But I suspect it's the path forward.

Goody.

Any other ideas?

For a while I considered suggesting an idea which I now think doesn't work. I thought we could try asking for a new fcntl interface that spits out the wb_err counter. Call it an opaque error token or something. Then we could store it in our fsync queue and safely close the file. Check again before fsync()ing, and if we ever see a different value, PANIC because it means a writeback error happened while we weren't looking. Sadly I think it doesn't work because AIUI inodes are not pinned in kernel memory when no one has the file open and there are no dirty buffers, so I think the counters could go away and be reset. Perhaps you could keep inodes pinned by keeping the associated buffers dirty after an error (like FreeBSD), but if you did that you'd have solved the problem already and wouldn't really need the wb_err system at all. Is there some other idea along these lines that could work?

I think our underlying data syncing concept is fundamentally broken, and it's not really the kernel's fault.

We assume that we can safely:

procA: open()
procA: write()
procA: close()

... some long time later, unbounded as far as the kernel is concerned ...

procB: open()
procB: fsync()
procB: close()

If the kernel does writeback in the middle, how on earth is it supposed to know we expect to reopen the file and check back later?

Should it just remember "this file had an error" forever, and tell every caller? In that case how could we recover? We'd need some new API to say "yeah, ok already, I'm redoing all my work since the last good fsync() so you can clear the error flag now". Otherwise it'd keep reporting an error after we did redo to recover, too.

It never really clicked for me that we close relations with pending buffered writes, leave them closed, then reopen them to fsync. That's ... well, the kernel isn't the only thing doing crazy things here.

Right now I think we're at option (4): If you see anything that smells like a write error in your kernel logs, hard-kill postgres with -m immediate (do NOT let it do a shutdown checkpoint). If it did a checkpoint since the error appeared in the logs, fake up a backup label to force redo to start from the last checkpoint before the error. Otherwise, it's safe to just let it start up again and do redo again.

Fun times.

This also means AFAICS that running Pg on NFS is extremely unsafe, you MUST make sure you don't run out of disk. Because the usual safeguard of space reservation against ENOSPC in fsync doesn't apply to NFS. (I haven't tested this with nfsv3 in sync,hard,nointr mode yet, maybe that's safe, but I doubt it). The same applies to thin-provisioned storage. Just. Don't.

This helps explain various reports of corruption in Docker and various other tools that use various sorts of thin provisioning. If you hit ENOSPC in fsync(), bye bye data.


From:Peter Geoghegan <pg(at)bowt(dot)ie>
Date:2018-04-08 03:37:06

On Sat, Apr 7, 2018 at 8:27 PM, Craig Ringer wrote:

More below, but here's an idea #5: decide InnoDB has the right idea, and go to using a single massive blob file, or a few giant blobs.

We have a storage abstraction that makes this way, way less painful than it should be.

We can virtualize relfilenodes into storage extents in relatively few big files. We could use sparse regions to make the addressing more convenient, but that makes copying and backup painful, so I'd rather not.

Even one file per tablespace for persistent relation heaps, another for indexes, another for each fork type.

I'm not sure that we can do that now, since it would break the new "Optimize btree insertions for common case of increasing values" optimization. (I did mention this before it went in.)

I've asked Pavan to at least add a note to the nbtree README that explains the high level theory behind the optimization, as part of post-commit clean-up. I'll ask him to say something about how it might affect extent-based storage, too.


From:Christophe Pettus <xof(at)thebuild(dot)com>
Date:2018-04-08 03:46:17

On Apr 7, 2018, at 20:27, Craig Ringer wrote:

Right now I think we're at option (4): If you see anything that smells like a write error in your kernel logs, hard-kill postgres with -m immediate (do NOT let it do a shutdown checkpoint). If it did a checkpoint since the error appeared in the logs, fake up a backup label to force redo to start from the last checkpoint before the error. Otherwise, it's safe to just let it start up again and do redo again.

Before we spiral down into despair and excessive alcohol consumption, this is basically the same situation as a checksum failure or some other kind of uncorrected media-level error. The bad part is that we have to find out from the kernel logs rather than from PostgreSQL directly. But this does not strike me as otherwise significantly different from, say, an infrequently-accessed disk block reporting an uncorrectable error when we finally get around to reading it.


From:Andreas Karlsson <andreas(at)proxel(dot)se>
Date:2018-04-08 09:41:06

On 04/08/2018 05:27 AM, Craig Ringer wrote:

More below, but here's an idea #5: decide InnoDB has the right idea, and go to using a single massive blob file, or a few giant blobs.

FYI: MySQL has by default one file per table these days. The old approach with one massive file was a maintenance headache, so they changed the default some releases ago.

https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-08 10:30:31

On 8 April 2018 at 11:46, Christophe Pettus wrote:

On Apr 7, 2018, at 20:27, Craig Ringer wrote:

Right now I think we're at option (4): If you see anything that smells like a write error in your kernel logs, hard-kill postgres with -m immediate (do NOT let it do a shutdown checkpoint). If it did a checkpoint since the error appeared in the logs, fake up a backup label to force redo to start from the last checkpoint before the error. Otherwise, it's safe to just let it start up again and do redo again.

Before we spiral down into despair and excessive alcohol consumption, this is basically the same situation as a checksum failure or some other kind of uncorrected media-level error. The bad part is that we have to find out from the kernel logs rather than from PostgreSQL directly. But this does not strike me as otherwise significantly different from, say, an infrequently-accessed disk block reporting an uncorrectable error when we finally get around to reading it.

I don't entirely agree - because it affects ENOSPC, I/O errors on thin provisioned storage, I/O errors on multipath storage, etc. (I identified the original issue on a thin provisioned system that ran out of backing space, mangling PostgreSQL in a way that made no sense at the time).

These are way more likely than bit flips or other storage level corruption, and things that we previously expected to detect and fail gracefully for.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-08 10:31:24

On 8 April 2018 at 17:41, Andreas Karlsson wrote:

On 04/08/2018 05:27 AM, Craig Ringer wrote:

More below, but here's an idea #5: decide InnoDB has the right idea, and go to using a single massive blob file, or a few giant blobs.

FYI: MySQL has by default one file per table these days. The old approach with one massive file was a maintenance headache, so they changed the default some releases ago.

https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html

Huh, thanks for the update.

We should see how they handle reliable flushing and see if they've looked into it. If they haven't, we should give them a heads-up, and if they have, let's learn from them.


From:Christophe Pettus <xof(at)thebuild(dot)com>
Date:2018-04-08 16:38:03

On Apr 8, 2018, at 03:30, Craig Ringer wrote:

These are way more likely than bit flips or other storage level corruption, and things that we previously expected to detect and fail gracefully for.

This is definitely bad, and it explains a few otherwise-inexplicable corruption issues we've seen. (And great work tracking it down!) I think it's important not to panic, though; PostgreSQL doesn't have a reputation for horrible data integrity. I'm not sure it makes sense to do a major rearchitecting of the storage layer (especially with pluggable storage coming along) to address this. While the failure modes are more common, the solution (a PITR backup) is one that an installation should have anyway against media failures.


From:Greg Stark <stark(at)mit(dot)edu>
Date:2018-04-08 21:23:21

On 8 April 2018 at 04:27, Craig Ringer wrote:

On 8 April 2018 at 10:16, Thomas Munro wrote:

If the kernel does writeback in the middle, how on earth is it supposed to know we expect to reopen the file and check back later?

Should it just remember "this file had an error" forever, and tell every caller? In that case how could we recover? We'd need some new API to say "yeah, ok already, I'm redoing all my work since the last good fsync() so you can clear the error flag now". Otherwise it'd keep reporting an error after we did redo to recover, too.

There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep track of any errors. We just need fsync to accurately report whether all the buffers in the file have been written out. When you call fsync again the kernel needs to initiate i/o on all the dirty buffers and block until they complete successfully. If they complete successfully then nobody cares whether they had some failure in the past when i/o was initiated at some point in the past.

The problem is not that errors aren't being tracked correctly. The problem is that dirty buffers are being marked clean when they haven't been written out. They consider dirty filesystem buffers when there's hardware failure preventing them from being written "a memory leak".

As long as any error means the kernel has discarded writes then there's no real hope of any reliable operation through that interface.

Going to DIRECTIO is basically recognizing this. That the kernel filesystem buffer provides no reliable interface so we need to reimplement it ourselves in user space.

It's rather disheartening. Aside from having to do all that work we have the added barrier that we don't have as much information about the hardware as the kernel has. We don't know where raid stripes begin and end, how big the memory controller buffers are or how to tell when they're full or empty or how to flush them. etc etc. We also don't know what else is going on on the machine.


From:Christophe Pettus <xof(at)thebuild(dot)com>
Date:2018-04-08 21:28:43

On Apr 8, 2018, at 14:23, Greg Stark wrote:

They consider dirty filesystem buffers when there's hardware failure preventing them from being written "a memory leak".

That's not an irrational position. File system buffers are not dedicated memory for file system caching; they're being used for that because no one has a better use for them at that moment. If an inability to flush them to disk meant that they suddenly became pinned memory, a large copy operation to a yanked USB drive could result in the system having no more allocatable memory. I guess in theory that they could swap them, but swapping out a file system buffer in hopes that sometime in the future it could be properly written doesn't seem very architecturally sound to me.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-08 21:47:04

On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:

On 8 April 2018 at 04:27, Craig Ringer wrote:

On 8 April 2018 at 10:16, Thomas Munro wrote:

If the kernel does writeback in the middle, how on earth is it supposed to know we expect to reopen the file and check back later?

Should it just remember "this file had an error" forever, and tell every caller? In that case how could we recover? We'd need some new API to say "yeah, ok already, I'm redoing all my work since the last good fsync() so you can clear the error flag now". Otherwise it'd keep reporting an error after we did redo to recover, too.

There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep track of any errors. We just need fsync to accurately report whether all the buffers in the file have been written out. When you call fsync

Instead, fsync() reports when some of the buffers have not been written out, due to reasons outlined before. As such it may make some sense to maintain some tracking regarding errors even after marking failed dirty pages as clean (in fact it has been proposed, but this introduces memory overhead).

again the kernel needs to initiate i/o on all the dirty buffers and block until they complete successfully. If they complete successfully then nobody cares whether they had some failure in the past when i/o was initiated at some point in the past.

The question is, what should the kernel and application do in cases where this is simply not possible (according to freebsd that keeps dirty pages around after failure, for example, -EIO from the block layer is a contract for unrecoverable errors so it is pointless to keep them dirty). You'd need a specialized interface to clear-out the errors (and drop the dirty pages), or potentially just remount the filesystem.

The problem is not that errors aren't being tracked correctly. The problem is that dirty buffers are being marked clean when they haven't been written out. They consider dirty filesystem buffers when there's hardware failure preventing them from being written "a memory leak".

As long as any error means the kernel has discarded writes then there's no real hope of any reliable operation through that interface.

This does not necessarily follow. Whether the kernel discards writes or not would not really help (see above). It is more a matter of proper "reporting contract" between userspace and kernel, and tracking would be a way for facilitating this vs. having a more complex userspace scheme (as described by others in this thread) where synchronization for fsync() is required in a multi-process application.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-08 22:29:16

On Sun, Apr 8, 2018 at 09:38:03AM -0700, Christophe Pettus wrote:

On Apr 8, 2018, at 03:30, Craig Ringer wrote:

These are way more likely than bit flips or other storage level corruption, and things that we previously expected to detect and fail gracefully for.

This is definitely bad, and it explains a few otherwise-inexplicable corruption issues we've seen. (And great work tracking it down!) I think it's important not to panic, though; PostgreSQL doesn't have a reputation for horrible data integrity. I'm not sure it makes sense to do a major rearchitecting of the storage layer (especially with pluggable storage coming along) to address this. While the failure modes are more common, the solution (a PITR backup) is one that an installation should have anyway against media failures.

I think the big problem is that we don't have any way of stopping Postgres at the time the kernel reports the errors to the kernel log, so we are then returning potentially incorrect results and committing transactions that might be wrong or lost. If we could stop Postgres when such errors happen, at least the administrator could fix the problem or fail over to a standby.

A crazy idea would be to have a daemon that checks the logs and stops Postgres when it sees something wrong.
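
For what that idea would look like in practice, here is a rough sketch of a kernel-log watchdog (the matched strings and the response on a hit are assumptions, and it needs privileges to read /dev/kmsg):

/* Sketch of a "daemon that checks the logs": tail /dev/kmsg and flag records
 * that look like storage write errors. It replays the existing buffer first,
 * then blocks waiting for new records. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/kmsg", O_RDONLY);
    if (fd < 0)
    {
        perror("open /dev/kmsg");
        return 1;
    }

    char rec[8192];
    for (;;)
    {
        ssize_t n = read(fd, rec, sizeof rec - 1);  /* one log record per read */
        if (n <= 0)
            continue;                               /* e.g. EPIPE if we fell behind */
        rec[n] = '\0';
        if (strstr(rec, "I/O error") || strstr(rec, "lost page write") ||
            strstr(rec, "writeback error"))
        {
            fprintf(stderr, "possible storage write error: %s", rec);
            /* here one would stop PostgreSQL or initiate failover, per the suggestion above */
        }
    }
}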


From:Christophe Pettus <xof(at)thebuild(dot)com>
Date:2018-04-08 23:10:24

On Apr 8, 2018, at 15:29, Bruce Momjian wrote:

I think the big problem is that we don't have any way of stopping Postgres at the time the kernel reports the errors to the kernel log, so we are then returning potentially incorrect results and committing transactions that might be wrong or lost.

Yeah, it's bad. In the short term, the best advice to installations is to monitor their kernel logs for errors (which very few do right now), and make sure they have a backup strategy which can encompass restoring from an error like this. Even Craig's smart fix of patching the backup label to recover from a previous checkpoint doesn't do much good if we don't have WAL records back that far (or one of the required WAL records also took a hit).

In the longer term... O_DIRECT seems like the most plausible way out of this, but that might not be popular with people running on file systems or OSes that don't have this issue. (Setting aside the daunting prospect of implementing that.)


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-08 23:16:25

On 2018-04-08 18:29:16 -0400, Bruce Momjian wrote:

On Sun, Apr 8, 2018 at 09:38:03AM -0700, Christophe Pettus wrote:

On Apr 8, 2018, at 03:30, Craig Ringer wrote:

These are way more likely than bit flips or other storage level corruption, and things that we previously expected to detect and fail gracefully for.

This is definitely bad, and it explains a few otherwise-inexplicable corruption issues we've seen. (And great work tracking it down!) I think it's important not to panic, though; PostgreSQL doesn't have a reputation for horrible data integrity. I'm not sure it makes sense to do a major rearchitecting of the storage layer (especially with pluggable storage coming along) to address this. While the failure modes are more common, the solution (a PITR backup) is one that an installation should have anyway against media failures.

I think the big problem is that we don't have any way of stopping Postgres at the time the kernel reports the errors to the kernel log, so we are then returning potentially incorrect results and committing transactions that might be wrong or lost. If we could stop Postgres when such errors happen, at least the administrator could fix the problem or fail over to a standby.

A crazy idea would be to have a daemon that checks the logs and stops Postgres when it sees something wrong.

I think the danger presented here is far smaller than some of the statements in this thread might make one think. In all likelihood, once you've got an IO error that kernel level retries don't fix, your database is screwed. Whether fsync reports that or not is really somewhat beside the point. We don't panic that way when getting IO errors during reads either, and they're more likely to be persistent than errors during writes (because remapping on storage layer can fix issues, but not during reads).

There's a lot of not so great things here, but I don't think there's any need to panic.

We should fix things so that reported errors are treated with crash recovery, and for the rest I think there's very fair arguments to be made that that's far outside postgres's remit.

I think there's pretty good reasons to go to direct IO where supported, but error handling doesn't strike me as a particularly good reason for the move.


From:Christophe Pettus <xof(at)thebuild(dot)com>
Date:2018-04-08 23:27:57

On Apr 8, 2018, at 16:16, Andres Freund wrote:

We don't panic that way when getting IO errors during reads either, and they're more likely to be persistent than errors during writes (because remapping on storage layer can fix issues, but not during reads).

There is a distinction to be drawn there, though, because we immediately pass an error back to the client on a read, but a write problem in this situation can be masked for an extended period of time.

That being said...

There's a lot of not so great things here, but I don't think there's any need to panic.

No reason to panic, yes. We can assume that if this was a very big persistent problem, it would be much more widely reported. It would, however, be good to find a way to get the error surfaced back up to the client in a way that is not just monitoring the kernel logs.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-09 01:31:56

On 9 April 2018 at 05:28, Christophe Pettus wrote:

On Apr 8, 2018, at 14:23, Greg Stark wrote:

They consider dirty filesystem buffers when there's hardware failure preventing them from being written "a memory leak".

That's not an irrational position. File system buffers are not dedicated memory for file system caching; they're being used for that because no one has a better use for them at that moment. If an inability to flush them to disk meant that they suddenly became pinned memory, a large copy operation to a yanked USB drive could result in the system having no more allocatable memory. I guess in theory that they could swap them, but swapping out a file system buffer in hopes that sometime in the future it could be properly written doesn't seem very architecturally sound to me.

Yep.

Another example is a write to an NFS or iSCSI volume that goes away forever. What if the app keeps write()ing in the hopes it'll come back, and by the time the kernel starts reporting EIO for write(), it's already saddled with a huge volume of dirty writeback buffers it can't get rid of because someone, one day, might want to know about them?

You could make the argument that it's OK to forget if the entire file system goes away. But actually, why is that ok? What if it's remounted again? That'd be really bad too, for someone expecting write reliability.

You can coarsen from dirty buffer tracking to marking the FD(s) bad, but what if there's no FD to mark because the file isn't open at the moment?

You can mark the inode cache entry and pin it, I guess. But what if your app triggered I/O errors over vast numbers of small files? Again, the kernel's left holding the ball.

It doesn't know if/when an app will return to check. It doesn't know how long to remember the failure for. It doesn't know when all interested clients have been informed and it can treat the fault as cleared/repaired, either, so it'd have to keep on reporting EIO for PostgreSQL's own writes and fsyncs() indefinitely, even once we do recovery.

The only way it could avoid that would be to keep the dirty writeback pages around and flagged bad, then clear the flag when a new write() replaces the same file range. I can't imagine that being practical.

Blaming the kernel for this sure is the easy way out.

But IMO we cannot rationally expect the kernel to remember error state forever for us, then forget it when we expect, all without actually telling it anything about our activities or even that we still exist and are still interested in the files/writes. We've closed the files and gone away.

Whatever we do, it's likely going to have to involve not doing that anymore.

Even if we can somehow convince the kernel folks to add a new interface for us that reports I/O errors to some listener, like an inotify/fnotify/dnotify/whatever-it-is-today-notify extension reporting errors in buffered async writes, we won't be able to rely on having it for 5-10 years, and only on Linux.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-09 01:35:06

On 9 April 2018 at 06:29, Bruce Momjian wrote:

I think the big problem is that we don't have any way of stopping Postgres at the time the kernel reports the errors to the kernel log, so we are then returning potentially incorrect results and committing transactions that might be wrong or lost.

Right.

Specifically, we need a way to ask the kernel at checkpoint time "was everything written to [this set of files] flushed successfully since the last time I asked, no matter who did the writing and no matter how the writes were flushed?"

If the result is "no" we PANIC and redo. If the hardware/volume is screwed, the user can fail over to a standby, do PITR, etc.

But we don't have any way to ask that reliably at present.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 01:55:10

Hi,

On 2018-04-08 16:27:57 -0700, Christophe Pettus wrote:

On Apr 8, 2018, at 16:16, Andres Freund wrote:

We don't panic that way when getting IO errors during reads either, and they're more likely to be persistent than errors during writes (because remapping on storage layer can fix issues, but not during reads).

There is a distinction to be drawn there, though, because we immediately pass an error back to the client on a read, but a write problem in this situation can be masked for an extended period of time.

Only if you're "lucky" enough that your clients actually read that data, and then you're somehow able to figure out across the whole stack that these 0.001% of transactions that fail are due to IO errors. Or you also need to do log analysis.

If you want to solve things like that you need regular reads of all your data, including verifications etc.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-09 02:00:41

On 9 April 2018 at 07:16, Andres Freund wrote:

I think the danger presented here is far smaller than some of the statements in this thread might make one think.

Clearly it's not happening a huge amount or we'd have a lot of noise about Pg eating people's data, people shouting about how unreliable it is, etc. We don't. So it's not some earth shattering imminent threat to everyone's data. It's gone unnoticed, or the root cause unidentified, for a long time.

I suspect we've written off a fair few issues in the past as "it's bad hardware" when actually, the hardware fault was the trigger for a Pg/kernel interaction bug. And blamed containers for things that weren't really the container's fault. But even so, if it were happening tons, we'd hear more noise.

I was already very surprised when I learned that PostgreSQL completely ignores wholly absent relfilenodes. Specifically, if you unlink() a relation's backing relfilenode while Pg is down and that file has writes pending in the WAL, we merrily re-create it with uninitialized pages and go on our way. As Andres pointed out in an offlist discussion, redo isn't a consistency check, and it's not obliged to fail in such cases. We can say "well, don't do that then" and define away file losses from FS corruption etc as not our problem; the lower levels we expect to take care of this have failed.

We have to look at what checkpoints are and are not supposed to promise, and whether this is a problem we just define away as "not our problem, the lower level failed, we're not obliged to detect this and fail gracefully."

We can choose to say that checkpoints are required to guarantee crash/power loss safety ONLY and do not attempt to protect against I/O errors of any sort. In fact, I think we should likely amend the documentation for release versions to say just that.

In all likelihood, once you've got an IO error that kernel level retries don't fix, your database is screwed.

Your database is going to be down or have interrupted service. It's possible you may have some unreadable data. This could result in localised damage to one or more relations. That could affect FK relationships, indexes, all sorts. If you're really unlucky you might lose something critical like pg_clog/ contents.

But in general your DB should be repairable/recoverable even in those cases.

And in many failure modes there's no reason to expect any data loss at all, like:

  • Local disk fills up (seems to be safe already due to space reservation at write() time)
  • Thin-provisioned storage backing local volume iSCSI or paravirt block device fills up
  • NFS volume fills up
  • Multipath I/O error
  • Interruption of connectivity to network block device
  • Disk develops localized bad sector where we haven't previously written data

Except for the ENOSPC on NFS, all the rest of the cases can be handled by expecting the kernel to retry forever and not return until the block is written or we reach the heat death of the universe. And NFS, well...

Part of the trouble is that the kernel won't retry forever in all these cases, and doesn't seem to have a way to ask it to in all cases.

And if the user hasn't configured it for the right behaviour in terms of I/O error resilience, we don't find out about it.

So it's not the end of the world, but it'd sure be nice to fix.

Whether fsync reports that or not is really somewhat besides the point. We don't panic that way when getting IO errors during reads either, and they're more likely to be persistent than errors during writes (because remapping on storage layer can fix issues, but not during reads).

That's because reads don't make promises about what's committed and synced. I think that's quite different.

We should fix things so that reported errors are treated with crash recovery, and for the rest I think there's very fair arguments to be made that that's far outside postgres's remit.

Certainly for current versions.

I think we need to think about a more robust path in future. But it's certainly not "stop the world" territory.

The docs need an update to indicate that we explicitly disclaim responsibility for I/O errors on async writes, and that the kernel and I/O stack must be configured never to give up on buffered writes. If it does, that's not our problem anymore.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 02:06:12

On 2018-04-09 10:00:41 +0800, Craig Ringer wrote:

I suspect we've written off a fair few issues in the past as "it's bad hardware" when actually, the hardware fault was the trigger for a Pg/kernel interaction bug. And blamed containers for things that weren't really the container's fault. But even so, if it were happening tons, we'd hear more noise.

Agreed on that, but I think that's FAR more likely to be things like multixacts, index structure corruption due to logic bugs etc.

I was already very surprised when I learned that PostgreSQL completely ignores wholly absent relfilenodes. Specifically, if you unlink() a relation's backing relfilenode while Pg is down and that file has writes pending in the WAL, we merrily re-create it with uninitialized pages and go on our way. As Andres pointed out in an offlist discussion, redo isn't a consistency check, and it's not obliged to fail in such cases. We can say "well, don't do that then" and define away file losses from FS corruption etc as not our problem; the lower levels we expect to take care of this have failed.

And it'd be a realy bad idea to behave differently.

And in many failure modes there's no reason to expect any data loss at all, like:

  • Local disk fills up (seems to be safe already due to space reservation at write() time)

That definitely should be treated separately.

  • Thin-provisioned storage backing local volume iSCSI or paravirt block device fills up
  • NFS volume fills up

Those should be the same as the above.

I think we need to think about a more robust path in future. But it's certainly not "stop the world" territory.

I think you're underestimating the complexity of doing that by at least two orders of magnitude.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-09 03:15:01

On 9 April 2018 at 10:06, Andres Freund wrote:

And in many failure modes there's no reason to expect any data loss at all, like:

  • Local disk fills up (seems to be safe already due to space reservation at write() time)

That definitely should be treated separately.

It is, because all the FSes I looked at reserve space before returning from write(), even if they do delayed allocation. So they won't fail with ENOSPC at fsync() time or silently due to lost errors on background writeback. Otherwise we'd be hearing a LOT more noise about this.

  • Thin-provisioned storage backing local volume iSCSI or paravirt block device fills up
  • NFS volume fills up

Those should be the same as the above.

Unfortunately, they aren't.

AFAICS NFS doesn't reserve space with the other end before returning from write(), even if mounted with the sync option. So we can get ENOSPC lazily when the buffer writeback fails due to a full backing file system. This then travels the same paths as EIO: we fsync(), ERROR, retry, appear to succeed, and carry on with life losing the data. Or we never hear about the error in the first place.

(There's a proposed extension that'd allow this, see https://tools.ietf.org/html/draft-iyer-nfsv4-space-reservation-ops-02#page-5, but I see no mention of it in fs/nfs. All the reserve_space / xdr_reserve_space stuff seems to be related to space in protocol messages at a quick read.)

Thin provisioned storage could vary a fair bit depending on the implementation. But the specific failure case I saw, prompting this thread, was on a volume using the stack:

xfs -> lvm2 -> multipath -> ??? -> SAN

(the HBA/iSCSI/whatever was not recorded by the looks, but IIRC it was iSCSI. I'm checking.)

The SAN ran out of space. Due to use of thin provisioning, Linux thought there was plenty of space on the volume; LVM thought it had plenty of physical extents free and unallocated, XFS thought there was tons of free space, etc. The space exhaustion manifested as I/O errors on flushes of writeback buffers.

The logs were like this:

kernel: sd 2:0:0:1: [sdd] Unhandled sense code
kernel: sd 2:0:0:1: [sdd]
kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 2:0:0:1: [sdd]
kernel: Sense Key : Data Protect [current]
kernel: sd 2:0:0:1: [sdd]
kernel: Add. Sense: Space allocation failed write protect
kernel: sd 2:0:0:1: [sdd] CDB:
kernel: Write(16): **HEX-DATA-CUT-OUT**
kernel: Buffer I/O error on device dm-0, logical block 3098338786
kernel: lost page write due to I/O error on dm-0
kernel: Buffer I/O error on device dm-0, logical block 3098338787

The immediate cause was that Linux's multipath driver didn't seem to recognise the sense code as retryable, so it gave up and reported it to the next layer up (LVM). LVM and XFS both seem to think that the lower layer is responsible for retries, so they toss the write away, and tell any interested writers if they feel like it, per discussion upthread.

In this case Pg did get the news and reported fsync() errors on checkpoints, but it only reported an error once per relfilenode. Once it ran out of failed relfilenodes to cause the checkpoint to ERROR, it "completed" a "successful" checkpoint and kept on running until the resulting corruption started to manifest itself and it segfaulted some time later. As we've now learned, there's no guarantee we'd even get the news about the I/O errors at all.

WAL was on a separate volume that didn't run out of room immediately, so we didn't PANIC on WAL write failure and prevent the issue.

In this case if Pg had PANIC'd (and been able to guarantee to get the news of write failures reliably), there'd have been no corruption and no data loss despite the underlying storage issue.

If, prior to seeing this, you'd asked me "will my PostgreSQL database be corrupted if my thin-provisioned volume runs out of space" I'd have said "Surely not. PostgreSQL won't be corrupted by running out of disk space, it orders writes carefully and forces flushes so that it will recover gracefully from write failures."

Except not. I was very surprised.

BTW, it also turns out that the default for multipath is to give up on errors anyway; see the queue_if_no_path and no_path_retry options. (Hint: run PostgreSQL with no_path_retry=queue.) That's a sane default if you use O_DIRECT|O_SYNC, and otherwise pretty much a data-eating setup.

I regularly see rather a lot of multipath systems, iSCSI systems, SAN backed systems, etc. I think we need to be pretty clear that we expect them to retry indefinitely, and if they report an I/O error we cannot reliably handle it. We need to patch Pg to PANIC on any fsync() failure and document that Pg won't notice some storage failure modes that might otherwise be considered nonfatal or transient, so very specific storage configuration and testing is required. (Not that anyone will do it). Also warn against running on NFS even with "hard,sync,nointr".

It'd be interesting to have a tool that tested error handling, allowing people to do iSCSI plug-pull tests, that sort of thing. But as far as I can tell nobody ever tests their storage stack anyway, so I don't plan on writing something that'll never get used.

I think we need to think about a more robust path in future. But it's certainly not "stop the world" territory.

I think you're underestimating the complexity of doing that by at least two orders of magnitude.

Oh, it's just a minor total rewrite of half Pg, no big deal ;)

I'm sure that no matter how big I think it is, I'm still underestimating it.

The most workable option IMO would be some sort of fnotify/dnotify/whatever that reports all I/O errors on a volume. Some kind of error reporting handle we can keep open on a volume level that we can check for each volume/tablespace after we fsync() everything to see if it all really worked. If we PANIC if that gives us a bad answer, and PANIC on fsync errors, we guard against the great majority of these sorts of should-be-transient-if-the-kernel-didn't-give-up-and-throw-away-our-data errors.

Even then, good luck getting those events from an NFS volume in which the backing volume experiences an issue.

And it's kind of moot because AFAICS no such interface exists.


From:Greg Stark <stark(at)mit(dot)edu>
Date:2018-04-09 08:45:40

On 8 April 2018 at 22:47, Anthony Iliopoulos wrote:

On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:

On 8 April 2018 at 04:27, Craig Ringer wrote:

On 8 April 2018 at 10:16, Thomas Munro

The question is, what should the kernel and application do in cases where this is simply not possible (according to freebsd that keeps dirty pages around after failure, for example, -EIO from the block layer is a contract for unrecoverable errors so it is pointless to keep them dirty). You'd need a specialized interface to clear-out the errors (and drop the dirty pages), or potentially just remount the filesystem.

Well firstly that's not necessarily the question. ENOSPC is not an unrecoverable error. And even unrecoverable errors for a single write doesn't mean the write will never be able to succeed in the future. But secondly doesn't such an interface already exist? When the device is dropped any dirty pages already get dropped with it. What's the point in dropping them but keeping the failing device?

But just to underline the point. "pointless to keep them dirty" is exactly backwards from the application's point of view. If the error writing to persistent media really is unrecoverable, then it's all the more critical that the pages be kept so the data can be copied to some other device. The last thing user space expects is that, if the data can't be written to persistent storage, it is also immediately deleted from RAM. (And the really last thing user space expects is for this to happen and return no error.)


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-09 10:50:41

On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote:

On 8 April 2018 at 22:47, Anthony Iliopoulos wrote:

On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:

On 8 April 2018 at 04:27, Craig Ringer wrote:

On 8 April 2018 at 10:16, Thomas Munro

The question is, what should the kernel and application do in cases where this is simply not possible (according to freebsd that keeps dirty pages around after failure, for example, -EIO from the block layer is a contract for unrecoverable errors so it is pointless to keep them dirty). You'd need a specialized interface to clear-out the errors (and drop the dirty pages), or potentially just remount the filesystem.

Well firstly that's not necessarily the question. ENOSPC is not an unrecoverable error. And even unrecoverable errors for a single write doesn't mean the write will never be able to succeed in the future.

To make things a bit simpler, let us focus on EIO for the moment. The contract between the block layer and the filesystem layer is assumed to be that of, when an EIO is propagated up to the fs, then you may assume that all possibilities for recovering have been exhausted in lower layers of the stack. Mind you, I am not claiming that this contract is either documented or necessarily respected (in fact there have been studies on the error propagation and handling of the block layer, see [1]). Let us assume that this is the design contract though (which appears to be the case across a number of open-source kernels), and if not - it's a bug. In this case, indeed the specific write()s will never be able to succeed in the future, at least not as long as the BIOs are allocated to the specific failing LBAs.

But secondly doesn't such an interface already exist? When the device is dropped any dirty pages already get dropped with it. What's the point in dropping them but keeping the failing device?

I think there are degrees of failure. There are certainly cases where one may encounter localized unrecoverable medium errors (specific to certain LBAs) that are non-maskable from the block layer and below. That does not mean that the device is dropped at all, so it does make sense to continue all other operations to all other regions of the device that are functional. In cases of total device failure, then the filesystem will prevent you from proceeding anyway.

But just to underline the point. "pointless to keep them dirty" is exactly backwards from the application's point of view. If the error writing to persistent media really is unrecoverable then it's all the more critical that the pages be kept so the data can be copied to some other device. The last thing user space expects to happen is if the data can't be written to persistent storage then also immediately delete it from RAM. (And the really last thing user space expects is for this to happen and return no error.)

Right. This implies, though, that apart from the kernel having to keep around the dirtied-but-unrecoverable pages for an unbounded time, there would also need to be an interface for obtaining the exact failed pages so that you can read them back. This in turn means that there needs to be an association between the fsync() caller and the specific dirtied pages that the caller intends to drain (for which we'd need an fsync_range(), among other things). BTW, currently the failed writebacks are not dropped from memory, but rather marked clean. They could be lost though due to memory pressure or due to explicit request (e.g. proc drop_caches), unless mlocked.

There is a clear responsibility of the application to keep its buffers around until a successful fsync(). The kernels do report the error (albeit with all the complexities of dealing with the interface), at which point the application may not assume that the write()s were ever even buffered in the kernel page cache in the first place.

What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and indemnifying the application from any further responsibility, i.e. a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.

[1] https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi.pdf
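
The application-side discipline being described (keep your own copy of the data until fsync() reports success, and treat an fsync() failure as "the kernel may have dropped it") sketches out roughly like this; the whole-file rewrite and the retry count are assumptions for illustration only:

/* Sketch: the buffer stays ours until fsync() succeeds. On any failure we
 * reopen and rewrite from our copy rather than trusting the page cache. */
#include <fcntl.h>
#include <unistd.h>

static int durable_write(const char *path, const void *buf, size_t len)
{
    for (int attempt = 0; attempt < 3; attempt++)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) == (ssize_t) len && fsync(fd) == 0)
        {
            close(fd);
            return 0;       /* only now is it safe to discard our copy of buf */
        }
        close(fd);          /* failed: our copy is still the only trustworthy one */
    }
    return -1;
}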


From:Geoff Winkless <pgsqladmin(at)geoff(dot)dj>
Date:2018-04-09 12:03:28

On 9 April 2018 at 11:50, Anthony Iliopoulos wrote:

What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and indemnifying the application from any further responsibility, i.e. a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.

That seems like a perfectly reasonable position to take, frankly.

The whole point of an Operating System should be that you can do exactly that. As a developer I should be able to call write() and fsync() and know that if both calls have succeeded then the result is on disk, no matter what another application has done in the meantime. If that's a "difficult" problem then that's the OS's problem, not mine. If the OS doesn't do that, it's _not_ doing its job.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-09 12:16:38

On 9 April 2018 at 18:50, Anthony Iliopoulos wrote:

There is a clear responsibility of the application to keep its buffers around until a successful fsync(). The kernels do report the error (albeit with all the complexities of dealing with the interface), at which point the application may not assume that the write()s were ever even buffered in the kernel page cache in the first place.

What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and indemnifying the application from any further responsibility, i.e. a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.

That's what Pg appears to assume now, yes.

Whether that's reasonable is a whole different topic.

I'd like a middle ground where the kernel lets us register our interest and tells us if it lost something, without us having to keep eight million FDs open for some long period. "Tell us about anything that happens under pgdata/" or an inotify-style per-directory-registration option. I'd even say that's ideal.

In the meantime, I propose that we fsync() on close() before we age FDs out of the LRU on backends. Yes, that will hurt throughput and cause stalls, but we don't seem to have many better options. At least it'll only flush what we actually wrote to the OS buffers, not what we may have in shared_buffers. If the bgwriter does the same thing, we should be 100% safe from this problem on 4.13+, and it'd be trivial to make it a GUC much like the fsync or full_page_writes options that people can turn off if they know the risks / know their storage is safe / don't care.

Some keen person could later optimise it by adding an fsync worker thread pool in backends, so we don't block the main thread. Frankly that might be a nice thing to have in the checkpointer anyway. But it's out of scope for fixing this in durability terms.

I'm partway through a patch that makes fsync panic on errors now. Once that's done, the next step will be to force fsync on close() in md and see how we go with that.
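
A rough sketch of the fsync-before-close discipline proposed above, assuming plain POSIX calls (illustrative only; in the actual proposal a failure at this point would PANIC rather than return an error code):

    #include <unistd.h>

    /*
     * Never let a descriptor that may carry dirty, unreported writeback
     * state be aged out of the LRU without first forcing and checking a
     * flush.  In the real proposal a failure here is treated as fatal.
     */
    static int
    close_with_fsync(int fd)
    {
        if (fsync(fd) != 0)
            return -1;          /* caller must treat this as fatal */
        return close(fd);
    }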


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-09 12:31:27

On Mon, Apr 09, 2018 at 01:03:28PM +0100, Geoff Winkless wrote:

On 9 April 2018 at 11:50, Anthony Iliopoulos wrote:

What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and indemnifying the application from any further responsibility, i.e. a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.

That seems like a perfectly reasonable position to take, frankly.

Indeed, as long as you are willing to ignore the consequences of this design decision: mainly, how you would recover memory when no application is interested in clearing the error. At which point other applications with different priorities will find this position rather unreasonable since there can be no way out of it for them. Good luck convincing any OS kernel upstream to go with this design.

The whole point of an Operating System should be that you can do exactly that. As a developer I should be able to call write() and fsync() and know that if both calls have succeeded then the result is on disk, no matter what another application has done in the meantime. If that's a "difficult" problem then that's the OS's problem, not mine. If the OS doesn't do that, it's _not_ doing its job.

No OS kernel that I know of provides any promises for atomicity of a write()+fsync() sequence, unless one is using O_SYNC. It doesn't provide you with isolation either, as this is delegated to userspace, where processes that share a file should coordinate accordingly.

It's not a difficult problem, but rather the kernels provide a common denominator of possible interfaces and designs that could accommodate a wider range of potential application scenarios for which the kernel cannot possibly anticipate requirements. There have been plenty of experimental works for providing a transactional (ACID) filesystem interface to applications. On the opposite end, there have been quite a few commercial databases that completely bypass the kernel storage stack. But I would assume it is reasonable to figure out something between those two extremes that can work in a "portable" fashion.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-09 12:54:16

On Mon, Apr 09, 2018 at 08:16:38PM +0800, Craig Ringer wrote:

I'd like a middle ground where the kernel lets us register our interest and tells us if it lost something, without us having to keep eight million FDs open for some long period. "Tell us about anything that happens under pgdata/" or an inotify-style per-directory-registration option. I'd even say that's ideal.

I see what you are saying. So basically you'd always maintain the notification descriptor open, where the kernel would inject events related to writeback failures of files under watch (potentially enriched to contain info regarding the exact failed pages and the file offset they map to). The kernel wouldn't even have to maintain per-page bits to trace the errors, since they will be consumed by the process that reads the events (or discarded, when the notification fd is closed).

Assuming this would be possible, wouldn't Pg still need to deal with synchronizing writers and related issues (since this would be merely a notification mechanism that does not prevent any process from continuing)? I understand that would be rather intrusive for the current Pg multi-process design.

But other than that, this interface could in principle be implemented in the BSDs via kqueue(), I suppose, to provide what you need.


From:Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Date:2018-04-09 13:33:18

On 04/09/2018 02:31 PM, Anthony Iliopoulos wrote:

On Mon, Apr 09, 2018 at 01:03:28PM +0100, Geoff Winkless wrote:

On 9 April 2018 at 11:50, Anthony Iliopoulos wrote:

What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and indemnifying the application from any further responsibility, i.e. a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.

That seems like a perfectly reasonable position to take, frankly.

Indeed, as long as you are willing to ignore the consequences of this design decision: mainly, how you would recover memory when no application is interested in clearing the error. At which point other applications with different priorities will find this position rather unreasonable since there can be no way out of it for them.

Sure, but the question is whether the system can reasonably operate after some of the writes failed and the data got lost. Because if it can't, then recovering the memory is rather useless. It might be better to stop the system in that case, forcing the system administrator to resolve the issue somehow (fail-over to a replica, perform recovery from the last checkpoint, ...).

We already have dirty_bytes and dirty_background_bytes, for example. I don't see why there couldn't be another limit defining how much dirty data to allow before blocking writes altogether. I'm sure it's not that simple, but you get the general idea - do not allow using all available memory because of writeback issues, but don't throw the data away in case it's just a temporary issue.

Good luck convincing any OS kernel upstream to go with this design.

Well, there seem to be kernels that do exactly that already. At least that's how I understand what this thread says about FreeBSD and Illumos, for example. So it's not an entirely insane design, apparently.

The question is whether the current design makes it any easier for user-space developers to build reliable systems. We have tried using it, and unfortunately the answer seems to be "no" and "Use direct I/O and manage everything on your own!"

The whole point of an Operating System should be that you can do exactly that. As a developer I should be able to call write() and fsync() and know that if both calls have succeeded then the result is on disk, no matter what another application has done in the meantime. If that's a "difficult" problem then that's the OS's problem, not mine. If the OS doesn't do that, it's _not_ doing its job.

No OS kernel that I know of provides any promises for atomicity of a write()+fsync() sequence, unless one is using O_SYNC. It doesn't provide you with isolation either, as this is delegated to userspace, where processes that share a file should coordinate accordingly.

We can (and do) take care of the atomicity and isolation. Implementation of those parts is obviously very application-specific, and we have WAL and locks for that purpose. I/O on the other hand seems to be a generic service provided by the OS - at least that's how we saw it until now.

It's not a difficult problem, but rather the kernels provide a common denominator of possible interfaces and designs that could accommodate a wider range of potential application scenarios for which the kernel cannot possibly anticipate requirements. There have been plenty of experimental works for providing a transactional (ACID) filesystem interface to applications. On the opposite end, there have been quite a few commercial databases that completely bypass the kernel storage stack. But I would assume it is reasonable to figure out something between those two extremes that can work in a "portable" fashion.

Users ask us about this quite often, actually. The question is usually about "RAW devices" and performance, but ultimately it boils down to buffered vs. direct I/O. So far our answer was we rely on kernel to do this reliably, because they know how to do that correctly and we simply don't have the manpower to implement it (portable, reliable, handling different types of storage, ...).

One has to wonder how many applications actually use this correctly, considering PostgreSQL cares about data durability/consistency so much and yet we've been misunderstanding how it works for 20+ years.


From:Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Date:2018-04-09 13:42:35

On 04/09/2018 12:29 AM, Bruce Momjian wrote:

A crazy idea would be to have a daemon that checks the logs and stops Postgres when it sees something wrong.

That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how that would work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages change based on kernel version (and possibly filesystem), so parsing them reliably seems rather difficult. And we probably don't want to PANIC after an I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.


From:Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
Date:2018-04-09 13:47:03

At 2018-04-09 15:42:35 +0200, tomas(dot)vondra(at)2ndquadrant(dot)com wrote:

On 04/09/2018 12:29 AM, Bruce Momjian wrote:

A crazy idea would be to have a daemon that checks the logs and stops Postgres when it sees something wrong.

That doesn't seem like a very practical way.

Not least because Craig's tests showed that you can't rely on always getting an error message in the logs.


From:Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Date:2018-04-09 13:54:19

On 04/09/2018 04:00 AM, Craig Ringer wrote:

On 9 April 2018 at 07:16, Andres Freund <andres(at)anarazel(dot)de> wrote:

I think the danger presented here is far smaller than some of the statements in this thread might make one think.

Clearly it's not happening a huge amount or we'd have a lot of noise about Pg eating people's data, people shouting about how unreliable it is, etc. We don't. So it's not some earth shattering imminent threat to everyone's data. It's gone unnoticed, or the root cause unidentified, for a long time.

Yeah, it clearly isn't the case that everything we do suddenly got pointless. It's fairly annoying, though.

I suspect we've written off a fair few issues in the past as "it was bad hardware" when actually the hardware fault was the trigger for a Pg/kernel interaction bug. And we've blamed containers for things that weren't really the container's fault. But even so, if it were happening tons, we'd hear more noise.

Right. Write errors are fairly rare, and we've probably ignored a fair number of cases demonstrating this issue. It kinda reminds me of the observation that not seeing planes with bullet holes in the engine does not mean engines don't need armor [1].

[1] https://medium.com/@penguinpress/an-excerpt-from-how-not-to-be-wrong-by-jordan-ellenberg-664e708cfc3d


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-09 14:22:06

On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:

We already have dirty_bytes and dirty_background_bytes, for example. I don't see why there couldn't be another limit defining how much dirty data to allow before blocking writes altogether. I'm sure it's not that simple, but you get the general idea - do not allow using all available memory because of writeback issues, but don't throw the data away in case it's just a temporary issue.

Sure, there could be knobs for limiting how much memory such "zombie" pages may occupy. Not sure how helpful it would be in the long run since this tends to be highly application-specific, and for something with a large data footprint one would end up tuning this accordingly in a system-wide manner. This has the potential to leave other applications running in the same system with very little memory, in cases where for example original application crashes and never clears the error. Apart from that, further interfaces would need to be provided for actually dealing with the error (again assuming non-transient issues that may not be fixed transparently and that temporary issues are taken care of by lower layers of the stack).

Well, there seem to be kernels that do exactly that already. At least that's how I understand what this thread says about FreeBSD and Illumos, for example. So it's not an entirely insane design, apparently.

It is reasonable, but even FreeBSD has a big fat comment right there (since 2017), mentioning that there can be no recovery from EIO at the block layer and this needs to be done differently. No idea how an application running on top of either FreeBSD or Illumos would actually recover from this error (and clear it out), other than remounting the fs in order to force dropping of the relevant pages. It does indeed provide a persistent error indication that would allow Pg to simply and reliably panic. But again this does not necessarily play well with other applications that may be using the filesystem reliably at the same time, and are now faced with EIO while their own writes are successfully persisted.

Ideally, you'd want a (potentially persistent) indication of error localized to a file region (mapping the corresponding failed writeback pages). NetBSD is already implementing fsync_ranges(), which could be a step in the right direction.

One has to wonder how many applications actually use this correctly, considering PostgreSQL cares about data durability/consistency so much and yet we've been misunderstanding how it works for 20+ years.

I would expect it would be very few, potentially those that have a very simple process model (e.g. embedded DBs that can abort a txn on fsync() EIO). I think that durability is a rather complex cross-layer issue which has been grossly misunderstood similarly in the past (e.g. see [1]). It seems that both the OS and DB communities greatly benefit from a periodic reality check, and I see this as an opportunity for strengthening the IO stack in an end-to-end manner.

[1] https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf
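
For reference, a sketch against NetBSD's fsync_range(2) interface mentioned above. The flag names and semantics here are NetBSD-specific and this is an assumption about how such a call might be used, not tested code; the point is simply that only the byte range a caller actually dirtied gets flushed and checked.

    #include <sys/types.h>
    #include <unistd.h>

    /*
     * Sketch assuming NetBSD's fsync_range(2): flush only the given byte
     * range rather than the whole file.  FDATASYNC syncs the data (not
     * necessarily all metadata); FDISKSYNC additionally asks the drive to
     * flush its write cache.
     */
    static int
    flush_region(int fd, off_t start, off_t length)
    {
        return fsync_range(fd, FDATASYNC | FDISKSYNC, start, length);
    }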


From:Greg Stark <stark(at)mit(dot)edu>
Date:2018-04-09 15:29:36

On 9 April 2018 at 15:22, Anthony Iliopoulos wrote:

On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:

Sure, there could be knobs for limiting how much memory such "zombie" pages may occupy. Not sure how helpful it would be in the long run since this tends to be highly application-specific, and for something with a large data footprint one would end up tuning this accordingly in a system-wide manner.

Surely this is exactly what the kernel is there to manage. It has to control how much memory is allowed to be full of dirty buffers in the first place to ensure that the system won't get memory starved if it can't clean them fast enough. That isn't even about persistent hardware errors. Even when the hardware is working perfectly it can only flush buffers so fast. The whole point of the kernel is to abstract away shared resources. It's not like user space has any better view of the situation here. If Postgres implemented all this with direct I/O it would have exactly the same problem, only with less visibility into what the rest of the system is doing. If every application implemented its own buffer cache we would be back in the same boat, only with fragmented memory allocation.

This has the potential to leave other applications running in the same system with very little memory, in cases where for example original application crashes and never clears the error.

I still think we're speaking two different languages. There's no application anywhere that's going to "clear the error". The application has done the writes and if it's calling fsync it wants to wait until the filesystem can arrange for the write to be persisted. If the application could manage without the persistence then it wouldn't have called fsync.

The only way to "clear out" the error would be by having the writes succeed. There's no reason to think that wouldn't be possible sometime. The filesystem could remap blocks or an administrator could replace degraded raid device components. The only thing Postgres could do to recover would be create a new file and move the data (reading from the dirty buffer in memory!) to a new file anyways so we would "clear the error" by just no longer calling fsync on the old file.

We always read fsync as a simple write barrier. That's what the documentation promised and it's what Postgres always expected. It sounds like the kernel implementors looked at it as some kind of communication channel to communicate status reports for specific writes back to user space. That's a much more complex problem and would have an entirely different interface. I think this is why we're having so much difficulty communicating.

It is reasonable, but even FreeBSD has a big fat comment right there (since 2017), mentioning that there can be no recovery from EIO at the block layer and this needs to be done differently. No idea how an application running on top of either FreeBSD or Illumos would actually recover from this error (and clear it out), other than remounting the fs in order to force dropping of the relevant pages. It does indeed provide a persistent error indication that would allow Pg to simply and reliably panic. But again this does not necessarily play well with other applications that may be using the filesystem reliably at the same time, and are now faced with EIO while their own writes are successfully persisted.

Well if they're writing to the same file that had a previous error I doubt there are many applications that would be happy to consider their writes "persisted" when the file was corrupt. Ironically the earlier discussion quoted talked about how applications that wanted more granular communication would be using O_DIRECT -- but what we have is fsync trying to be too granular such that it's impossible to get any strong guarantees about anything with it.

One has to wonder how many applications actually use this correctly, considering PostgreSQL cares about data durability/consistency so much and yet we've been misunderstanding how it works for 20+ years.

I would expect it would be very few, potentially those that have a very simple process model (e.g. embedded DBs that can abort a txn on fsync() EIO).

Honestly I don't think there's any way to use the current interface to implement reliable operation. Even that embedded database using a single process and keeping every file open all the time (which means file descriptor limits limit its scalability) can still suffer silent corruption whenever some other process like a backup program comes along and calls fsync (or even sync?).


From:Robert Haas <robertmhaas(at)gmail(dot)com>
Date:2018-04-09 16:45:00

On Mon, Apr 9, 2018 at 8:16 AM, Craig Ringer wrote:

In the meantime, I propose that we fsync() on close() before we age FDs out of the LRU on backends. Yes, that will hurt throughput and cause stalls, but we don't seem to have many better options. At least it'll only flush what we actually wrote to the OS buffers, not what we may have in shared_buffers. If the bgwriter does the same thing, we should be 100% safe from this problem on 4.13+, and it'd be trivial to make it a GUC much like the fsync or full_page_writes options that people can turn off if they know the risks / know their storage is safe / don't care.

Ouch. If a process exits -- say, because the user typed \q into psql -- then you're talking about potentially calling fsync() on a really large number of file descriptors, flushing many gigabytes of data to disk. And it may well be that you never actually wrote any data to any of those file descriptors -- those writes could have come from other backends. Or you may have written a little bit of data through those FDs, but there could be lots of other data that you end up flushing incidentally. Perfectly innocuous things like starting up a backend, running a few short queries, and then having that backend exit suddenly turn into something that could have a massive system-wide performance impact.

Also, if a backend ever manages to exit without running through this code, or writes any dirty blocks afterward, then this still fails to fix the problem completely. I guess that's probably avoidable -- we can put this late in the shutdown sequence and PANIC if it fails.

I have a really tough time believing this is the right way to solve the problem. We suffered for years because of ext3's desire to flush the entire page cache whenever any single file was fsync()'d, which was terrible. Eventually ext4 became the norm, and the problem went away. Now we're going to deliberately insert logic to do a very similar kind of terrible thing because the kernel developers have decided that fsync() doesn't have to do what it says on the tin? I grant that there doesn't seem to be a better option, but I bet we're going to have a lot of really unhappy users if we do this.


From:"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Date:2018-04-09 17:26:24

On 04/09/2018 09:45 AM, Robert Haas wrote:

On Mon, Apr 9, 2018 at 8:16 AM, Craig Ringer wrote:

In the meantime, I propose that we fsync() on close() before we age FDs out of the LRU on backends. Yes, that will hurt throughput and cause stalls, but we don't seem to have many better options. At least it'll only flush what we actually wrote to the OS buffers, not what we may have in shared_buffers. If the bgwriter does the same thing, we should be 100% safe from this problem on 4.13+, and it'd be trivial to make it a GUC much like the fsync or full_page_writes options that people can turn off if they know the risks / know their storage is safe / don't care.

I have a really tough time believing this is the right way to solve the problem. We suffered for years because of ext3's desire to flush the entire page cache whenever any single file was fsync()'d, which was terrible. Eventually ext4 became the norm, and the problem went away. Now we're going to deliberately insert logic to do a very similar kind of terrible thing because the kernel developers have decided that fsync() doesn't have to do what it says on the tin? I grant that there doesn't seem to be a better option, but I bet we're going to have a lot of really unhappy users if we do this.

I don't have a better option but whatever we do, it should be an optional (GUC) change. We have plenty of YEARS of people not noticing this issue and Robert's correct, if we go back to an era of things like stalls it is going to look bad on us no matter how we describe the problem.


From:Gasper Zejn <zejn(at)owca(dot)info>
Date:2018-04-09 18:02:21

On 09. 04. 2018 15:42, Tomas Vondra wrote:

On 04/09/2018 12:29 AM, Bruce Momjian wrote:

A crazy idea would be to have a daemon that checks the logs and stops Postgres when it sees something wrong.

That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how that would work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages change based on kernel version (and possibly filesystem), so parsing them reliably seems rather difficult. And we probably don't want to PANIC after an I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.

regards

For a bit less (or more) crazy idea, I'd imagine that creating a Linux kernel module with kprobe/kretprobe capturing the file passed to fsync (or even the byte range within the file) and the corresponding return value shouldn't be that hard. Kprobe has been a part of the Linux kernel for a really long time, and from first glance it seems like it could be backported to 2.6 too.

Then you could have stable log messages or implement some kind of "fsync error log notification" via whatever is the most sane way to get this out of kernel.

If the kernel is new enough and has eBPF support (seems like >=4.4), using bcc-tools[1] should enable you to write a quick script to get exactly that info via perf events[2].

Obviously, that's a stopgap solution ...

[1] https://github.com/iovisor/bcc [2] https://blog.yadutaf.fr/2016/03/30/turn-any-syscall-into-event-introducing-ebpf-kernel-probes/


From:Mark Dilger <hornschnorter(at)gmail(dot)com>
Date:2018-04-09 18:29:42

On Apr 9, 2018, at 10:26 AM, Joshua D. Drake wrote:

We have plenty of YEARS of people not noticing this issue

I disagree. I have noticed this problem, but blamed it on other things. For over five years now, I have had to tell customers not to use thin provisioning, and I have had to add code to postgres to refuse to perform inserts or updates if the disk volume is more than 80% full. I have lost count of the number of customers who are running an older version of the product (because they refuse to upgrade) and come back with complaints that they ran out of disk and now their database is corrupt. All this time, I have been blaming this on virtualization and thin provisioning.


From:Robert Haas <robertmhaas(at)gmail(dot)com>
Date:2018-04-09 19:02:11

On Mon, Apr 9, 2018 at 12:45 PM, Robert Haas wrote:

Ouch. If a process exits -- say, because the user typed \q into psql -- then you're talking about potentially calling fsync() on a really large number of file descriptors, flushing many gigabytes of data to disk. And it may well be that you never actually wrote any data to any of those file descriptors -- those writes could have come from other backends. Or you may have written a little bit of data through those FDs, but there could be lots of other data that you end up flushing incidentally. Perfectly innocuous things like starting up a backend, running a few short queries, and then having that backend exit suddenly turn into something that could have a massive system-wide performance impact.

Also, if a backend ever manages to exit without running through this code, or writes any dirty blocks afterward, then this still fails to fix the problem completely. I guess that's probably avoidable -- we can put this late in the shutdown sequence and PANIC if it fails.

I have a really tough time believing this is the right way to solve the problem. We suffered for years because of ext3's desire to flush the entire page cache whenever any single file was fsync()'d, which was terrible. Eventually ext4 became the norm, and the problem went away. Now we're going to deliberately insert logic to do a very similar kind of terrible thing because the kernel developers have decided that fsync() doesn't have to do what it says on the tin? I grant that there doesn't seem to be a better option, but I bet we're going to have a lot of really unhappy users if we do this.

What about the bug we fixed in https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ce439f3379aed857517c8ce207485655000fc8e ? Say somebody does something along the lines of:

ps uxww | grep postgres | grep -v grep | awk '{print $2}' | xargs kill -9

...and then restarts postgres. Craig's proposal wouldn't cover this case, because there was no opportunity to run fsync() after the first crash, and there's now no way to go back and fsync() any stuff we didn't fsync() before, because the kernel may have already thrown away the error state, or may lie to us and tell us everything is fine (because our new fd wasn't opened early enough). I can't find the original discussion that led to that commit right now, so I'm not exactly sure what scenarios we were thinking about. But I think it would at least be a problem if full_page_writes=off or if you had previously started the server with fsync=off and now wish to switch to fsync=on after completing a bulk load or similar. Recovery can read a page, see that it looks OK, and continue, and then a later fsync() failure can revert that page to an earlier state and now your database is corrupted -- and there's absolutely no way to detect this, because reading the page back gives you the new contents for a while, fsync() doesn't feel obliged to tell you about the error because your fd wasn't opened early enough, and eventually the write can be discarded and you'll revert back to the old page version with no errors ever being reported anywhere.

Another consequence of this behavior is that initdb -S is never reliable, so pg_rewind's use of it doesn't actually fix the problem it was intended to solve. It also means that initdb itself isn't crash-safe, since the data file changes are made by the backend but initdb itself is doing the fsyncs, and initdb has no way of knowing what files the backend is going to create and therefore can't -- even theoretically -- open them first.

What's being presented to us as the API contract that we should expect from buffered I/O is that if you open a file and read() from it, call fsync(), and get no error, the kernel may nevertheless decide that some previous write that it never managed to flush can't be flushed, and then revert the page to the contents it had at some point in the past. That's more or less equivalent to letting a malicious adversary randomly overwrite database pages with plausible-looking but incorrect contents without notice and hoping you can still build a reliable system. You can avoid the problem if you can always open an fd for every file you want to modify before it's written and hold on to it until after it's fsync'd, but that's pretty hard to guarantee in the face of kill -9.

I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.
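
A minimal sketch of the "open the fd before anything writes, hold it until fsync() has been checked" discipline described above. This is illustrative only (invented names, short writes treated as failures for brevity) and assumes the 4.13+ errseq_t behaviour, where an fd that was already open when writeback failed will see the error on its next fsync(); as noted, a kill -9 between the write and the fsync() is exactly the window this cannot cover.

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /*
     * The descriptor is opened before the dirtying write and survives
     * until the flush has been verified, so a writeback failure in between
     * is reported to this fd rather than silently dropped.
     */
    static int
    write_and_sync_with_held_fd(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_RDWR);        /* opened before any write */
        if (fd < 0)
            return -1;

        ssize_t n = write(fd, buf, len);
        if (n < 0 || (size_t) n != len || fsync(fd) != 0)
        {
            int err = errno;
            close(fd);
            errno = err;
            return -1;
        }
        return close(fd);
    }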


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 19:13:14

Hi,

On 2018-04-09 15:02:11 -0400, Robert Haas wrote:

I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.

Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as are some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.

We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.


From:Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Date:2018-04-09 19:22:58

On 04/09/2018 08:29 PM, Mark Dilger wrote:

On Apr 9, 2018, at 10:26 AM, Joshua D. Drake wrote:

We have plenty of YEARS of people not noticing this issue

I disagree. I have noticed this problem, but blamed it on other things. For over five years now, I have had to tell customers not to use thin provisioning, and I have had to add code to postgres to refuse to perform inserts or updates if the disk volume is more than 80% full. I have lost count of the number of customers who are running an older version of the product (because they refuse to upgrade) and come back with complaints that they ran out of disk and now their database is corrupt. All this time, I have been blaming this on virtualization and thin provisioning.

Yeah. There's a big difference between not noticing an issue because it does not happen very often vs. attributing it to something else. If we had the ability to revisit past data corruption cases, we would probably discover a fair number of cases caused by this.

The other thing we probably need to acknowledge is that the environment is changing significantly - things like thin provisioning are likely to get even more common, increasing the incidence of these issues.


From:Peter Geoghegan <pg(at)bowt(dot)ie>
Date:2018-04-09 19:25:33

On Mon, Apr 9, 2018 at 12:13 PM, Andres Freund wrote:

Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as are some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.

+1

We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.

Right. We seem to be implicitly assuming that there is a big difference between a problem in the storage layer that we could in principle detect, but don't, and any other problem in the storage layer. I've read articles claiming that technologies like SMART are not really reliable in a practical sense [1], so it seems to me that there is reason to doubt that this gap is all that big.

That said, I suspect that the problems with running out of disk space are serious practical problems. I have personally scoffed at stories involving Postgres database corruption that gets attributed to running out of disk space. Looks like I was dead wrong.

[1] https://danluu.com/file-consistency/ -- "Filesystem correctness"


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-09 19:26:21

On Mon, Apr 09, 2018 at 04:29:36PM +0100, Greg Stark wrote:

Honestly I don't think there's any way to use the current interface to implement reliable operation. Even that embedded database using a single process and keeping every file open all the time (which means file descriptor limits limit its scalability) can be having silent corruption whenever some other process like a backup program comes along and calls fsync (or even sync?).

That is indeed true (sync would induce fsync on open inodes and clear the error), and that's a nasty bug that apparently went unnoticed for a very long time. Hopefully the errseq_t linux 4.13 fixes deal with at least this issue, but similar fixes need to be adopted by many other kernels (all those that mark failed pages as clean).

I honestly do not expect that keeping around the failed pages will be an acceptable change for most kernels, and as such the recommendation will probably be to coordinate in userspace for the fsync().

What about having buffered IO with implied fsync() atomicity via O_SYNC? This would probably necessitate some helper threads that mask the latency and present an async interface to the rest of PG, but sounds less intrusive than going for DIO.
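
A sketch of the O_SYNC idea, using only the standard open(2) flag (illustrative; how the resulting synchronous latency would be hidden behind helper threads is left out):

    #include <fcntl.h>
    #include <unistd.h>

    /*
     * With O_SYNC, each write() returns only after the data (and required
     * metadata) has reached storage, so an I/O failure surfaces on the
     * write() itself rather than on a later fsync() whose error may have
     * been dropped.  The cost, raised in the reply below, is paying full
     * synchronous latency on every write.
     */
    static int
    open_for_synchronous_writes(const char *path)
    {
        return open(path, O_WRONLY | O_SYNC);
    }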


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 19:29:16

On 2018-04-09 21:26:21 +0200, Anthony Iliopoulos wrote:

What about having buffered IO with implied fsync() atomicity via O_SYNC?

You're kidding, right? We could also just add sleep(30)'s all over the tree, and hope that that'll solve the problem. There's a reason we don't permanently fsync everything. Namely that it'll be way too slow.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 19:37:03

On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos wrote:

I honestly do not expect that keeping around the failed pages will be an acceptable change for most kernels, and as such the recommendation will probably be to coordinate in userspace for the fsync().

Why is that required? You could very well just keep per-inode information about fatal failures that occurred. Report errors until that bit is explicitly cleared. Yes, that keeps some memory around until unmount if nobody clears it. But it's orders of magnitude less, and results in usable semantics.


From:Justin Pryzby <pryzby(at)telsasoft(dot)com>
Date:2018-04-09 19:41:19

On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:

You could make the argument that it's OK to forget if the entire file system goes away. But actually, why is that ok?

I was going to say that it'd be okay to clear the error flag on umount, since any opened files would prevent unmounting; but then I realized we need to consider the case of close()ing all FDs and then opening them later... in another process.

I was going to say that's fine for postgres, since it chdir()s into its basedir, but actually it's not fine for non-default tablespaces...

On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:

notification descriptor open, where the kernel would inject events related to writeback failures of files under watch (potentially enriched to contain info regarding the exact failed pages and the file offset they map to).

For postgres that'd require backend processes to open() a file such that, following its close(), any writeback errors are "signalled" to the checkpointer process...


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-09 19:44:31

On Mon, Apr 09, 2018 at 12:29:16PM -0700, Andres Freund wrote:

On 2018-04-09 21:26:21 +0200, Anthony Iliopoulos wrote:

What about having buffered IO with implied fsync() atomicity via O_SYNC?

You're kidding, right? We could also just add sleep(30)'s all over the tree, and hope that that'll solve the problem. There's a reason we don't permanently fsync everything. Namely that it'll be way too slow.

I am assuming you can apply the same principle of selectively using O_SYNC at times and places that you'd currently actually call fsync().

Also assuming that you'd want to have a backwards-compatible solution for all those kernels that don't keep the pages around, irrespective of future fixes. Short of loading a kernel module and dealing with the problem directly, the only other available options seem to be either O_SYNC, O_DIRECT or ignoring the issue.


From:Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Date:2018-04-09 19:47:44

On 04/09/2018 04:22 PM, Anthony Iliopoulos wrote:

On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:

We already have dirty_bytes and dirty_background_bytes, for example. I don't see why there couldn't be another limit defining how much dirty data to allow before blocking writes altogether. I'm sure it's not that simple, but you get the general idea - do not allow using all available memory because of writeback issues, but don't throw the data away in case it's just a temporary issue.

Sure, there could be knobs for limiting how much memory such "zombie" pages may occupy. Not sure how helpful it would be in the long run since this tends to be highly application-specific, and for something with a large data footprint one would end up tuning this accordingly in a system-wide manner. This has the potential to leave other applications running in the same system with very little memory, in cases where for example original application crashes and never clears the error. Apart from that, further interfaces would need to be provided for actually dealing with the error (again assuming non-transient issues that may not be fixed transparently and that temporary issues are taken care of by lower layers of the stack).

I don't quite see how this is any different from other possible issues when running multiple applications on the same system. One application can generate a lot of dirty data, reaching dirty_bytes and forcing the other applications on the same host to do synchronous writes.

Of course, you might argue that is a temporary condition - it will resolve itself once the dirty pages get written to storage. In case of an I/O issue, it is a permanent impact - it will not resolve itself unless the I/O problem gets fixed.

I'm not sure what interfaces would need to be written. Possibly something that says "drop dirty pages for these files" after the application gets killed, or something. That makes sense, of course.

Well, there seem to be kernels that do exactly that already. At least that's how I understand what this thread says about FreeBSD and Illumos, for example. So it's not an entirely insane design, apparently.

It is reasonable, but even FreeBSD has a big fat comment right there (since 2017), mentioning that there can be no recovery from EIO at the block layer and this needs to be done differently. No idea how an application running on top of either FreeBSD or Illumos would actually recover from this error (and clear it out), other than remounting the fs in order to force dropping of the relevant pages. It does indeed provide a persistent error indication that would allow Pg to simply and reliably panic. But again this does not necessarily play well with other applications that may be using the filesystem reliably at the same time, and are now faced with EIO while their own writes are successfully persisted.

In my experience when you have a persistent I/O error on a device, it likely affects all applications using that device. So unmounting the fs to clear the dirty pages seems like an acceptable solution to me.

I don't see what else the application should do? In a way I'm suggesting applications don't really want to be responsible for recovering (cleanup of dirty pages etc.). We're more than happy to hand that over to the kernel, e.g. because each kernel will do that differently. What we however do want is reliable information about the fsync outcome, which we need to properly manage WAL, checkpoints etc.

Ideally, you'd want a (potentially persistent) indication of error localized to a file region (mapping the corresponding failed writeback pages). NetBSD is already implementing fsync_ranges(), which could be a step in the right direction.

One has to wonder how many applications actually use this correctly, considering PostgreSQL cares about data durability/consistency so much and yet we've been misunderstanding how it works for 20+ years.

I would expect it would be very few, potentially those that have a very simple process model (e.g. embedded DBs that can abort a txn on fsync() EIO). I think that durability is a rather complex cross-layer issue which has been grossly misunderstood similarly in the past (e.g. see [1]). It seems that both the OS and DB communities greatly benefit from a periodic reality check, and I see this as an opportunity for strengthening the IO stack in an end-to-end manner.

Right. What I was getting to is that perhaps the current fsync() behavior is not very practical for building actual applications.


[1] https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf

Thanks. The paper looks interesting.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-09 19:51:12

On Mon, Apr 09, 2018 at 12:37:03PM -0700, Andres Freund wrote:

On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos wrote:

I honestly do not expect that keeping around the failed pages will be an acceptable change for most kernels, and as such the recommendation will probably be to coordinate in userspace for the fsync().

Why is that required? You could very well just keep per-inode information about fatal failures that occurred. Report errors until that bit is explicitly cleared. Yes, that keeps some memory around until unmount if nobody clears it. But it's orders of magnitude less, and results in usable semantics.

As discussed before, I think this could be acceptable, especially if you pair it with an opt-in mechanism (only applications that care to deal with this will have to), and would give it a shot.

Still need a way to deal with all other systems and prior kernel releases that are eating fsync() writeback errors even over sync().


From:Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Date:2018-04-09 19:54:05

On 04/09/2018 09:37 PM, Andres Freund wrote:

On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos wrote:

I honestly do not expect that keeping around the failed pages will be an acceptable change for most kernels, and as such the recommendation will probably be to coordinate in userspace for the fsync().

Why is that required? You could very well just keep per-inode information about fatal failures that occurred. Report errors until that bit is explicitly cleared. Yes, that keeps some memory around until unmount if nobody clears it. But it's orders of magnitude less, and results in usable semantics.

Isn't the expectation that when a fsync call fails, the next one will retry writing the pages in the hope that it succeeds?

Of course, it's also possible to do what you suggested, and simply mark the inode as failed. In which case the next fsync can't possibly retry the writes (e.g. after freeing some space on thin-provisioned system), but we'd get reliable failure mode.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 19:59:34

On 2018-04-09 14:41:19 -0500, Justin Pryzby wrote:

On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:

You could make the argument that it's OK to forget if the entire file system goes away. But actually, why is that ok?

I was going to say that it'd be okay to clear the error flag on umount, since any opened files would prevent unmounting; but then I realized we need to consider the case of close()ing all FDs and then opening them later... in another process.

On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:

notification descriptor open, where the kernel would inject events related to writeback failures of files under watch (potentially enriched to contain info regarding the exact failed pages and the file offset they map to).

For postgres that'd require backend processes to open() a file such that, following its close(), any writeback errors are "signalled" to the checkpointer process...

I don't think that's as hard as some people argued in this thread. We could very well open a pipe in postmaster with the write end open in each subprocess, and the read end open only in checkpointer (and postmaster, but unused there). Whenever closing a file descriptor that was dirtied in the current process, send it over the pipe to the checkpointer. The checkpointer can then receive all those file descriptors (making sure it's not above the limit, fsync()ing and close()ing to make room if necessary). The biggest complication would presumably be to deduplicate the received file descriptors for the same file, without losing track of any errors.

Even better, we could do so via a dedicated worker. That'd quite possibly end up as a performance benefit.

I was going to say that's fine for postgres, since it chdir()s into its basedir, but actually it's not fine for non-default tablespaces...

I think it'd be fair to open PG_VERSION of all created tablespaces. It would require some hooks to signal the checkpointer (or whichever process) to do so when creating one, but it shouldn't be too hard. Some people would complain because they can't do some nasty hacks anymore, but it'd also save people's butts by preventing them from accidentally unmounting.
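
The fd-handing part of that scheme is standard Unix plumbing, with one caveat: descriptors can only be passed between processes over an AF_UNIX socket (e.g. from socketpair()) using SCM_RIGHTS, not over a plain pipe. A minimal sketch of the sending side, with the real bookkeeping (which relation/file, which error) omitted:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send one file descriptor to the peer on a connected AF_UNIX socket. */
    static int
    send_fd(int sock, int fd_to_send)
    {
        char payload = 'F';                       /* at least one data byte */
        struct iovec iov = { .iov_base = &payload, .iov_len = 1 };

        union {
            struct cmsghdr hdr;
            char buf[CMSG_SPACE(sizeof(int))];
        } ctrl;
        memset(&ctrl, 0, sizeof(ctrl));

        struct msghdr msg = { 0 };
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = ctrl.buf;
        msg.msg_controllen = sizeof(ctrl.buf);

        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

        return (sendmsg(sock, &msg, 0) < 0) ? -1 : 0;
    }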


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 20:04:20

Hi,

On 2018-04-09 21:54:05 +0200, Tomas Vondra wrote:

Isn't the expectation that when a fsync call fails, the next one will retry writing the pages in the hope that it succeeds?

Some people expect that, I personally don't think it's a useful expectation.

We should just deal with this by crash-recovery. The big problem I see is that you always need to keep a file descriptor open for pretty much any file written to, inside and outside of postgres, to be guaranteed to see errors. And that'd solve that. Even if retrying would work, I'd advocate for that (I've done so in the past, and I've written code in pg that panics on fsync failure...).

What we'd need to do however is to clear that bit during crash recovery... Which is interesting from a policy perspective. Could be that other apps wouldn't want that.

I also wonder if we couldn't just read each relevant mounted filesystem's errseq value somewhere. Whenever the checkpointer notices, before finishing a checkpoint, that it has changed, do a crash restart.
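
No such interface existed on the kernels being discussed, but for what it's worth, later kernels moved in roughly this direction: since Linux 5.8, syncfs() reports writeback errors that occurred on the filesystem since the fd was opened, which is approximately the per-filesystem errseq check wished for above. A sketch under that assumption only:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <unistd.h>

    /*
     * Assumes Linux 5.8+ semantics: a checkpointer could hold one fd per
     * relevant mount point and treat a failure here as "something under
     * this filesystem did not persist", forcing crash recovery instead of
     * completing the checkpoint.
     * Returns 1 if a writeback error was reported, 0 if clean, -1 otherwise.
     */
    static int
    filesystem_had_writeback_error(int mount_dir_fd)
    {
        if (syncfs(mount_dir_fd) == 0)
            return 0;
        return (errno == EIO || errno == ENOSPC) ? 1 : -1;
    }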


From:Mark Dilger <hornschnorter(at)gmail(dot)com>
Date:2018-04-09 20:25:54

On Apr 9, 2018, at 12:13 PM, Andres Freund wrote:

Hi,

On 2018-04-09 15:02:11 -0400, Robert Haas wrote:

I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.

Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as are some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.

We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.

I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted. That seems a much bigger problem than merely having the master become corrupted in some unrecoverable way. It is a long standing expectation that serious hardware problems on the master can result in the master needing to be replaced. But there has not been an expectation that the one or more standby servers would be taken down along with the master, leaving all copies of the database unusable. If this bug corrupts the standby servers, too, then it is a whole different class of problem than the one folks have come to expect.

Your comment reads as if this is a problem isolated to whichever server has the problem, and will not get propagated to other servers. Am I reading that right?

Can anybody clarify this for non-core-hacker folks following along at home?


From:Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Date:2018-04-09 20:30:00

On 04/09/2018 10:04 PM, Andres Freund wrote:

Hi,

On 2018-04-09 21:54:05 +0200, Tomas Vondra wrote:

Isn't the expectation that when a fsync call fails, the next one will retry writing the pages in the hope that it succeeds?

Some people expect that, I personally don't think it's a useful expectation.

Maybe. I'd certainly prefer automated recovery from temporary I/O issues (like a full disk on thin provisioning) without the database crashing and restarting. But I'm not sure it's worth the effort.

And most importantly, it's rather delusional to think the kernel developers are going to be enthusiastic about that approach ...

We should just deal with this by crash-recovery. The big problem I see is that you always need to keep a file descriptor open for pretty much any file written to, inside and outside of postgres, to be guaranteed to see errors. And that'd solve that. Even if retrying would work, I'd advocate for that (I've done so in the past, and I've written code in pg that panics on fsync failure...).

Sure. And it's likely way less invasive from kernel perspective.

What we'd need to do however is to clear that bit during crash recovery... Which is interesting from a policy perspective. Could be that other apps wouldn't want that.

IMHO it'd be enough if a remount clears it.

I also wonder if we couldn't just read each relevant mounted filesystem's errseq value somewhere. Whenever the checkpointer notices, before finishing a checkpoint, that it has changed, do a crash restart.

Hmmmm, that's an interesting idea, and it's about the only thing that would help us on older kernels. There's a wb_err in address_space, but that's at the inode level. Not sure if there's something at the fs level.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 20:34:15

Hi,

On 2018-04-09 13:25:54 -0700, Mark Dilger wrote:

I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted.

I don't see that as a real problem here. For one, the problematic scenarios shouldn't readily apply; for another, WAL is checksummed.

There's the problem that a new basebackup would potentially become corrupted however. And similarly pg_rewind.

Note that I'm not saying that we and/or linux shouldn't change anything. Just that the apocalypse isn't here.

Your comment reads as if this is a problem isolated to whichever server has the problem, and will not get propagated to other servers. Am I reading that right?

I think that's basically right. There's cases where corruption could get propagated, but they're not straightforward.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 20:37:31

Hi,

On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:

Maybe. I'd certainly prefer automated recovery from temporary I/O issues (like a full disk on thin provisioning) without the database crashing and restarting. But I'm not sure it's worth the effort.

Oh, I agree on that one. But that's more a question of how we force the kernel's hand on allocating disk space. In most cases the kernel allocates the disk space immediately, even if delayed allocation is in effect. For the cases where that's not the case (if there are current ones, rather than just past bugs), we should be able to make sure that's not an issue by pre-zeroing the data and/or using fallocate.
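A minimal sketch of the pre-allocation idea mentioned here, using posix_fallocate() so that ENOSPC surfaces synchronously when a segment is extended rather than later during writeback. The path, segment size, and function name are made up for illustration and are not PostgreSQL's actual code.

    /*
     * Sketch only: reserve the blocks for a new segment up front so that
     * running out of space is reported here, not as a later writeback error.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define SEGMENT_SIZE (16 * 1024 * 1024)     /* illustrative segment size */

    static int extend_segment(const char *path)
    {
        int fd = open(path, O_CREAT | O_WRONLY, 0600);
        if (fd < 0)
        {
            perror("open");
            return -1;
        }

        /* posix_fallocate() returns an errno value directly, not -1/errno. */
        int rc = posix_fallocate(fd, 0, SEGMENT_SIZE);
        if (rc != 0)
        {
            fprintf(stderr, "could not reserve %d bytes for %s: %s\n",
                    SEGMENT_SIZE, path, strerror(rc));
            close(fd);
            return -1;          /* ENOSPC shows up here, before any write */
        }

        close(fd);
        return 0;
    }

    int main(void)
    {
        return extend_segment("segment.0") == 0 ? 0 : 1;
    }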


From:Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Date:2018-04-09 20:43:03

On 04/09/2018 10:25 PM, Mark Dilger wrote:

On Apr 9, 2018, at 12:13 PM, Andres Freund wrote:

Hi,

On 2018-04-09 15:02:11 -0400, Robert Haas wrote:

I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.

Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as are some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near-irresolvable corner case than anything else.

We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.

I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted. That seems a much bigger problem than merely having the master become corrupted in some unrecoverable way. It is a long standing expectation that serious hardware problems on the master can result in the master needing to be replaced. But there has not been an expectation that the one or more standby servers would be taken down along with the master, leaving all copies of the database unusable. If this bug corrupts the standby servers, too, then it is a whole different class of problem than the one folks have come to expect.

Your comment reads as if this is a problem isolated to whichever server has the problem, and will not get propagated to other servers. Am I reading that right?

Can anybody clarify this for non-core-hacker folks following along at home?

That's a good question. I don't see any guarantee it'd be isolated to the master node. Consider this example:

(0) checkpoint happens on the primary

(1) a page gets modified, a full-page gets written to WAL

(2) the page is written out to page cache

(3) writeback of that page fails (and gets discarded)

(4) we attempt to modify the page again, but we read the stale version

(5) we modify the stale version, writing the change to WAL

The standby will get the full-page, and then a WAL from the stale page version. That doesn't seem like a story with a happy end, I guess. But I might be easily missing some protection built into the WAL ...


From:Mark Dilger <hornschnorter(at)gmail(dot)com>
Date:2018-04-09 20:55:29

On Apr 9, 2018, at 1:43 PM, Tomas Vondra wrote:

On 04/09/2018 10:25 PM, Mark Dilger wrote:

On Apr 9, 2018, at 12:13 PM, Andres Freund wrote:

Hi,

On 2018-04-09 15:02:11 -0400, Robert Haas wrote:

I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.

Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as are some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near-irresolvable corner case than anything else.

We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.

I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted. That seems a much bigger problem than merely having the master become corrupted in some unrecoverable way. It is a long standing expectation that serious hardware problems on the master can result in the master needing to be replaced. But there has not been an expectation that the one or more standby servers would be taken down along with the master, leaving all copies of the database unusable. If this bug corrupts the standby servers, too, then it is a whole different class of problem than the one folks have come to expect.

Your comment reads as if this is a problem isolated to whichever server has the problem, and will not get propagated to other servers. Am I reading that right?

Can anybody clarify this for non-core-hacker folks following along at home?

That's a good question. I don't see any guarantee it'd be isolated to the master node. Consider this example:

(0) checkpoint happens on the primary

(1) a page gets modified, a full-page gets written to WAL

(2) the page is written out to page cache

(3) writeback of that page fails (and gets discarded)

(4) we attempt to modify the page again, but we read the stale version

(5) we modify the stale version, writing the change to WAL

The standby will get the full-page, and then a WAL from the stale page version. That doesn't seem like a story with a happy end, I guess. But I might be easily missing some protection built into the WAL ...

I can also imagine a master and standby that are similarly provisioned, and thus hit an out of disk error at around the same time, resulting in corruption on both, even if not the same corruption. When choosing to have one standby, or two standbys, or ten standbys, one needs to be able to assume a certain amount of statistical independence between failures on one server and failures on another. If they are tightly correlated dependent variables, then the conclusion that the probability of all nodes failing simultaneously is vanishingly small becomes invalid.

From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 21:08:29

Hi,

On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:

I can also imagine a master and standby that are similarly provisioned, and thus hit an out of disk error at around the same time, resulting in corruption on both, even if not the same corruption.

I think it's a grave mistake conflating ENOSPC issues (which we should solve by making sure there's always enough space pre-allocated), with EIO type errors. The problem is different, the solution is different.


From:Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Date:2018-04-09 21:25:52

On 04/09/2018 11:08 PM, Andres Freund wrote:

Hi,

On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:

I can also imagine a master and standby that are similarly provisioned, and thus hit an out of disk error at around the same time, resulting in corruption on both, even if not the same corruption.

I think it's a grave mistake conflating ENOSPC issues (which we should solve by making sure there's always enough space pre-allocated), with EIO type errors. The problem is different, the solution is different.

In any case, that certainly does not count as data corruption spreading from the master to standby.


From:Mark Dilger <hornschnorter(at)gmail(dot)com>
Date:2018-04-09 21:33:29

On Apr 9, 2018, at 2:25 PM, Tomas Vondra wrote:

On 04/09/2018 11:08 PM, Andres Freund wrote:

Hi,

On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:

I can also imagine a master and standby that are similarly provisioned, and thus hit an out of disk error at around the same time, resulting in corruption on both, even if not the same corruption.

I think it's a grave mistake conflating ENOSPC issues (which we should solve by making sure there's always enough space pre-allocated), with EIO type errors. The problem is different, the solution is different.

I'm happy to take your word for that.

In any case, that certainly does not count as data corruption spreading from the master to standby.

Maybe not from the point of view of somebody looking at the code. But a user might see it differently. If the data being loaded into the master and getting replicated to the standby "causes" both to get corrupt, then it seems like corruption spreading. I put "causes" in quotes because there is some argument to be made about "correlation does not prove cause" and so forth, but it still feels like causation from an arm's-length perspective. If there is a pattern of standby servers tending to fail more often right around the time that the master fails, you'll have a hard time comforting users, "hey, it's not technically causation." If loading data into the master causes the master to hit ENOSPC, and replicating that data to the standby causes the standby to hit ENOSPC, and if the bug around ENOSPC has not been fixed, then this looks like corruption spreading.

I'm certainly planning on taking a hard look at the disk allocation on my standby servers right soon now.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-09 22:33:16

On Tue, Apr 10, 2018 at 2:22 AM, Anthony Iliopoulos wrote:

On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:

Well, there seem to be kernels that do exactly that already. At least that's how I understand what this thread says about FreeBSD and Illumos, for example. So it's not an entirely insane design, apparently.

It is reasonable, but even FreeBSD has a big fat comment right there (since 2017), mentioning that there can be no recovery from EIO at the block layer and this needs to be done differently. No idea how an application running on top of either FreeBSD or Illumos would actually recover from this error (and clear it out), other than remounting the fs in order to force dropping of relevant pages. It does indeed provide a persistent error indication that would allow Pg to simply and reliably panic. But again this does not necessarily play well with other applications that may be using the filesystem reliably at the same time, and are now faced with EIO while their own writes are persisted successfully.

Right. For anyone interested, here is the change you mentioned, and an interesting one that came a bit earlier last year:

Retrying may well be futile, but at least future fsync() calls won't report success bogusly. There may of course be more space-efficient ways to represent that state as the comment implies, while never lying to the user -- perhaps involving filesystem level or (pinned) inode level errors that stop all writes until unmounted. Something tells me they won't resort to flakey fsync() error reporting.

I wonder if anyone can tell us what Windows, AIX and HPUX do here.

[1] https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf

Very interesting, thanks.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-10 00:32:20

On Tue, Apr 10, 2018 at 10:33 AM, Thomas Munro wrote:

I wonder if anyone can tell us what Windows, AIX and HPUX do here.

I created a wiki page to track what we know (or think we know) about fsync() on various operating systems:

https://wiki.postgresql.org/wiki/Fsync_Errors

If anyone has more information or sees mistakes, please go ahead and edit it.


From:Andreas Karlsson <andreas(at)proxel(dot)se>
Date:2018-04-10 00:41:10

On 04/09/2018 02:16 PM, Craig Ringer wrote:

I'd like a middle ground where the kernel lets us register our interest and tells us if it lost something, without us having to keep eight million FDs open for some long period. "Tell us about anything that happens under pgdata/" or an inotify-style per-directory-registration option. I'd even say that's ideal.

Could there be a risk of a race condition here where fsync incorrectly returns success before we get the notification that something went wrong?


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-10 01:44:59

On 10 April 2018 at 03:59, Andres Freund wrote:

On 2018-04-09 14:41:19 -0500, Justin Pryzby wrote:

On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:

You could make the argument that it's OK to forget if the entire file system goes away. But actually, why is that ok?

I was going to say that it'd be okay to clear the error flag on umount, since any opened files would prevent unmounting; but then I realized we need to consider the case of close()ing all FDs then opening them later... in another process.

On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:

notification descriptor open, where the kernel would inject events related to writeback failures of files under watch (potentially enriched to contain info regarding the exact failed pages and the file offset they map to).

For postgres that'd require backend processes to open() a file such that, following its close(), any writeback errors are "signalled" to the checkpointer process...

I don't think that's as hard as some people argued in this thread. We could very well open a pipe in postmaster with the write end open in each subprocess, and the read end open only in checkpointer (and postmaster, but unused there). Whenever closing a file descriptor that was dirtied in the current process, send it over the pipe to the checkpointer. The checkpointer then can receive all those file descriptors (making sure it stays below the fd limit, fsync()ing and close()ing to make room if necessary). The biggest complication would presumably be to deduplicate the received file descriptors for the same file, without losing track of any errors.

Yep. That'd be a cheaper way to do it, though it wouldn't work on Windows. Though we don't know how Windows behaves here at all yet.

Prior discussion upthread had the checkpointer open()ing a file at the same time as a backend, before the backend writes to it. But passing the fd when the backend is done with it would be better.

We'd need a way to dup() the fd and pass it back to a backend when it needed to reopen it sometimes, or just make sure to keep the oldest copy of the fd when a backend reopens multiple times, but that's no biggie.

We'd still have to fsync() out early in the checkpointer if we ran out of space in our FD list, and initscripts would need to change our ulimit or we'd have to do it ourselves in the checkpointer. But neither seems insurmountable.

FWIW, I agree that this is a corner case, but it's getting to be a pretty big corner with the spread of overcommitted, deduplicating SANs, cloud storage, etc. Not all I/O errors indicate permanent hardware faults, disk failures, etc, as I outlined earlier. I'm very curious to know what AWS EBS's error semantics are, and other cloud network block stores. (I posted on Amazon forums https://forums.aws.amazon.com/thread.jspa?threadID=279274&tstart=0 but nothing so far).

I'm also not particularly inclined to trust that all file systems will always reliably reserve space without having some cases where they'll fail writeback on space exhaustion.

So we don't need to panic and freak out, but it's worth looking at the direction the storage world is moving in, and whether this will become a bigger issue over time.
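For readers following along, here is a rough sketch of the descriptor-forwarding scheme discussed a few paragraphs up. Passing an open fd between processes needs SCM_RIGHTS ancillary data over a Unix-domain socket (e.g. a socketpair() created in the postmaster), so the "pipe" would really be a socketpair; the function names are invented, and this is not the interface PostgreSQL ended up adopting.

    /*
     * Sketch only: the send/receive halves of handing an open descriptor
     * from a backend to the checkpointer, which would fsync() it at the
     * next checkpoint.
     */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Backend side: hand an open descriptor to the checkpointer. */
    static int send_fd(int sock, int fd_to_send)
    {
        struct msghdr msg = {0};
        struct iovec iov;
        char payload = 'F';                         /* one byte of real data */
        char cmsgbuf[CMSG_SPACE(sizeof(int))];

        iov.iov_base = &payload;
        iov.iov_len = 1;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cmsgbuf;
        msg.msg_controllen = sizeof(cmsgbuf);

        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

        return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
    }

    /* Checkpointer side: receive a descriptor to fsync() at checkpoint time. */
    static int recv_fd(int sock)
    {
        struct msghdr msg = {0};
        struct iovec iov;
        char payload;
        char cmsgbuf[CMSG_SPACE(sizeof(int))];

        iov.iov_base = &payload;
        iov.iov_len = 1;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cmsgbuf;
        msg.msg_controllen = sizeof(cmsgbuf);

        if (recvmsg(sock, &msg, 0) <= 0)
            return -1;

        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        if (cmsg == NULL || cmsg->cmsg_type != SCM_RIGHTS)
            return -1;

        int fd;
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
        return fd;
    }

    int main(void)
    {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
            return 1;
        /* Round-trip a descriptor within one process just to show the calls;
         * in PostgreSQL this would cross the backend/checkpointer boundary. */
        if (send_fd(sv[0], STDOUT_FILENO) != 0)
            return 1;
        int fd = recv_fd(sv[1]);
        if (fd < 0)
            return 1;
        write(fd, "descriptor passed\n", 18);
        return 0;
    }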


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-10 01:52:21

On Tue, Apr 10, 2018 at 1:44 PM, Craig Ringer wrote:

On 10 April 2018 at 03:59, Andres Freund wrote:

I don't think that's as hard as some people argued in this thread. We could very well open a pipe in postmaster with the write end open in each subprocess, and the read end open only in checkpointer (and postmaster, but unused there). Whenever closing a file descriptor that was dirtied in the current process, send it over the pipe to the checkpointer. The checkpointer then can receive all those file descriptors (making sure it stays below the fd limit, fsync()ing and close()ing to make room if necessary). The biggest complication would presumably be to deduplicate the received file descriptors for the same file, without losing track of any errors.

Yep. That'd be a cheaper way to do it, though it wouldn't work on Windows. Though we don't know how Windows behaves here at all yet.

Prior discussion upthread had the checkpointer open()ing a file at the same time as a backend, before the backend writes to it. But passing the fd when the backend is done with it would be better.

How would that interlock with concurrent checkpoints?

I can see how to make that work if the share-fd-or-fsync-now logic happens in smgrwrite() when called by FlushBuffer() while you hold io_in_progress, but not if you defer it to some random time later.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-10 01:54:30

On 10 April 2018 at 04:25, Mark Dilger wrote:

I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted.

Yes, it can, but not directly through the first error.

What can happen is that we think a block got written when it didn't.

If our in memory state diverges from our on disk state, we can make subsequent WAL writes based on that wrong information. But that's actually OK, since the standby will have replayed the original WAL correctly.

I think the only time we'd run into trouble is if we evict the good (but not written out) data from s_b and the fs buffer cache, then later read in the old version of a block we failed to overwrite. Data checksums (if enabled) might catch it unless the write left the whole block stale. In that case we might generate a full page write with the stale block and propagate that over WAL to the standby.

So I'd say standbys are relatively safe - very safe if the issue is caught promptly, and less so over time. But AFAICS WAL-based replication (physical or logical) is not a perfect defense for this.

However, remember, if your storage system is free of any sort of overprovisioning, is on a non-network file system, and doesn't use multipath (or sets it up right) this issue is exceptionally unlikely to affect you.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-10 01:59:03

On 10 April 2018 at 04:37, Andres Freund wrote:

Hi,

On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:

Maybe. I'd certainly prefer automated recovery from temporary I/O issues (like a full disk on thin provisioning) without the database crashing and restarting. But I'm not sure it's worth the effort.

Oh, I agree on that one. But that's more a question of how we force the kernel's hand on allocating disk space. In most cases the kernel allocates the disk space immediately, even if delayed allocation is in effect. For the cases where that's not the case (if there are current ones, rather than just past bugs), we should be able to make sure that's not an issue by pre-zeroing the data and/or using fallocate.

Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.

EXT4 and XFS don't allocate until later, doing so by performing actual writes to FS metadata, initializing disk blocks, etc. So we won't notice errors that are only detectable at the actual time of allocation, like thin-provisioning problems, until after write() returns, and we face the same writeback issues.

So I reckon you're safe from space-related issues if you're not on NFS (and whyyy would you do that?) and not thinly provisioned. I'm sure there are other corner cases, but I don't see any reason to expect space-exhaustion-related corruption problems on a sensible FS backed by a sensible block device. I haven't tested things like quotas, verified how reliable space reservation is under concurrency, etc as yet.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-10 02:00:59

On April 9, 2018 6:59:03 PM PDT, Craig Ringer wrote:

On 10 April 2018 at 04:37, Andres Freund wrote:

Hi,

On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:

Maybe. I'd certainly prefer automated recovery from temporary I/O issues (like a full disk on thin provisioning) without the database crashing and restarting. But I'm not sure it's worth the effort.

Oh, I agree on that one. But that's more a question of how we force the kernel's hand on allocating disk space. In most cases the kernel allocates the disk space immediately, even if delayed allocation is in effect. For the cases where that's not the case (if there are current ones, rather than just past bugs), we should be able to make sure that's not an issue by pre-zeroing the data and/or using fallocate.

Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.

EXT4 and XFS don't allocate until later, doing so by performing actual writes to FS metadata, initializing disk blocks, etc. So we won't notice errors that are only detectable at the actual time of allocation, like thin-provisioning problems, until after write() returns, and we face the same writeback issues.

So I reckon you're safe from space-related issues if you're not on NFS (and whyyy would you do that?) and not thinly provisioned. I'm sure there are other corner cases, but I don't see any reason to expect space-exhaustion-related corruption problems on a sensible FS backed by a sensible block device. I haven't tested things like quotas, verified how reliable space reservation is under concurrency, etc as yet.

How's that not solved by pre-zeroing and/or fallocate as I suggested above?


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-10 02:02:48

On 10 April 2018 at 08:41, Andreas Karlsson wrote:

On 04/09/2018 02:16 PM, Craig Ringer wrote:

I'd like a middle ground where the kernel lets us register our interest and tells us if it lost something, without us having to keep eight million FDs open for some long period. "Tell us about anything that happens under pgdata/" or an inotify-style per-directory-registration option. I'd even say that's ideal.

Could there be a risk of a race condition here where fsync incorrectly returns success before we get the notification that something went wrong?

We'd examine the notification queue only once all our checkpoint fsync()s had succeeded, and before we updated the control file to advance the redo position.

I'm intrigued by the suggestion upthread of using a kprobe or similar to achieve this. It's a horrifying unportable hack that'd make kernel people cry, and I don't know if we have any way to flush buffered probe data to be sure we really get the news in time, but it's a cool idea too.


From:Michael Paquier <michael(at)paquier(dot)xyz>
Date:2018-04-10 05:04:13

On Mon, Apr 09, 2018 at 03:02:11PM -0400, Robert Haas wrote:

Another consequence of this behavior is that initdb -S is never reliable, so pg_rewind's use of it doesn't actually fix the problem it was intended to solve. It also means that initdb itself isn't crash-safe, since the data file changes are made by the backend but initdb itself is doing the fsyncs, and initdb has no way of knowing what files the backend is going to create and therefore can't -- even theoretically -- open them first.

And pg_basebackup. And pg_dump. And pg_dumpall. Anything using initdb -S or fsync_pgdata would enter those waters.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-10 05:37:19

On 10 April 2018 at 13:04, Michael Paquier wrote:

On Mon, Apr 09, 2018 at 03:02:11PM -0400, Robert Haas wrote:

Another consequence of this behavior is that initdb -S is never reliable, so pg_rewind's use of it doesn't actually fix the problem it was intended to solve. It also means that initdb itself isn't crash-safe, since the data file changes are made by the backend but initdb itself is doing the fsyncs, and initdb has no way of knowing what files the backend is going to create and therefore can't -- even theoretically -- open them first.

And pg_basebackup. And pg_dump. And pg_dumpall. Anything using initdb -S or fsync_pgdata would enter those waters.

... but only if they hit an I/O error or they're on a FS that doesn't reserve space and hit ENOSPC.

It still does 99% of the job. It still flushes all buffers to persistent storage and maintains write ordering. It may not detect and report failures to the user the way we'd expect it to, yes, and that's not great. But it's hardly throw-up-our-hands-and-give-up territory either. Also, at least for initdb, we can make initdb fsync() its own files before close(). Annoying but hardly the end of the world.
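A sketch of the "fsync our own files before close()" idea for frontend tools: flush through the descriptor that did the writing, check the result, and also fsync the parent directory so the new entry is durable. The helper name and file name below are invented; PostgreSQL's frontend tools have their own fsync helpers in file_utils.c.

    /*
     * Sketch only: write a file durably and report errors on the same
     * descriptor that wrote the data.
     */
    #include <errno.h>
    #include <fcntl.h>
    #include <libgen.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static int durable_write_file(const char *path, const void *data, size_t len)
    {
        int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
        if (fd < 0)
            return -1;

        /* The error is seen on the same descriptor that did the writing. */
        if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0)
        {
            int save_errno = errno;
            close(fd);
            errno = save_errno;
            return -1;
        }
        close(fd);

        /* Also fsync the parent directory so the directory entry is durable. */
        char *copy = strdup(path);
        if (copy == NULL)
            return -1;
        int dfd = open(dirname(copy), O_RDONLY);
        free(copy);
        if (dfd < 0)
            return -1;
        int rc = fsync(dfd);
        close(dfd);
        return rc;
    }

    int main(void)
    {
        const char contents[] = "example contents\n";
        return durable_write_file("example_file", contents,
                                  sizeof(contents) - 1) == 0 ? 0 : 1;
    }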


From:Michael Paquier <michael(at)paquier(dot)xyz>
Date:2018-04-10 06:10:21

On Tue, Apr 10, 2018 at 01:37:19PM +0800, Craig Ringer wrote:

On 10 April 2018 at 13:04, Michael Paquier wrote:

And pg_basebackup. And pg_dump. And pg_dumpall. Anything using initdb -S or fsync_pgdata would enter those waters.

... but only if they hit an I/O error or they're on a FS that doesn't reserve space and hit ENOSPC.

Sure.

It still does 99% of the job. It still flushes all buffers to persistent storage and maintains write ordering. It may not detect and report failures to the user the way we'd expect it to, yes, and that's not great. But it's hardly throw-up-our-hands-and-give-up territory either. Also, at least for initdb, we can make initdb fsync() its own files before close(). Annoying but hardly the end of the world.

Well, I think that there is room for improving the reporting of failures in file_utils.c for frontends, or at worst having an exit() for any kind of critical failure, equivalent to a PANIC.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-10 12:15:15

On 10 April 2018 at 14:10, Michael Paquier wrote:

Well, I think that there is room for improving the reporting of failures in file_utils.c for frontends, or at worst having an exit() for any kind of critical failure, equivalent to a PANIC.

Yup.

In the meantime, speaking of PANIC, here's a first-cut patch to make Pg panic on fsync() failures. I need to do some closer review and testing, but it's presented here for anyone interested.

I intentionally left some failures as ERROR not PANIC, where the entire operation is done as a unit, and an ERROR will cause us to retry the whole thing.

For example, when we fsync() a temp file before we move it into place, there's no point panicking on failure, because we'll discard the temp file on ERROR and retry the whole thing.

I've verified that it works as expected with some modifications to the test tool I've been using (pushed).

The main downside is that if we panic in redo, we don't try again. We throw our toys and shut down. But arguably if we get the same I/O error again in redo, that's the right thing to do anyway, and quite likely safer than continuing to ERROR on checkpoints indefinitely.

Patch attached.

To be clear, this patch only deals with the issue of us retrying fsyncs when it turns out to be unsafe. This does NOT address any of the issues where we won't find out about writeback errors at all.

Attachment: v1-0001-PANIC-when-we-detect-a-possible-fsync-I-O-error-i.patch (text/x-patch, 10.3 KB)
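The shape of the policy in the patch, reduced to a few lines of plain C. This is not the patch itself, which works in terms of PostgreSQL's ereport()/PANIC machinery; the file name and helper name here are made up.

    /*
     * Sketch only: don't retry a failed fsync(); crash and rely on WAL
     * replay during crash recovery instead.
     */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static void fsync_or_die(int fd, const char *path)
    {
        if (fsync(fd) != 0)
        {
            /*
             * Retrying could report success even though the kernel has
             * already dropped the dirty pages, so abort and let crash
             * recovery replay the WAL instead.
             */
            fprintf(stderr, "PANIC: could not fsync file \"%s\": %s\n",
                    path, strerror(errno));
            abort();
        }
    }

    int main(void)
    {
        int fd = open("datafile", O_CREAT | O_WRONLY, 0600);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }
        (void) write(fd, "x", 1);
        fsync_or_die(fd, "datafile");
        close(fd);
        return 0;
    }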


From:Robert Haas <robertmhaas(at)gmail(dot)com>
Date:2018-04-10 15:15:46

On Mon, Apr 9, 2018 at 3:13 PM, Andres Freund wrote:

Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as are some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near-irresolvable corner case than anything else.

Well, I admit that I wasn't entirely serious about that email, but I wasn't entirely not-serious either. If you can't reliably find out whether the contents of the file on disk are the same as the contents that the kernel is giving you when you call read(), then you are going to have a heck of a time building a reliable system. If the kernel developers are determined to insist on these semantics (and, admittedly, I don't know whether that's the case - I've only read Anthony's remarks), then I don't really see what we can do except give up on buffered I/O (or on Linux).

We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.

I think that reliable error reporting is more than "nice" -- I think it's essential. The only argument for the current Linux behavior that has been so far advanced on this thread, at least as far as I can see, is that if it kept retrying the buffers forever, it would be pointless and might run the machine out of memory, so we might as well discard them. But previous comments have already illustrated that the kernel is not really up against a wall there -- it could put individual inodes into a permanent failure state when it discards their dirty data, as you suggested, or it could do what others have suggested, and what I think is better, which is to put the whole filesystem into a permanent failure state that can be cleared by remounting the FS. That could be done on an as-needed basis -- if the number of dirty buffers you're holding onto for some filesystem becomes too large, put the filesystem into infinite-fail mode and discard them all. That behavior would be pretty easy for administrators to understand and would resolve the entire problem here provided that no PostgreSQL processes survived the eventual remount.

I also don't really know what we mean by an "unresolvable" error. If the drive is beyond all hope, then it doesn't really make sense to talk about whether the database stored on it is corrupt. In general we can't be sure that we'll even get an error - e.g. the system could be idle and the drive could be on fire. Maybe this is the case you meant by "it'd be nice if we could report it reliably". But at least in my experience, that's typically not what's going on. You get some I/O errors and so you remount the filesystem, or reboot, or rebuild the array, or ... something. And then the errors go away and, at that point, you want to run recovery and continue using your database. In this scenario, it matters quite a bit what the error reporting was like during the period when failures were occurring. In particular, if the database was allowed to think that it had successfully checkpointed when it didn't, you're going to start recovery from the wrong place.

I'm going to shut up now because I'm telling you things that you obviously already know, but this doesn't sound like a "near irresolvable corner case". When the storage goes bonkers, either PostgreSQL and the kernel can interact in such a way that a checkpoint can succeed without all of the relevant data getting persisted, or they don't. It sounds like right now they do, and I'm not really clear that we have a reasonable idea how to fix that. It does not sound like a PANIC is sufficient.


From:Robert Haas <robertmhaas(at)gmail(dot)com>
Date:2018-04-10 15:28:07

On Tue, Apr 10, 2018 at 1:37 AM, Craig Ringer wrote:

... but only if they hit an I/O error or they're on a FS that doesn't reserve space and hit ENOSPC.

It still does 99% of the job. It still flushes all buffers to persistent storage and maintains write ordering. It may not detect and report failures to the user the way we'd expect it to, yes, and that's not great. But it's hardly throw-up-our-hands-and-give-up territory either. Also, at least for initdb, we can make initdb fsync() its own files before close(). Annoying but hardly the end of the world.

I think we'd need every child postgres process started by initdb to do that individually, which I suspect would slow down initdb quite a lot. Now admittedly for anybody other than a PostgreSQL developer that's only a minor issue, and our regression tests mostly run with fsync=off anyway. But I have a strong suspicion that our assumptions about how fsync() reports errors are baked into an awful lot of parts of the system, and by the time we get unbaking them I think it's going to be really surprising if we haven't done real harm to overall system performance.

BTW, I took a look at the MariaDB source code to see whether they've got this problem too and it sure looks like they do. os_file_fsync_posix() retries the fsync in a loop with a 0.2 second sleep after each retry. It warns after 100 failures and fails an assertion after 1000 failures. It is hard to understand why they would have written the code this way unless they expect errors reported by fsync() to continue being reported until the underlying condition is corrected. But, it looks like they wouldn't have the problem that we do with trying to reopen files to fsync() them later -- I spot checked a few places where this code is invoked and in all of those it looks like the file is already expected to be open.
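For illustration, here is what the retry-loop strategy described above looks like in plain C, written from the description rather than taken from MariaDB's actual os_file_fsync_posix(). As the rest of the thread argues, on Linux a retried fsync() can report success after the kernel has already dropped the dirty pages, which is exactly the behavior this strategy assumes away.

    /*
     * Sketch only: retry fsync() with a short sleep, warn periodically,
     * and eventually give up hard.
     */
    #include <assert.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void fsync_with_retries(int fd)
    {
        int failures = 0;

        for (;;)
        {
            if (fsync(fd) == 0)
                return;

            failures++;
            if (failures % 100 == 0)
                fprintf(stderr, "warning: fsync still failing after %d attempts: %s\n",
                        failures, strerror(errno));
            assert(failures < 1000);        /* eventually give up hard */

            usleep(200 * 1000);             /* 0.2 second pause between attempts */
        }
    }

    int main(void)
    {
        int fd = open("datafile", O_CREAT | O_WRONLY, 0600);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }
        (void) write(fd, "x", 1);
        fsync_with_retries(fd);
        close(fd);
        return 0;
    }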


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-10 15:40:05

Hi Robert,

On Tue, Apr 10, 2018 at 11:15:46AM -0400, Robert Haas wrote:

On Mon, Apr 9, 2018 at 3:13 PM, Andres Freund wrote:

Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as are some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near-irresolvable corner case than anything else.

Well, I admit that I wasn't entirely serious about that email, but I wasn't entirely not-serious either. If you can't reliably find out whether the contents of the file on disk are the same as the contents that the kernel is giving you when you call read(), then you are going to have a heck of a time building a reliable system. If the kernel developers are determined to insist on these semantics (and, admittedly, I don't know whether that's the case - I've only read Anthony's remarks), then I don't really see what we can do except give up on buffered I/O (or on Linux).

I think it would be interesting to get in touch with some of the respective linux kernel maintainers and open up this topic for more detailed discussions. LSF/MM'18 is upcoming and it would have been the perfect opportunity, but it's past the CFP deadline. It may still be worth contacting the organizers to bring forward the issue, and see if there is a chance to have someone from Pg invited for further discussions.


From:Greg Stark <stark(at)mit(dot)edu>
Date:2018-04-10 16:38:27

On 9 April 2018 at 11:50, Anthony Iliopoulos wrote:

On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote:

On 8 April 2018 at 22:47, Anthony Iliopoulos wrote:

To make things a bit simpler, let us focus on EIO for the moment. The contract between the block layer and the filesystem layer is assumed to be that of, when an EIO is propagated up to the fs, then you may assume that all possibilities for recovering have been exhausted in lower layers of the stack.

Well Postgres is using the filesystem. The interface between the block layer and the filesystem may indeed need to be more complex, I wouldn't know.

But I don't think "all possibilities" is a very useful concept. Neither layer here is going to be perfect. They can only promise that all possibilities that have actually been implemented have been exhausted. And even among those only to the degree they can be done automatically within the engineering tradeoffs and constraints. There will always be cases like thin provisioned devices that an operator can expand, or degraded raid arrays that can be repaired after a long operation and so on. A network device can't be sure whether a remote server may eventually come back or not and have to be reconfigured by a human or system automation tool to point to the new server or new network configuration.

Right. This implies, though, that apart from the kernel having to keep the dirtied-but-unrecoverable pages around for an unbounded time, there's also an interface for obtaining the exact failed pages so that you can read them back.

No, the interface we have is fsync which gives us that information with the granularity of a single file. The database could in theory recognize that fsync is not completing on a file and read that file back and write it to a new file. More likely we would implement a feature Oracle has of writing key files to multiple devices. But currently in practice that's not what would happen, what would happen would be a human would recognize that the database has stopped being able to commit and there are hardware errors in the log and would stop the database, take a backup, and restore onto a new working device. The current interface is that there's one error and then Postgres would pretty much have to say, "sorry, your database is corrupt and the data is gone, restore from your backups". Which is pretty dismal.

There is a clear responsibility of the application to keep its buffers around until a successful fsync(). The kernels do report the error (albeit with all the complexities of dealing with the interface), at which point the application may not assume that the write()s were ever even buffered in the kernel page cache in the first place.

Postgres cannot just store the entire database in RAM. It writes things to the filesystem all the time. It calls fsync only when it needs a write barrier to ensure consistency. That's only frequent on the transaction log to ensure it's flushed before data modifications and then periodically to checkpoint the data files. The amount of data written between checkpoints can be arbitrarily large and Postgres has no idea how much memory is available as filesystem buffers or how much i/o bandwidth is available or what other memory pressure there is. What you're suggesting is that the application should have to babysit the filesystem buffer cache and reimplement all of it in user-space because the filesystem is free to throw away any data any time it chooses?

The current interface to throw away filesystem buffer cache is unmount. It sounds like the kernel would like a more granular way to discard just part of a device which makes a lot of sense in the age of large network block devices. But I don't think just saying that the filesystem buffer cache is now something every application needs to re-implement in user-space really helps with that, they're going to have the same problems to solve.


From:Greg Stark <stark(at)mit(dot)edu>
Date:2018-04-10 16:54:40

On 10 April 2018 at 02:59, Craig Ringer wrote:

Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.

I'm kind of puzzled by this. Surely NFS servers store the data in the filesystem using write(2) or the in-kernel equivalent? So if the server is backed by a filesystem where write(2) preallocates space, surely the NFS server must behave as if it's preallocating as well? I would expect NFS to provide basically the same set of possible failures as the underlying filesystem (as long as you don't enable nosync of course).


From:"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Date:2018-04-10 18:58:37

-hackers,

I reached out to the Linux ext4 devs; here is tytso(at)mit(dot)edu's response:

""" Hi Joshua,

This isn't actually an ext4 issue, but a long-standing VFS/MM issue.

There are going to be multiple opinions about what the right thing to do is. I'll try to give as unbiased a description as possible, but certainly some of this is going to be filtered by my own biases no matter how careful I can be.

First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentially be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations in the interim since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.

Which is why after a while, one can get quite paranoid and assume that the only way you can guarantee data robustness is to store multiple copies and/or use erasure encoding, with some of the copies or shards written to geographically diverse data centers.

Secondly, I think it's fair to say that the vast majority of the companies who require data robustness, and are either willing to pay $$$ to an enterprise distro company like Red Hat, or command a large enough paying customer base that they can afford to dictate terms to an enterprise distro, or hire a consultant such as Christoph, or have their own staffed Linux kernel teams, have tended to use O_DIRECT. So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.

Next, the reason why fsync() has the behaviour that it does is that one of the most common cases of I/O storage errors in buffered use cases, certainly as seen by the community distros, is the user who pulls out a USB stick while it is in use. In that case, if there are dirtied pages in the page cache, the question is what can you do? Sooner or later the writes will time out, and if you leave the pages dirty, then it effectively becomes a permanent memory leak. You can't unmount the file system --- that requires writing out all of the pages such that the dirty bit is turned off. And if you don't clear the dirty bit on an I/O error, then they can never be cleaned. You can't even re-insert the USB stick; the re-inserted USB stick will get a new block device. Worse, when the USB stick was pulled, it will have suffered a power drop, and see above about what could happen after a power drop for non-power fail certified flash devices --- it goes double for the cheap sh*t USB sticks found in the checkout aisle of Micro Center.

So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is no eagerness to solve the problem simply by "don't clear the dirty bit". For every one Postgres installation that might have a better recovery after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.

I can think of things that could be done --- for example, it could be switchable on a per-block device basis (or maybe a per-mount basis) whether or not the dirty bit gets cleared after the error is reported to userspace. And perhaps there could be a new unmount flag that causes all dirty pages to be wiped out, which could be used to recover after a permanent loss of the block device. But the question is who is going to invest the time to make these changes? If there is a company who is willing to pay to commission this work, it's almost certainly soluble. Or if a company which has a kernel team on staff is willing to direct an engineer to work on it, it certainly could be solved. But again, of the companies who have client code where we care about robustness and proper handling of failed disk drives, and which have a kernel team on staff, pretty much all of the ones I can think of (e.g., Oracle, Google, etc.) use O_DIRECT and they don't try to make buffered writes and error reporting via fsync(2) work well.

In general these companies want low-level control over buffer cache eviction algorithms, which drives them towards the design decision of effectively implementing the page cache in userspace, and using O_DIRECT reads/writes.

If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work. Let me know off-line if that's the case...

- Ted

"""


From:"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Date:2018-04-10 19:51:01

-hackers,

The thread is picking up over on the ext4 list. They don't update their archives as often as we do, so I can't link to the discussion. What would be the preferred method of sharing the info?

Thanks,


From:"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Date:2018-04-10 20:57:34

On 04/10/2018 12:51 PM, Joshua D. Drake wrote:

-hackers,

The thread is picking up over on the ext4 list. They don't update their archives as often as we do, so I can't link to the discussion. What would be the preferred method of sharing the info?

Thanks to Anthony for this link:

http://lists.openwall.net/linux-ext4/2018/04/10/33

It isn't quite real time but it keeps things close enough.


From:Jonathan Corbet <corbet(at)lwn(dot)net>
Date:2018-04-11 12:05:27

On Tue, 10 Apr 2018 17:40:05 +0200 Anthony Iliopoulos wrote:

LSF/MM'18 is upcoming and it would have been the perfect opportunity, but it's past the CFP deadline. It may still be worth contacting the organizers to bring forward the issue, and see if there is a chance to have someone from Pg invited for further discussions.

FWIW, it is my current intention to be sure that the development community is at least aware of the issue by the time LSFMM starts.

The event is April 23-25 in Park City, Utah. I bet that room could be found for somebody from the postgresql community, should there be somebody who would like to represent the group on this issue. Let me know if an introduction or advocacy from my direction would be helpful.


From:Greg Stark <stark(at)mit(dot)edu>
Date:2018-04-11 12:23:49

On 10 April 2018 at 19:58, Joshua D. Drake wrote:

You can't unmount the file system --- that requires writing out all of the pages such that the dirty bit is turned off.

I always wondered why Linux didn't implement umount -f. It's been in BSD since forever and it's a major annoyance that it's missing in Linux. Even without leaking memory it still leaks other resources, causes confusion and awkward workarounds in UI and automation software.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-11 14:29:09

Hi,

On 2018-04-11 06:05:27 -0600, Jonathan Corbet wrote:

The event is April 23-25 in Park City, Utah. I bet that room could be found for somebody from the postgresql community, should there be somebody who would like to represent the group on this issue. Let me know if an introduction or advocacy from my direction would be helpful.

If that room can be found, I might be able to make it. Being in SF, I'm probably the physically closest PG dev involved in the discussion.

Thanks for chiming in,


From:Jonathan Corbet <corbet(at)lwn(dot)net>
Date:2018-04-11 14:40:31

On Wed, 11 Apr 2018 07:29:09 -0700 Andres Freund wrote:

If that room can be found, I might be able to make it. Being in SF, I'm probably the physically closest PG dev involved in the discussion.

OK, I've dropped the PC a note; hopefully you'll be hearing from them.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-17 21:19:53

On Tue, Apr 10, 2018 at 05:54:40PM +0100, Greg Stark wrote:

On 10 April 2018 at 02:59, Craig Ringer wrote:

Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.

I'm kind of puzzled by this. Surely NFS servers store the data in the filesystem using write(2) or the in-kernel equivalent? So if the server is backed by a filesystem where write(2) preallocates space, surely the NFS server must behave as if it's preallocating as well? I would expect NFS to provide basically the same set of possible failures as the underlying filesystem (as long as you don't enable nosync of course).

I don't think the write is sent to the NFS server at the time of the write, so while the NFS side would reserve the space, it might not get the write request until after we return write success to the process.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-17 21:29:17

On Mon, Apr 9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:

On 04/09/2018 12:29 AM, Bruce Momjian wrote:

A crazy idea would be to have a daemon that checks the logs and stops Postgres when it sees something wrong.

That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how that would work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem), so parsing it reliably seems rather difficult. And we probably don't want to PANIC after an I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.

My more-considered crazy idea is to have a postgresql.conf setting like archive_command that allows the administrator to specify a command that will be run after fsync but before the checkpoint is marked as complete. While we can have write flush errors before fsync and never see the errors during fsync, we will not have write flush errors after fsync that are associated with previous writes.

The script should check for I/O or space-exhaustion errors and return false in that case, in which case we can stop, or stop and crash-recover. We could have an exit of 1 do the former, and an exit of 2 do the latter.

Also, if we are relying on WAL, we have to make sure WAL is actually safe with fsync, and I am betting only the O_DIRECT methods actually are safe:

    #wal_sync_method = fsync                # the default is the first option
                                            # supported by the operating system:
                                            #   open_datasync
                                     -->    #   fdatasync (default on Linux)
                                     -->    #   fsync
                                     -->    #   fsync_writethrough
                                            #   open_sync

I am betting the marked wal_sync_method methods are not safe since there is time between the write and fsync.
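The distinction being drawn here can be seen in a small C sketch: with the open_datasync-style methods the descriptor is opened with O_DSYNC and a durability failure comes back from write() itself, while the fdatasync/fsync-style methods leave a window between the write and the later flush call. The file name is illustrative, and this is not PostgreSQL's WAL code.

    /*
     * Sketch only: synchronous-on-write vs. write-then-flush.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char block[8192];
        memset(block, 0, sizeof(block));

        /* open_datasync style: each write() is synchronized before it returns. */
        int fd = open("wal_segment", O_CREAT | O_WRONLY | O_DSYNC, 0600);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }
        if (write(fd, block, sizeof(block)) != (ssize_t) sizeof(block))
        {
            perror("write");    /* an I/O failure is reported right here */
            close(fd);
            return 1;
        }
        close(fd);

        /* fdatasync style, for contrast: write() now, flush later, with a
         * window in between. */
        int fd2 = open("wal_segment", O_CREAT | O_WRONLY, 0600);
        if (fd2 < 0)
        {
            perror("open");
            return 1;
        }
        if (write(fd2, block, sizeof(block)) != (ssize_t) sizeof(block) ||
            fdatasync(fd2) != 0)
        {
            perror("write/fdatasync");
            close(fd2);
            return 1;
        }
        close(fd2);
        return 0;
    }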


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-17 21:32:45

On Mon, Apr 9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:

On 04/09/2018 12:29 AM, Bruce Momjian wrote:

A crazy idea would be to have a daemon that checks the logs and stops Postgres when it sees something wrong.

That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how that would work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem), so parsing it reliably seems rather difficult. And we probably don't want to PANIC after an I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.

Replying to your specific case, I am not sure how we would use a script to check for I/O errors/space-exhaustion if the postgres user doesn't have access to it. Does O_DIRECT work in such container cases?


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-17 21:34:53

On 2018-04-17 17:29:17 -0400, Bruce Momjian wrote:

Also, if we are relying on WAL, we have to make sure WAL is actually safe with fsync, and I am betting only the O_DIRECT methods actually are safe:

    > #wal_sync_method = fsync                # the default is the first option
    >                                         # supported by the operating system:
    >                                         #   open_datasync
    >                                  -->    #   fdatasync (default on Linux)
    >                                  -->    #   fsync
    >                                  -->    #   fsync_writethrough
    >                                         #   open_sync

I am betting the marked wal_sync_method methods are not safe since there is time between the write and fsync.

Hm? That's not really the issue though? One issue is that retries are not necessarily safe with buffered IO; the other is that fsync might not report an error if the fd was closed and reopened.

O_DIRECT is only used if wal archiving or streaming isn't used, which makes it pretty useless anyway.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-17 21:41:42

On 2018-04-17 17:32:45 -0400, Bruce Momjian wrote:

On Mon, Apr 9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:

That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how that would work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem), so parsing it reliably seems rather difficult. And we probably don't want to PANIC after an I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.

You can certainly have access to the kernel log in containers. I'd assume such a script wouldn't check various system logs but instead tail /dev/kmsg or such. Otherwise the variance between installations would be too big.

There aren't that many different types of error messages and they don't change that often. If we'd just detect errors for the most common FSs we'd probably be good. Detecting a few general storage-layer messages wouldn't be that hard either; most things have been unified over the last ~8-10 years.

Replying to your specific case, I am not sure how we would use a script to check for I/O errors/space-exhaustion if the postgres user doesn't have access to it.

Not sure what you mean?

Space exhaustion can be checked when allocating space, FWIW. We'd just need to use posix_fallocate et al.

Does O_DIRECT work in such container cases?

Yes.
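A rough sketch of the /dev/kmsg-tailing idea mentioned above. The strings matched and the reaction (printing a warning) are placeholders; a real monitor would still need to map messages to the devices PostgreSQL actually uses before deciding to alert or restart anything.

    /*
     * Sketch only: watch the kernel log for common block-layer error
     * strings.  Each read() on /dev/kmsg returns exactly one log record.
     */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/kmsg", O_RDONLY);
        if (fd < 0)
        {
            perror("open /dev/kmsg");
            return 1;
        }

        char rec[8192];
        for (;;)
        {
            ssize_t n = read(fd, rec, sizeof(rec) - 1);
            if (n < 0)
            {
                if (errno == EPIPE)     /* records were overwritten; keep going */
                    continue;
                perror("read /dev/kmsg");
                return 1;
            }
            rec[n] = '\0';
            if (strstr(rec, "I/O error") ||
                strstr(rec, "Buffer I/O error") ||
                strstr(rec, "lost page write"))
                fprintf(stderr, "possible storage failure: %s", rec);
        }
    }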


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-17 21:49:42

On Mon, Apr 9, 2018 at 12:25:33PM -0700, Peter Geoghegan wrote:

On Mon, Apr 9, 2018 at 12:13 PM, Andres Freund wrote:

Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as are some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near-irresolvable corner case than anything else.

+1

We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.

Right. We seem to be implicitly assuming that there is a big difference between a problem in the storage layer that we could in principle detect, but don't, and any other problem in the storage layer. I've read articles claiming that technologies like SMART are not really reliable in a practical sense [1], so it seems to me that there is reason to doubt that this gap is all that big.

That said, I suspect that the problems with running out of disk space are serious practical problems. I have personally scoffed at stories involving Postgres database corruption that gets attributed to running out of disk space. Looks like I was dead wrong.

Yes, I think we need to look at user expectations here.

If the device has a hardware write error, it is true that it is good to detect it, and it might be permanent or temporary, e.g. NAS/NFS. The longer the error persists, the more likely the user will expect corruption. However, right now, any length outage could cause corruption, and it will not be reported in all cases.

Running out of disk space is also something you don't expect to corrupt your database --- you expect it to only prevent future writes. It seems NAS/NFS and any thin provisioned storage will have this problem, and again, not always reported.

So, our initial action might just be to educate users that write errors can cause silent corruption, and out-of-space errors on NAS/NFS and any thin provisioned storage can cause corruption.

Kernel logs (not just Postgres logs) should be monitored for these issues and fail-over/recovering might be necessary.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-18 09:52:22

On Tue, Apr 17, 2018 at 02:34:53PM -0700, Andres Freund wrote:

On 2018-04-17 17:29:17 -0400, Bruce Momjian wrote:

Also, if we are relying on WAL, we have to make sure WAL is actually safe with fsync, and I am betting only the O_DIRECT methods actually are safe:

> > #wal_sync_method = fsync                # the default is the first option
> >                                         # supported by the operating system:
> >                                         #   open_datasync
> >                                  -->    #   fdatasync (default on Linux)
> >                                  -->    #   fsync
> >                                  -->    #   fsync_writethrough
> >                                         #   open_sync

I am betting the marked wal_sync_method methods are not safe since there is time between the write and fsync.

Hm? That's not really the issue though? One issue is that retries are not necessarily safe in buffered IO, the other that fsync might not report an error if the fd was closed and opened.

Well, we have been focusing on the delay between backend or checkpoint writes and checkpoint fsyncs. My point is that we have the same problem in doing a write, then fsync for the WAL. Yes, the delay is much shorter, but the issue still exists. I realize that newer Linux kernels will not have the problem since the file descriptor remains open, but the problem exists with older/common Linux kernels.

O_DIRECT is only used if wal archiving or streaming isn't used, which makes it pretty useless anyway.

Uh, don't 'open_datasync' and 'open_sync' fsync as part of the write, meaning we can't lose the error report like we can with the others?
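
For illustration, roughly what the open_datasync/open_sync idea means at the syscall level (a sketch, not PostgreSQL's WAL code; the file name is made up):

    /* With O_DSYNC the data (and the metadata needed to read it back) must
     * be on stable storage before write() returns, so a flush failure shows
     * up as a failed write() instead of at a later fsync(). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char page[8192];
        memset(page, 0, sizeof(page));

        int fd = open("wal-segment", O_WRONLY | O_CREAT | O_DSYNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        ssize_t n = write(fd, page, sizeof(page));
        if (n != (ssize_t) sizeof(page)) {
            perror("synchronous WAL write failed");   /* the error cannot be "lost" later */
            return 1;
        }
        close(fd);
        return 0;
    }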


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-18 10:04:30

On 18 April 2018 at 05:19, Bruce Momjian wrote:

On Tue, Apr 10, 2018 at 05:54:40PM +0100, Greg Stark wrote:

On 10 April 2018 at 02:59, Craig Ringer wrote:

Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.

I'm kind of puzzled by this. Surely NFS servers store the data in the filesystem using write(2) or the in-kernel equivalent? So if the server is backed by a filesystem where write(2) preallocates space surely the NFS server must behave as if it's preallocating as well? I would expect NFS to provide basically the same set of possible failures as the underlying filesystem (as long as you don't enable nosync of course).

I don't think the write is sent to the NFS server at the time of the write, so while the NFS side would reserve the space, it might not get the write request until after we return write success to the process.

It should be sent if you're using sync mode.

From my reading of the docs, if you're using async mode you're already open to so many potential corruptions you might as well not bother.

I need to look into this more re NFS and expand the tests I have to cover that properly.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-18 10:19:28

On 10 April 2018 at 20:15, Craig Ringer wrote:

On 10 April 2018 at 14:10, Michael Paquier wrote:

Well, I think that there is room for improving the reporting of failures in file_utils.c for frontends, or at worst having an exit() for any kind of critical failure equivalent to a PANIC.

Yup.

In the mean time, speaking of PANIC, here's the first cut patch to make Pg panic on fsync() failures. I need to do some closer review and testing, but it's presented here for anyone interested.

I intentionally left some failures as ERROR not PANIC, where the entire operation is done as a unit, and an ERROR will cause us to retry the whole thing.

For example, when we fsync() a temp file before we move it into place, there's no point panicking on failure, because we'll discard the temp file on ERROR and retry the whole thing.

I've verified that it works as expected with some modifications to the test tool I've been using (pushed).

The main downside is that if we panic in redo, we don't try again. We throw our toys and shut down. But arguably if we get the same I/O error again in redo, that's the right thing to do anyway, and quite likely safer than continuing to ERROR on checkpoints indefinitely.

Patch attached.

To be clear, this patch only deals with the issue of us retrying fsyncs when it turns out to be unsafe. This does NOT address any of the issues where we won't find out about writeback errors at all.

Thinking about this some more, it'll definitely need a GUC to force it to continue despite a potential hazard. Otherwise we go backwards from the status quo if we're in a position where uptime is vital and correctness problems can be tolerated or repaired later. Kind of like zero_damaged_pages, we'll need some sort of continue_after_fsync_errors.

Without that, we'll panic once, enter redo, and if the problem persists we'll panic in redo and exit the startup process. That's not going to help users.

I'll amend the patch accordingly as time permits.
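
For illustration, the shape of that idea reduced to plain C (this is not the attached patch; PostgreSQL would use its ereport(PANIC, ...) machinery rather than abort()):

    /* On the first fsync() failure, crash and rely on WAL replay instead of
     * retrying, because a retried fsync() may falsely report success once
     * the kernel has dropped the failed dirty pages. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static void fsync_or_panic(int fd, const char *path)
    {
        if (fsync(fd) != 0) {
            fprintf(stderr, "PANIC: could not fsync %s: %s\n", path, strerror(errno));
            abort();   /* force crash recovery; do not retry the fsync */
        }
    }

    int main(void)
    {
        int fd = open("some-data-file", O_RDWR | O_CREAT, 0600);
        if (fd < 0) { perror("open"); return 1; }
        fsync_or_panic(fd, "some-data-file");
        close(fd);
        return 0;
    }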


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-18 11:46:15

On Wed, Apr 18, 2018 at 06:04:30PM +0800, Craig Ringer wrote:

On 18 April 2018 at 05:19, Bruce Momjian wrote:

On Tue, Apr 10, 2018 at 05:54:40PM +0100, Greg Stark wrote:

On 10 April 2018 at 02:59, Craig Ringer wrote:

Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.

I'm kind of puzzled by this. Surely NFS servers store the data in the filesystem using write(2) or the in-kernel equivalent? So if the server is backed by a filesystem where write(2) preallocates space surely the NFS server must behave as if it'spreallocating as well? I would expect NFS to provide basically the same set of possible failures as the underlying filesystem (as long as you don't enable nosync of course).

I don't think the write is sent to the NFS at the time of the write, so while the NFS side would reserve the space, it might get the write request until after we return write success to the process.

It should be sent if you're using sync mode.

From my reading of the docs, if you're using async mode you're already open to so many potential corruptions you might as well not bother.

I need to look into this more re NFS and expand the tests I have to cover that properly.

So, if sync mode passes the write to NFS, and NFS pre-reserves write space, and throws an error on reservation failure, that means that NFS will not corrupt a cluster on out-of-space errors.

So, what about thin provisioning? I can understand sharing free space among file systems, but once a write arrives I assume it reserves the space. Is the problem that many thin provisioning systems don't have a sync mode, so you can't force the write to appear on the device before an fsync?


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-18 11:56:57

On Tue, Apr 17, 2018 at 02:41:42PM -0700, Andres Freund wrote:

On 2018-04-17 17:32:45 -0400, Bruce Momjian wrote:

On Mon, Apr 9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:

That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how would that work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem) so parsing it reliably seems rather difficult. And we probably don't want to PANIC after I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.

You can certainly have access to the kernel log in containers. I'd assume such a script wouldn't check various system logs but instead tail /dev/kmsg or such. Otherwise the variance between installations would be too big.

I was thinking 'dmesg', but the result is similar.

There are not that many different types of error messages and they don't change that often. If we just detect errors for the most common FSs we'd probably be good. Detecting a few general storage layer messages wouldn't be that hard either; most things have been unified over the last ~8-10 years.

It is hard to know exactly what the message format should be for each operating system because it is hard to generate them on demand, and we would need to filter based on Postgres devices.

The other issue is that once you see a message during a checkpoint and exit, you don't want to see that message again after the problem has been fixed and the server restarted. The simplest solution is to save the output of the last check and look for only new entries. I am attaching a script I run every 15 minutes from cron that emails me any unexpected kernel messages.

I am thinking we would need a contrib module with sample scripts for various operating systems.

Replying to your specific case, I am not sure how we would use a script to check for I/O errors/space-exhaustion if the postgres user doesn't have access to it.

Not sure what you mean?

Space exhaustion can be checked when allocating space, FWIW. We'd just need to use posix_fallocate et al.

I was asking about cases where permissions prevent viewing of kernel messages. I think you can view them in containers, but in virtual machines you might not have access to the host operating system's kernel messages, and that might be where they are.

    Attachment      Content-Type    Size
    dmesg_check     text/plain      574 bytes

From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-18 12:45:53

On 18 April 2018 at 19:46, Bruce Momjian wrote:

So, if sync mode passes the write to NFS, and NFS pre-reserves write space, and throws an error on reservation failure, that means that NFS will not corrupt a cluster on out-of-space errors.

Yeah. I need to verify in a concrete test case.

The thing is that write() is allowed to be asynchronous anyway. Most file systems choose to implement eager reservation of space, but it's not mandated. AFAICS that's largely a historical accident to keep applications happy, because FSes used to allocate the space at write() time too, and when they moved to delayed allocations, apps tended to break too easily unless they at least reserved space. NFS would have to do a round-trip on write() to reserve space.

The Linux man pages (http://man7.org/linux/man-pages/man2/write.2.html) say:

A successful return from write() does not make any guarantee that data has been committed to disk. On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data. In this case, some errors might be delayed until a future write(2), fsync(2), or even close(2). The only way to be sure is to call fsync(2) after you are done writing all your data.

... and I'm inclined to believe it when it refuses to make guarantees. Especially lately.

So, what about thin provisioning? I can understand sharing free space among file systems

Most thin provisioning is done at the block level, not file system level. So the FS is usually unaware it's on a thin-provisioned volume. Usually the whole kernel is unaware, because the thin provisioning is done on the SAN end or by a hypervisor. But the same sort of thing may be done via LVM - see lvmthin. For example, you may make 100 different 1TB ext4 FSes, each on 1TB iSCSI volumes backed by SAN with a total of 50TB of concrete physical capacity. The SAN is doing block mapping and only allocating storage chunks to a given volume when the FS has written blocks to every previous free block in the previous storage chunk. It may also do things like block de-duplication, compression of storage chunks that aren't written to for a while, etc.

The idea is that when the SAN's actual physically allocated storage gets to 40TB it starts telling you to go buy another rack of storage so you don't run out. You don't have to resize volumes, resize file systems, etc. All the storage space admin is centralized on the SAN and storage team, and your sysadmins, DBAs and app devs are none the wiser. You buy storage when you need it, not when the DBA demands they need a 200% free space margin just in case. Whether or not you agree with this philosophy or think it's sensible is kind of moot, because it's an extremely widespread model, and servers you work on may well be backed by thin provisioned storage even if you don't know it.

Think of it as a bit like VM overcommit, for storage. You can malloc() as much memory as you like and everything's fine until you try to actually use it. Then you go to dirty a page, no free pages are available, and boom.

The thing is, the SAN (or LVM) doesn't have any idea about the FS's internal in-memory free space counter and its space reservations. Nor does it understand any FS metadata. All it cares about is "has this LBA ever been written to by the FS?". If so, it must make sure backing storage for it exists. If not, it won't bother.

Most FSes only touch the blocks on dirty writeback, or sometimes lazily as part of delayed allocation. So if your SAN is running out of space and there's 100MB free, each of your 100 FSes may have decremented its freelist by 2MB and be happily promising more space to apps on write() because, well, as far as they know they're only 50% full. When they all do dirty writeback and flush to storage, kaboom, there's nowhere to put some of the data.

I don't know if posix_fallocate is a sufficient safeguard either. You'd have to actually force writes to each page through to the backing storage to know for sure the space existed. Yes, the docs say

After a successful call to posix_fallocate(), subsequent writes to bytes in the specified range are guaranteed not to fail because of lack of disk space.

... but they're speaking from the filesystem's perspective. If the FS doesn't dirty and flush the actual blocks, a thin provisioned storage system won't know.
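
For illustration, what "force the writes through" could look like when extending a file (a sketch; the file name and 1 MB extension size are made up, and it still assumes the storage stack honours flushes):

    /* Extend a file with real zero writes plus fdatasync() so a
     * thin-provisioned volume has to allocate backing blocks now,
     * rather than at some later background writeback. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char zeros[8192];
        memset(zeros, 0, sizeof(zeros));

        int fd = open("datafile", O_WRONLY | O_CREAT | O_APPEND, 0600);
        if (fd < 0) { perror("open"); return 1; }

        for (int i = 0; i < 128; i++) {                  /* 128 * 8 kB = 1 MB */
            if (write(fd, zeros, sizeof(zeros)) != (ssize_t) sizeof(zeros)) {
                perror("write"); return 1;
            }
        }
        if (fdatasync(fd) != 0) {                        /* ENOSPC from the backing store lands here */
            perror("fdatasync"); return 1;
        }
        close(fd);
        return 0;
    }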

It's reasonable enough to throw up our hands in this case and say "your setup is crazy, you're breaking the rules, don't do that". The truth is they AREN'T breaking the rules, but we can disclaim support for such configurations anyway.

After all, we tell people not to use Linux's VM overcommit too. How's that working for you? I see it enabled on the great majority of systems I work with, and some people are very reluctant to turn it off because they don't want to have to add swap.

If someone has a 50TB SAN and wants to allow for unpredictable space use expansion between various volumes, and we say "you can't do that, go buy a 100TB SAN instead" ... that's not going to go down too well either. Often we can actually say "make sure the 5TB volume PostgreSQL is using is eagerly provisioned, and expand it at need using online resize if required. We don't care about the rest of the SAN.".

I guarantee you that when you create a 100GB EBS volume on AWS EC2, you don't get 100GB of storage preallocated. AWS are probably pretty good about not running out of backing store, though.

There are file systems optimised for thin provisioning, etc, too. But that's more commonly done by having them do things like zero deallocated space so the thin provisioning system knows it can return it to the free pool, and now things like DISCARD provide much of that signalling in a standard way.


From:Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
Date:2018-04-18 23:31:50

On 19/04/18 00:45, Craig Ringer wrote:

I guarantee you that when you create a 100GB EBS volume on AWS EC2, you don't get 100GB of storage preallocated. AWS are probably pretty good about not running out of backing store, though.

Some db folks (used to anyway) advise dd'ing to your freshly attached devices on AWS (for performance mainly IIRC), but that would help prevent some failure scenarios for any thin provisioned storage (but probably really annoy the admins thereof).


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-19 00:44:33

On 19 April 2018 at 07:31, Mark Kirkwood wrote:

On 19/04/18 00:45, Craig Ringer wrote:

I guarantee you that when you create a 100GB EBS volume on AWS EC2, you don't get 100GB of storage preallocated. AWS are probably pretty good about not running out of backing store, though.

Some db folks (used to anyway) advise dd'ing to your freshly attached devices on AWS (for performance mainly IIRC), but that would help prevent some failure scenarios for any thin provisioned storage (but probably really annoy the admins thereof).

This still makes a lot of sense on AWS EBS, particularly when using a volume created from a non-empty snapshot. Performance of S3-snapshot based EBS volumes is spectacularly awful, since they're copy-on-read. Reading the whole volume helps a lot.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-20 20:49:08

On Wed, Apr 18, 2018 at 08:45:53PM +0800, Craig Ringer wrote:

On 18 April 2018 at 19:46, Bruce Momjian wrote:

So, if sync mode passes the write to NFS, and NFS pre-reserves write space, and throws an error on reservation failure, that means that NFS will not corrupt a cluster on out-of-space errors.

Yeah. I need to verify in a concrete test case.

Thanks.

The thing is that write() is allowed to be asynchronous anyway. Most file systems choose to implement eager reservation of space, but it's not mandated. AFAICS that's largely a historical accident to keep applications happy, because FSes used to allocate the space at write() time too, and when they moved to delayed allocations, apps tended to break too easily unless they at least reserved space. NFS would have to do a round-trip on write() to reserve space.

The Linux man pages (http://man7.org/linux/man-pages/man2/write.2.html) say:

" A successful return from write() does not make any guarantee that data has been committed to disk. On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data. In this case, some errors might be delayed until a future write(2), fsync(2), or even close(2). The only way to be sure is to call fsync(2) after you are done writing all your data. "

... and I'm inclined to believe it when it refuses to make guarantees. Especially lately.

Uh, even calling fsync after write isn't 100% safe since the kernel could have flushed the dirty pages to storage, and failed, and the fsync would later succeed. I realize newer kernels have that fixed for files open during that operation, but that is the minority of installs.

The idea is that when the SAN's actual physically allocated storage gets to 40TB it starts telling you to go buy another rack of storage so you don't run out. You don't have to resize volumes, resize file systems, etc. All the storage space admin is centralized on the SAN and storage team, and your sysadmins, DBAs and app devs are none the wiser. You buy storage when you need it, not when the DBA demands they need a 200% free space margin just in case. Whether or not you agree with this philosophy or think it's sensible is kind of moot, because it's an extremely widespread model, and servers you work on may well be backed by thin provisioned storage even if you don't know it.

Most FSes only touch the blocks on dirty writeback, or sometimes lazily as part of delayed allocation. So if your SAN is running out of space and there's 100MB free, each of your 100 FSes may have decremented its freelist by 2MB and be happily promising more space to apps on write() because, well, as far as they know they're only 50% full. When they all do dirty writeback and flush to storage, kaboom, there's nowhere to put some of the data.

I see what you are saying --- that the kernel is reserving the write space from its free space, but the free space doesn't all exist. I am not sure how we can tell people to make sure the file system free space is real.

You'd have to actually force writes to each page through to the backing storage to know for sure the space existed. Yes, the docs say

" After a successful call to posix_fallocate(), subsequent writes to bytes in the specified range are guaranteed not to fail because of lack of disk space. "

... but they're speaking from the filesystem's perspective. If the FS doesn't dirty and flush the actual blocks, a thin provisioned storage system won't know.

Frankly, in what cases will a write fail for lack of free space? It could be a new WAL file (not recycled), or pages added to the end of a table.

Is that it? It doesn't sound too terrible. If we can eliminate the corruption due to free space exhaustion, it would be a big step forward.

The next most common failure would be temporary storage failure or storage communication failure.

Permanent storage failure is "game over" so we don't need to worry about that.


From:Gasper Zejn <zejn(at)owca(dot)info>
Date:2018-04-21 19:21:39

Just for the record, I tried the test case with ZFS on an Ubuntu 17.10 host with ZFS on Linux 0.6.5.11.

ZFS does not swallow the fsync error, but the system does not handle the error nicely: the test case program hangs on fsync, the load jumps up and there's a bunch of z_wr_iss and z_null_int kernel threads belonging to zfs, eating up the CPU.

Even then I managed to reboot the system, so it's not a complete and utter mess.

The test case adjustments are here: https://github.com/zejn/scrapcode/commit/e7612536c346d59a4b69bedfbcafbe8c1079063c

Kind regards,


On 29. 03. 2018 07:25, Craig Ringer wrote:

On 29 March 2018 at 13:06, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:

On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby wrote:
> The retries are the source of the problem; the first fsync() can return EIO,
> and also *clears the error* causing a 2nd fsync (of the same data) to return
> success.

> What I'm failing to grok here is how that error flag even matters,
> whether it's a single bit or a counter as described in that patch.  If
> write back failed, *the page is still dirty*.  So all future calls to
> fsync() need to try to flush it again, and (presumably) fail
> again (unless it happens to succeed this time around).

You'd think so. But it doesn't appear to work that way. You can see for yourself with the error device-mapper destination mapped over part of a volume.

I wrote a test case here.

https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear.c

I don't pretend the kernel behaviour is sane. And it's possible I've made an error in my analysis. But since I've observed this in the wild, and seen it in a test case, I strongly suspect that what I've described is just what's happening, brain-dead or no.

Presumably the kernel marks the page clean when it dispatches it to the I/O subsystem and doesn't dirty it again on I/O error? I haven't dug that deep on the kernel side. See the stackoverflow post for details on what I found in kernel code analysis.
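
For illustration, the shape of what that test case demonstrates (this is not the linked tool; it assumes the file sits on a device-mapper error mapping or similar so that background writeback fails):

    /* Sketch of the behaviour under discussion: the first fsync() after a
     * writeback failure returns EIO, but a second fsync() of the same data
     * can return success even though the data never reached storage. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/mnt/faulty/testfile";  /* made-up path */
        char buf[4096];
        memset(buf, 'x', sizeof(buf));

        int fd = open(path, O_WRONLY | O_CREAT, 0600);
        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
            perror("write");                  /* usually succeeds: it only dirties page cache */
        sleep(5);                             /* give background writeback time to hit the bad region */

        if (fsync(fd) != 0)
            printf("first  fsync: %s\n", strerror(errno));   /* expect EIO */
        if (fsync(fd) == 0)
            printf("second fsync: success (error state was cleared)\n");
        return 0;
    }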


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-23 20:14:48

Hi,

On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:

TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means "all writes since the last fsync have hit disk" but we assume it means "all writes since the last SUCCESSFUL fsync have hit disk".

But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() cleared the AS_EIO bad page flag.

Random other thing we should look at: Some filesystems (nfs yes, xfs ext4 no) flush writes at close(2). We check close() return code, just log it... So close() counts as an fsync for such filesystems.

I'm at LSF/MM to discuss future behaviour of Linux here, but that's how it is right now.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-24 00:09:23

On Mon, Apr 23, 2018 at 01:14:48PM -0700, Andres Freund wrote:

Hi,

On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:

TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means "all writes since the last fsync have hit disk" but we assume it means "all writes since the last SUCCESSFUL fsync have hit disk".

But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() cleared the AS_EIO bad page flag.

Random other thing we should look at: Some filesystems (nfs yes, xfs ext4 no) flush writes at close(2). We check close() return code, just log it... So close() counts as an fsync for such filesystems.

Well, that's interesting. You might remember that NFS does not reserve space for writes like local file systems like ext4/xfs do. For that reason, we might be able to capture the out-of-space error on close and exit sooner for NFS.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-26 02:16:52

On 24 April 2018 at 04:14, Andres Freund wrote:

I'm at LSF/MM to discuss future behaviour of Linux here, but that's how it is right now.

Interim LWN.net coverage of that can be found here: https://lwn.net/Articles/752613/


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-27 01:18:55

On Tue, Apr 24, 2018 at 12:09 PM, Bruce Momjian wrote:

On Mon, Apr 23, 2018 at 01:14:48PM -0700, Andres Freund wrote:

Hi,

On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:

TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means "all writes since the last fsync have hit disk" but we assume it means "all writes since the last SUCCESSFUL fsync have hit disk".

But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() cleared the AS_EIO bad page flag.

Random other thing we should look at: Some filesystems (nfs yes, xfs ext4 no) flush writes at close(2). We check close() return code, just log it... So close() counts as an fsync for such filesystems.

Well, that's interesting. You might remember that NFS does not reserve space for writes like local file systems like ext4/xfs do. For that reason, we might be able to capture the out-of-space error on close and exit sooner for NFS.

It seems like some implementations flush on close and therefore discover the ENOSPC problem at that point, unless they have NFSv4 (RFC 3050) "write delegation" with a promise from the server that a certain amount of space is available. It seems like you can't count on that in any way though, because it's the server that decides when to delegate and how much space to promise is preallocated, not the client. So in userspace you always need to be able to handle errors including ENOSPC returned by close(), and if you ignore that and you're using an operating system that immediately incinerates all evidence after telling you that (so that later fsync() doesn't fail), you're in trouble.

Some relevant code:

It looks like the bleeding edge of the NFS spec includes a new ALLOCATE operation that should be able to support posix_fallocate() (if we were to start using that for extending files):

https://tools.ietf.org/html/rfc7862#page-64

I'm not sure how reliable [posix_]fallocate is on NFS in general though, and it seems that there are fall-back implementations of posix_fallocate() that write zeros (or even just feign success?) which probably won't do anything useful here if not also flushed (that fallback strategy might only work on eager reservation filesystems that don't have direct fallocate support?) so there are several layers (libc, kernel, nfs client, nfs server) that'd need to be aligned for that to work, and it's not clear how a humble userspace program is supposed to know if they are.

I guess if you could find a way to amortise the cost of extending (like Oracle et al do by extending big container datafiles 10MB at a time or whatever), then simply writing zeros and flushing when doing that might work out OK, so you wouldn't need such a thing? (Unless of course it's a COW filesystem, but that's a different can of worms.)


This thread continues on the ext4 mailing list:


From:   "Joshua D. Drake" <[email protected]>
Subject: fsync() errors is unsafe and risks data loss
Date:   Tue, 10 Apr 2018 09:28:15 -0700

-ext4,

If this is not the appropriate list please point me in the right direction. I am a PostgreSQL contributor and we have come across a reliability problem with writes and fsync(). You can see the thread here:

https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#[email protected]

The tl;dr; in the first message doesn't quite describe the problem as we started to dig into it further.


From:   "Darrick J. Wong" <[email protected]>
Date:   Tue, 10 Apr 2018 09:54:43 -0700

On Tue, Apr 10, 2018 at 09:28:15AM -0700, Joshua D. Drake wrote:

-ext4,

If this is not the appropriate list please point me in the right direction. I am a PostgreSQL contributor and we have come across a reliability problem with writes and fsync(). You can see the thread here:

https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#[email protected]

The tl;dr; in the first message doesn't quite describe the problem as we started to dig into it further.

You might try the XFS list ([email protected]) seeing as the initial complaint is against xfs behaviors...


From:   "Joshua D. Drake" <[email protected]>
Date:   Tue, 10 Apr 2018 09:58:21 -0700

On 04/10/2018 09:54 AM, Darrick J. Wong wrote:

On Tue, Apr 10, 2018 at 09:28:15AM -0700, Joshua D. Drake wrote:

-ext4,

If this is not the appropriate list please point me in the right direction. I am a PostgreSQL contributor and we have come across a reliability problem with writes and fsync(). You can see the thread here:

https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#[email protected]

The tl;dr; in the first message doesn't quite describe the problem as we started to dig into it further.

You might try the XFS list ([email protected]) seeing as the initial complaint is against xfs behaviors...

Later in the thread it becomes apparent that it applies to ext4 (NFS too) as well. I picked ext4 because I assumed it is the most populated of the lists since it's the default filesystem for most distributions.


From:   "Theodore Y. Ts'o" <[email protected]>
Date:   Tue, 10 Apr 2018 14:43:56 -0400

Hi Joshua,

This isn't actually an ext4 issue, but a long-standing VFS/MM issue.

There are going to be multiple opinions about what the right thing to do is. I'll try to give as unbiased a description as possible, but certainly some of this is going to be filtered by my own biases no matter how careful I can be.

First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentially be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations in the time since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.

Which is why after a while, one can get quite paranoid and assume that the only way you can guarantee data robustness is to store multiple copies and/or use erasure encoding, with some of the copies or shards written to geographically diverse data centers.

Secondly, I think it's fair to say that the vast majority of the companies who require data robustness, and are either willing to pay $$$ to an enterprise distro company like Red Hat, or command a large enough paying customer base that they can afford to dictate terms to an enterprise distro, or hire a consultant such as Christoph, or have their own staffed Linux kernel teams, have tended to use O_DIRECT. So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.

Next, the reason why fsync() has the behaviour that it does is that one of the most common cases of I/O storage errors in buffered use cases, certainly as seen by the community distros, is the user who pulls out a USB stick while it is in use. In that case, if there are dirtied pages in the page cache, the question is what can you do? Sooner or later the writes will time out, and if you leave the pages dirty, then it effectively becomes a permanent memory leak. You can't unmount the file system --- that requires writing out all of the pages such that the dirty bit is turned off. And if you don't clear the dirty bit on an I/O error, then they can never be cleaned. You can't even re-insert the USB stick; the re-inserted USB stick will get a new block device. Worse, when the USB stick was pulled, it will have suffered a power drop, and see above about what could happen after a power drop for non-power fail certified flash devices --- it goes double for the cheap sh*t USB sticks found in the checkout aisle of Micro Center.

So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is no eagerness to solve the problem simply by "don't clear the dirty bit". For every one Postgres installation that might have a better recovery after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.

I can think of things that could be done --- for example, it could be switchable on a per-block device basis (or maybe a per-mount basis) whether or not the dirty bit gets cleared after the error is reported to userspace. And perhaps there could be a new unmount flag that causes all dirty pages to be wiped out, which could be used to recover after a permanent loss of the block device. But the question is who is going to invest the time to make these changes? If there is a company who is willing to pay to commission this work, it's almost certainly soluble. Or if a company which has a kernel team on staff is willing to direct an engineer to work on it, it certainly could be solved. But again, of the companies who have client code where we care about robustness and proper handling of failed disk drives, and which have a kernel team on staff, pretty much all of the ones I can think of (e.g., Oracle, Google, etc.) use O_DIRECT and they don't try to make buffered writes and error reporting via fsync(2) work well.

In general these companies want low-level control over buffer cache eviction algorithms, which drives them towards the design decision of effectively implementing the page cache in userspace, and using O_DIRECT reads/writes.

If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work. Let me know off-line if that's the case...


From:   Andreas Dilger <[email protected]>
Date:   Tue, 10 Apr 2018 13:44:48 -0600

On Apr 10, 2018, at 10:50 AM, Joshua D. Drake [email protected] wrote:

-ext4,

If this is not the appropriate list please point me in the right direction. I am a PostgreSQL contributor and we have come across a reliability problem with writes and fsync(). You can see the thread here:

https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#[email protected]

The tl;dr; in the first message doesn't quite describe the problem as we started to dig into it further.

Yes, this is a very long thread. The summary is Postgres is unhappy that fsync() on Linux (and also other OSes) returns an error once if there was a prior write() failure, instead of keeping dirty pages in memory forever and trying to rewrite them.

This behaviour has existed on Linux forever, and (for better or worse) is the only reasonable behaviour that the kernel can take. I've argued for the opposite behaviour at times, and some subsystems already do limited retries before finally giving up on a failed write, though there are also times when retrying at lower levels is pointless if a higher level of code can handle the failure (e.g. mirrored block devices, filesystem data mirroring, userspace data mirroring, or cross-node replication).

The confusion is whether fsync() is a "level" state (return error forever if there were pages that could not be written), or an "edge" state (return error only for any write failures since the previous fsync() call).

I think Anthony Iliopoulos was pretty clear in his multiple descriptions in that thread of why the current behaviour is needed (OOM of the whole system if dirty pages are kept around forever), but many others were stuck on "I can't believe this is happening??? This is totally unacceptable and every kernel needs to change to match my expectations!!!" without looking at the larger picture of what is practical to change and where the issue should best be fixed.

Regardless of why this is the case, the net is that PG needs to deal with all of the systems that currently exist that have this behaviour, even if some day in the future it may change (though that is unlikely). It seems ironic that "keep dirty pages in userspace until fsync() returns success" is totally unacceptable, but "keep dirty pages in the kernel" is fine. My (limited) understanding of databases was that they preferred to cache everything in userspace and use O_DIRECT to write to disk (which returns an error immediately if the write fails and does not double buffer data).


From:   Martin Steigerwald <[email protected]>
Date:   Tue, 10 Apr 2018 21:47:21 +0200

Hi Theodore, Darrick, Joshua.

CC´d fsdevel as it does not appear to be Ext4 specific to me (and to you as well, Theodore).

Theodore Y. Ts'o - 10.04.18, 20:43:

This isn't actually an ext4 issue, but a long-standing VFS/MM issue. […] First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentially be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations in the time since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.

Guh. I was not aware of this. I knew consumer-grade SSDs often do not have power loss protection, but still thought they´d handle garbage collection in an atomic way. Sometimes I am tempted to sing an "all hardware is crap" song (starting with Meltdown/Spectre, then probably heading over to storage devices and so on… including firmware crap like Intel ME).

Next, the reason why fsync() has the behaviour that it does is that one of the most common cases of I/O storage errors in buffered use cases, certainly as seen by the community distros, is the user who pulls out a USB stick while it is in use. In that case, if there are dirtied pages in the page cache, the question is what can you do? Sooner or later the writes will time out, and if you leave the pages dirty, then it effectively becomes a permanent memory leak. You can't unmount the file system --- that requires writing out all of the pages such that the dirty bit is turned off. And if you don't clear the dirty bit on an I/O error, then they can never be cleaned. You can't even re-insert the USB stick; the re-inserted USB stick will get a new block device. Worse, when the USB stick was pulled, it will have suffered a power drop, and see above about what could happen after a power drop for non-power fail certified flash devices --- it goes double for the cheap sh*t USB sticks found in the checkout aisle of Micro Center.

From the original PostgreSQL mailing list thread I did not get how exactly FreeBSD differs in behavior compared to Linux. I am aware of one operating system that from a user point of view handles this in almost the right way IMHO: AmigaOS.

When you removed a floppy disk from the drive while the OS was writing to it, it showed a "You MUST insert volume somename into drive somedrive:" and if you did, it just continued writing. (The part that did not work well was that with the original filesystem if you did not insert it back, the whole disk was corrupted, usually to the point beyond repair, so the "MUST" was no joke.)

In my opinion from a user´s point of view this is the only sane way to handle the premature removal of removable media. I have read of a GSoC project to implement something like this for NetBSD but I did not check on the outcome of it. But in MS-DOS I think there has been something similar, however MS-DOS is not a multitasking operating system as AmigaOS is.

Implementing something like this for Linux would be quite a feat, I think, because in addition to the implementation in the kernel, the desktop environment or whatever other userspace you use would need to handle it as well, so you´d have to adapt udev / udisks / probably Systemd. And probably this behavior needs to be restricted to anything that is really removable and even then, in order to prevent memory exhaustion in case processes continue to write to a removed and not yet re-inserted USB harddisk, the kernel would need to halt I/O processes which dirty I/O to this device. (I believe this is what AmigaOS did. It just blocked all subsequent I/O to the device until it was re-inserted. But then the I/O handling in that OS at that time is quite different from what Linux does.)

So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is no eagerness to solve the problem simply by "don't clear the dirty bit". For every one Postgres installation that might have a better recovery after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.

I was not aware that flash based media may be as crappy as you hint at.

From my tests with AmigaOS 4.something or AmigaOS 3.9 + 3rd Party Poseidon USB stack the above mechanism worked even with USB sticks. I however did not test this often and I did not check for data corruption after a test.


From:   Andres Freund <[email protected]>
Date:   Tue, 10 Apr 2018 15:07:26 -0700

(Sorry if I screwed up the thread structure - I had to reconstruct the reply-to and CC list from the web archive as I've not found a way to properly download an mbox or such of old content. I was subscribed to fsdevel but not the ext4 list.)

Hi,

2018-04-10 18:43:56 Ted wrote:

I'll try to give as unbiased a description as possible, but certainly some of this is going to be filtered by my own biases no matter how careful I can be.

Same ;)

2018-04-10 18:43:56 Ted wrote:

So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.

That's a bit of a cop out. It's not just databases that care. Even more basic tools like SCM, package managers and editors care whether they can get proper responses back from fsync that imply things actually were synced.

2018-04-10 18:43:56 Ted wrote:

So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is no eagerness to solve the problem simply by "don't clear the dirty bit". For every one Postgres installation that might have a better recovery after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.

I don't think these necessarily are as contradictory goals as you paint them. At least in postgres' case we can deal with the fact that an fsync retry isn't going to fix the problem by reentering crash recovery or just shutting down - therefore we don't need to keep all the dirty buffers around. A per-inode or per-superblock bit that causes further fsyncs to fail would be entirely sufficient for that.

While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.

Both in postgres, and a lot of other applications, it's not at all guaranteed to consistently have one FD open for every file written. Therefore even the more recent per-fd errseq logic doesn't guarantee that the failure will ever be seen by an application diligently fsync()ing.

You'd not even need to have per inode information or such in the case that the block device goes away entirely. As the FS isn't generally unmounted in that case, you could trivially keep a per-mount (or superblock?) bit that says "I died" and set that instead of keeping per inode/whatever information.

2018-04-10 18:43:56 Ted wrote:

If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work.

I find that a bit of a disappointing response. I think it's fair to say that for advanced features, but we're talking about the basic guarantee that fsync actually does something even remotely reasonable.

2018-04-10 19:44:48 Andreas wrote:

The confusion is whether fsync() is a "level" state (return error forever if there were pages that could not be written), or an "edge" state (return error only for any write failures since the previous fsync() call).

I don't think that's the full issue. We can deal with the fact that an fsync failure is edge-triggered if there's a guarantee that every process doing so would get it. The fact that one needs to have an FD open from before any failing writes occurred to get a failure, THAT'S the big issue.

Beyond postgres, it's a pretty common approach to do work on a lot of files without fsyncing, then iterate over the directory, fsync everything, and then assume you're safe. But unless I severely misunderstand something that'd only be safe if you kept an FD for every file open, which isn't realistic for pretty obvious reasons.

2018-04-10 18:43:56 Ted wrote:

I think Anthony Iliopoulos was pretty clear in his multiple descriptions in that thread of why the current behaviour is needed (OOM of the whole system if dirty pages are kept around forever), but many others were stuck on "I can't believe this is happening??? This is totally unacceptable and every kernel needs to change to match my expectations!!!" without looking at the larger picture of what is practical to change and where the issue should best be fixed.

Everyone can participate in discussions...


From:   Andreas Dilger <[email protected]>
Date:   Wed, 11 Apr 2018 15:52:44 -0600

On Apr 10, 2018, at 4:07 PM, Andres Freund [email protected] wrote:

2018-04-10 18:43:56 Ted wrote:

So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.

That's a bit of a cop out. It's not just databases that care. Even more basic tools like SCM, package managers and editors care whether they can get proper responses back from fsync that imply things actually were synced.

Sure, but it is mostly PG that is doing (IMHO) crazy things like writing to thousands(?) of files, closing the file descriptors, then expecting fsync() on a newly-opened fd to return a historical error. If an editor tries to write a file, then calls fsync and gets an error, the user will enter a new pathname and retry the write. The package manager will assume the package installation failed, and uninstall the parts of the package that were already written.

There is no way the filesystem can handle the package manager failure case, and keeping the pages dirty and retrying indefinitely may never work (e.g. disk is dead or disconnected, is a sparse volume without any free space, etc). This (IMHO) implies that the higher layer (which knows more about what the write failure implies) needs to deal with this.

2018-04-10 18:43:56 Ted wrote:

So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is no eagerness to solve the problem simply by "don't clear the dirty bit". For every one Postgres installation that might have a better recovery after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.

I don't think these necessarily are as contradictory goals as you paint them. At least in postgres' case we can deal with the fact that an fsync retry isn't going to fix the problem by reentering crash recovery or just shutting down - therefore we don't need to keep all the dirty buffers around. A per-inode or per-superblock bit that causes further fsyncs to fail would be entirely sufficient for that.

While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.

I think there are two issues here - "fsync() on an fd that was just opened" and "persistent error state (without keeping dirty pages in memory)".

If there is background data writeback without an open file descriptor, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.

Consider if there was a per-inode "there was once an error writing this inode" flag. Then fsync() would return an error on the inode forever, since there is no way in POSIX to clear this state, since it would need to be kept in case some new fd is opened on the inode and does an fsync() and wants the error to be returned.

IMHO, the only alternative would be to keep the dirty pages in memory until they are written to disk. If that was not possible, what then? It would need a reboot to clear the dirty pages, or truncate the file (discarding all data)?

Both in postgres, and a lot of other applications, it's not at all guaranteed to consistently have one FD open for every file written. Therefore even the more recent per-fd errseq logic doesn't guarantee that the failure will ever be seen by an application diligently fsync()ing.

... only if the application closes all fds for the file before calling fsync. If any fd is kept open from the time of the failure, it will return the original error on fsync() (and then no longer return it).

It's not that you need to keep every fd open forever. You could put them into a shared pool, and re-use them if the file is "re-opened", and call fsync on each fd before it is closed (because the pool is getting too big or because you want to flush the data for that file, or shut down the DB). That wouldn't require a huge re-architecture of PG, just a small library to handle the shared fd pool.

That might even improve performance, because opening and closing files is itself not free, especially if you are working with remote filesystems.
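
For illustration, a minimal sketch of such a shared fd pool (names are made up; a real one would hash paths, evict under pressure, and surface fsync failures to the checkpointer):

    /* Keep descriptors open in a small cache and fsync() each one before it
     * is closed, so an fd that predates any write failure is the fd that
     * does the fsync and therefore sees the error. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define POOL_SIZE 64

    struct pooled_fd { char path[256]; int fd; };
    static struct pooled_fd pool[POOL_SIZE];   /* fd == 0 means "slot free" in this sketch */

    int pool_open(const char *path)
    {
        for (int i = 0; i < POOL_SIZE; i++)
            if (pool[i].fd > 0 && strcmp(pool[i].path, path) == 0)
                return pool[i].fd;                    /* reuse the existing descriptor */

        for (int i = 0; i < POOL_SIZE; i++)
            if (pool[i].fd == 0) {
                int fd = open(path, O_RDWR | O_CREAT, 0600);
                if (fd < 0) return -1;
                snprintf(pool[i].path, sizeof(pool[i].path), "%s", path);
                pool[i].fd = fd;
                return fd;
            }
        return -1;                                    /* pool full; a real pool would evict */
    }

    int pool_close_all(void)
    {
        int rc = 0;
        for (int i = 0; i < POOL_SIZE; i++)
            if (pool[i].fd > 0) {
                if (fsync(pool[i].fd) != 0) rc = -1;  /* flush before giving up the fd */
                close(pool[i].fd);
                pool[i].fd = 0;
            }
        return rc;
    }

    int main(void)
    {
        int fd = pool_open("example-file");
        if (fd >= 0 && write(fd, "x", 1) != 1) return 1;
        return pool_close_all() == 0 ? 0 : 1;
    }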

You'd not even need to have per inode information or such in the case that the block device goes away entirely. As the FS isn't generally unmounted in that case, you could trivially keep a per-mount (or superblock?) bit that says "I died" and set that instead of keeping per inode/whatever information.

The filesystem will definitely return an error in this case, I don't think this needs any kind of changes:

    int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
    {
        if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
            return -EIO;

2018-04-10 18:43:56 Ted wrote:

If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work.

I find that a bit of a disappointing response. I think it's fair to say that for advanced features, but we're talking about the basic guarantee that fsync actually does something even remotely reasonable.

Linux (as PG) is run by people who develop it for their own needs, or are paid to develop it for the needs of others. Everyone already has too much work to do, so you need to find someone who has an interest in fixing this (IMHO very peculiar) use case. If PG developers want to add a tunable "keep dirty pages in RAM on IO failure", I don't think that it would be too hard for someone to do. It might be harder to convince some of the kernel maintainers to accept it, and I've been on the losing side of that battle more than once. However, like everything you don't pay for, you can't require someone else to do this for you. It wouldn't hurt to see if Jeff Layton, who wrote the errseq patches, would be interested to work on something like this.

That said, even if a fix was available for Linux tomorrow, it would be years before a majority of users would have it available on their system, that includes even the errseq mechanism that was landed a few months ago. That implies to me that you'd want something that fixes PG now so that it works around whatever (perceived) breakage exists in the Linux fsync() implementation. Since the thread indicates that non-Linux kernels have the same fsync() behaviour, it makes sense to do that even if the Linux fix was available.

2018-04-10 19:44:48 Andreas wrote:

The confusion is whether fsync() is a "level" state (return error forever if there were pages that could not be written), or an "edge" state (return error only for any write failures since the previous fsync() call).

I don't think that's the full issue. We can deal with the fact that an fsync failure is edge-triggered if there's a guarantee that every process doing so would get it. The fact that one needs to have an FD open from before any failing writes occurred to get a failure, THAT'S the big issue.

Beyond postgres, it's a pretty common approach to do work on a lot of files without fsyncing, then iterate over the directory, fsync everything, and then assume you're safe. But unless I severely misunderstand something that'd only be safe if you kept an FD for every file open, which isn't realistic for pretty obvious reasons.

I can't say how common or uncommon such a workload is, though PG is the only application that I've heard of doing it, and I've been working on filesystems for 20 years. I'm a bit surprised that anyone expects fsync() on a newly-opened fd to have any state from write() calls that predate the open. I can understand fsync() returning an error for any IO that happens within the context of that fsync(), but how far should it go back for reporting errors on that file? Forever? The only way to clear the error would be to reboot the system, since I'm not aware of any existing POSIX code to clear such an error.


From:   Dave Chinner <[email protected]>
Date:   Thu, 12 Apr 2018 10:09:16 +1000

On Wed, Apr 11, 2018 at 03:52:44PM -0600, Andreas Dilger wrote:

On Apr 10, 2018, at 4:07 PM, Andres Freund [email protected] wrote:

2018-04-10 18:43:56 Ted wrote:

So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.

That's a bit of a cop out. It's not just databases that care. Even more basic tools like SCM, package managers and editors care whether they can get proper responses back from fsync that imply things actually were synced.

Sure, but it is mostly PG that is doing (IMHO) crazy things like writing to thousands(?) of files, closing the file descriptors, then expecting fsync() on a newly-opened fd to return a historical error.

Yeah, this seems like a recipe for disaster, especially on cross-platform code where every OS platform behaves differently and almost never to expectation.

And speaking of "behaving differently to expectations", nobody has mentioned that close() can also return write errors. Hence if you do write - close - open - fsync, the write error might get reported on close, not fsync. IOWs, the assumption that "async writeback errors will persist across close to open" is fundamentally broken to begin with. It's even documented as a silent data loss vector in the close(2) man page:

$ man 2 close
.....
   Dealing with error returns from close()

	  A careful programmer will check the return value of
	  close(), since it is quite possible that  errors  on  a
	  previous  write(2)  operation  are reported  only on the
	  final close() that releases the open file description.
	  Failing to check the return value when closing a file may
	  lead to silent loss of data.  This can especially be
	  observed with NFS and with disk quota.

Yeah, ensuring data integrity in the face of IO errors is a really hard problem. :/
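To make the man page's advice concrete, here is a minimal editor's sketch (not code from the thread) of the "careful programmer" pattern: check every call that can surface a deferred write error, including fsync() and close(). The function name write_file_carefully is hypothetical, and the short-write handling is deliberately simplistic.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Sketch: write a buffer to a file and check every call that can
     * report a deferred writeback error. */
    static int write_file_carefully(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;

        ssize_t n = write(fd, buf, len);
        if (n < 0 || (size_t)n != len)   /* a real program would retry short writes */
            goto fail;

        /* Flush the data before trusting that it is durable. */
        if (fsync(fd) < 0)
            goto fail;

        /* close() can still report a deferred write error, e.g. on NFS. */
        if (close(fd) < 0)
            return -1;
        return 0;

    fail:
        close(fd);   /* best effort; the operation already failed */
        return -1;
    }

    int main(void)
    {
        const char msg[] = "hello\n";
        if (write_file_carefully("out.txt", msg, sizeof msg - 1) < 0) {
            perror("write_file_carefully");
            return 1;
        }
        return 0;
    }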

To pound the broken record: there are many good reasons why Linux filesystem developers have said "you should use direct IO" to the PG devs each time we have this "the kernel doesn't do [complex things PG needs]" discussion.

In this case, robust IO error reporting is easy with DIO. It's one of the reasons most of the high performance database engines are either using or moving to non-blocking AIO+DIO (RWF_NOWAIT) and use O_DSYNC/RWF_DSYNC for integrity-critical IO dispatch. This is also being driven by the availability of high performance, high IOPS solid state storage where buffering in RAM to optimise IO patterns and throughput provides no real performance benefit.

Using the AIO+DIO infrastructure ensures errors are reported for the specific write that fails at failure time (i.e. in the aio completion event for the specific IO), yet high IO throughput can be maintained without the application needing its own threading infrastructure to prevent blocking.

This means the application doesn't have to guess where the write error occurred to retry/recover, have to handle async write errors on close(), have to use fsync() to gather write IO errors and then infer where the IO failure was, or require kernels on every supported platform to jump through hoops to try to do exactly the right thing in error conditions for everyone in all circumstances at all times....
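As a rough illustration of the synchronous error reporting Chinner describes, here is an editor's sketch (not from the thread) using pwritev2() with RWF_DSYNC on an O_DIRECT descriptor. It is Linux-specific, and the 4096-byte alignment and the file name "datafile" are assumptions; real alignment requirements depend on the device and filesystem.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Sketch: with O_DIRECT + RWF_DSYNC, a write either returns having
     * been made durable, or returns an error for that specific IO --
     * no later fsync() is needed to discover the failure. */
    int main(void)
    {
        int fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096)) { perror("posix_memalign"); return 1; }
        memset(buf, 'x', 4096);

        struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

        /* RWF_DSYNC gives this write O_DSYNC semantics: it does not
         * return success until the data has reached stable storage. */
        ssize_t n = pwritev2(fd, &iov, 1, 0, RWF_DSYNC);
        if (n < 0)
            perror("pwritev2");   /* the error names the failing IO */

        free(buf);
        close(fd);
        return n == 4096 ? 0 : 1;
    }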


From:   Andres Freund <[email protected]>
Date:   Wed, 11 Apr 2018 19:17:52 -0700

On 2018-04-11 15:52:44 -0600, Andreas Dilger wrote:

On Apr 10, 2018, at 4:07 PM, Andres Freund [email protected] wrote:

2018-04-10 18:43:56 Ted wrote:

So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.

That's a bit of a cop out. It's not just databases that care. Even more basic tools like SCM, package managers and editors care whether they can get proper responses back from fsync that imply things actually were synced.

Sure, but it is mostly PG that is doing (IMHO) crazy things like writing to thousands(?) of files, closing the file descriptors, then expecting fsync() on a newly-opened fd to return a historical error.

It's not just postgres. dpkg (underlying apt, on Debian-derived distros), to take an example I just randomly guessed, does too:

  /* We want to guarantee the extracted files are on the disk, so that the
   * subsequent renames to the info database do not end up with old or zero
   * length files in case of a system crash. As neither dpkg-deb nor tar do
   * explicit fsync()s, we have to do them here.
   * XXX: This could be avoided by switching to an internal tar extractor. */
  dir_sync_contents(cidir);

(a bunch of other places too)

Especially on ext3, but also on newer filesystems, it's performance-wise entirely infeasible to fsync() every single file individually - the performance becomes entirely atrocious if you do that.

I think there are some legitimate arguments that a database should use direct IO (more on that as a reply to David), but claiming that all sorts of random utilities need to use DIO, with their own buffering etc., is just insane.

If an editor tries to write a file, then calls fsync and gets an error, the user will enter a new pathname and retry the write. The package manager will assume the package installation failed, and uninstall the parts of the package that were already written.

Except that they won't notice that they got a failure, at least in the dpkg case. And happily continue installing corrupted data.

There is no way the filesystem can handle the package manager failure case, and keeping the pages dirty and retrying indefinitely may never work (e.g. disk is dead or disconnected, is a sparse volume without any free space, etc). This (IMHO) implies that the higher layer (which knows more about what the write failure implies) needs to deal with this.

Yea, I agree that'd not be sane. As far as I understand the dpkg code (all of 10min reading it), that'd also be unnecessary. It can abort the installation, but only if it detects the error. Which isn't happening.

While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.

I think there are two issues here - "fsync() on an fd that was just opened" and "persistent error state (without keeping dirty pages in memory)".

If there is background data writeback without an open file descriptor, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.

And that's horrible. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.

Or even more extreme, you untar/zip/git clone a directory. Then do a sync. And you don't know whether anything actually succeeded.

Consider if there was a per-inode "there was once an error writing this inode" flag. Then fsync() would return an error on the inode forever, since there is no way in POSIX to clear this state, since it would need to be kept in case some new fd is opened on the inode and does an fsync() and wants the error to be returned.

The data in the file also is corrupt. Having to unmount or delete the file to reset the fact that it can't safely be assumed to be on disk isn't insane.

Both in postgres, and a lot of other applications, it's not at all guaranteed to consistently have one FD open for every file written. Therefore even the more recent per-fd errseq logic doesn't guarantee that the failure will ever be seen by an application diligently fsync()ing.

... only if the application closes all fds for the file before calling fsync. If any fd is kept open from the time of the failure, it will return the original error on fsync() (and then no longer return it).

It's not that you need to keep every fd open forever. You could put them into a shared pool, and re-use them if the file is "re-opened", and call fsync on each fd before it is closed (because the pool is getting too big or because you want to flush the data for that file, or shut down the DB). That wouldn't require a huge re-architecture of PG, just a small library to handle the shared fd pool.

Except that postgres uses multiple processes. And works on a lot of architectures. If we started to fsync all opened files on process exit our users would lynch us. We'd need a complicated scheme that sends file descriptors across sockets between processes, then deduplicates them on the receiving side, somehow figuring out which are the oldest file descriptors (handling clock drift safely).

Note that it'd be perfectly fine that we've "thrown away" the buffer contents if we'd get notified that the fsync failed. We could just do WAL replay, and restore the contents (just as we do after crashes and/or for replication).

That might even improve performance, because opening and closing files is itself not free, especially if you are working with remote filesystems.

There's already a per-process cache of open files.
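For concreteness, here is an editor's sketch of the shared fd pool Dilger suggests above: cache one fd per file, reuse it on "re-open", and fsync before evicting so the fd that fsyncs always predates any write failure. The names pool_get and pool_evict are hypothetical, it is single-process only, and the multi-process problem Freund raises is exactly what it does not address.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define POOL_SIZE 64

    struct pooled_fd { char path[256]; int fd; };
    static struct pooled_fd pool[POOL_SIZE];

    /* Return a cached fd for path, opening (and caching) it if needed. */
    int pool_get(const char *path)
    {
        int free_slot = -1;
        for (int i = 0; i < POOL_SIZE; i++) {
            if (pool[i].fd > 0 && strcmp(pool[i].path, path) == 0)
                return pool[i].fd;
            if (pool[i].fd == 0 && free_slot < 0)
                free_slot = i;
        }
        if (free_slot < 0)
            return -1;              /* pool full; a real pool would evict */
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd < 0)
            return -1;
        snprintf(pool[free_slot].path, sizeof(pool[free_slot].path), "%s", path);
        pool[free_slot].fd = fd;
        return fd;
    }

    /* Evict one slot: fsync first, so any writeback error is observed
     * before the fd (and its error state) is thrown away. */
    int pool_evict(int slot)
    {
        int err = 0;
        if (pool[slot].fd > 0) {
            if (fsync(pool[slot].fd) < 0)
                err = -1;
            if (close(pool[slot].fd) < 0)
                err = -1;
            pool[slot].fd = 0;
        }
        return err;
    }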

You'd not even need to have per inode information or such in the case that the block device goes away entirely. As the FS isn't generally unmounted in that case, you could trivially keep a per-mount (or superblock?) bit that says "I died" and set that instead of keeping per inode/whatever information.

The filesystem will definitely return an error in this case, I don't think this needs any kind of changes:

  int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
  {
          if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
                  return -EIO;

Well, I'm making that argument because several people argued that throwing away buffer contents in this case is the only way to not cause OOMs, and that that's incompatible with reporting errors. It's clearly not...

2018-04-10 18:43:56 Ted wrote:

If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work.

I find that a bit of a disappointing response. I think it's fair to say that for advanced features, but we're talking about the basic guarantee that fsync actually does something even remotely reasonable.

Linux (as PG) is run by people who develop it for their own needs, or are paid to develop it for the needs of others.

Sure.

Everyone already has too much work to do, so you need to find someone who has an interest in fixing this (IMHO very peculiar) use case. If PG developers want to add a tunable "keep dirty pages in RAM on IO failure", I don't think that it would be too hard for someone to do. It might be harder to convince some of the kernel maintainers to accept it, and I've been on the losing side of that battle more than once. However, like everything you don't pay for, you can't require someone else to do this for you. It wouldn't hurt to see if Jeff Layton, who wrote the errseq patches, would be interested to work on something like this.

I don't think this is that PG specific, as explained above.


From:   Andres Freund <[email protected]>
Date:   Wed, 11 Apr 2018 19:32:21 -0700

Hi,

On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:

To pound the broken record: there are many good reasons why Linux filesystem developers have said "you should use direct IO" to the PG devs each time we have this "the kernel doesn't do [complex things PG needs]" discussion.

I personally am on board with doing that. But you also gotta recognize that efficient DIO usage is a metric ton of work, and you need a large amount of differing logic for different platforms. It's just not realistic to do so for every platform. Postgres is developed by a small number of people, isn't VC backed etc. The amount of resources we can throw at something is fairly limited. I'm hoping to work on adding linux DIO support to pg, but I'm sure as hell not going to be able to do the same on windows (solaris, hpux, aix, ...) etc.

And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea. Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).

In this case, robust IO error reporting is easy with DIO. It's one of the reasons most of the high performance database engines are either using or moving to non-blocking AIO+DIO (RWF_NOWAIT) and use O_DSYNC/RWF_DSYNC for integrity-critical IO dispatch. This is also being driven by the availability of high performance, high IOPS solid state storage where buffering in RAM to optimise IO patterns and throughput provides no real performance benefit.

Using the AIO+DIO infrastructure ensures errors are reported for the specific write that fails at failure time (i.e. in the aio completion event for the specific IO), yet high IO throughput can be maintained without the application needing its own threading infrastructure to prevent blocking.

This means the application doesn't have to guess where the write error occurred to retry/recover, have to handle async write errors on close(), have to use fsync() to gather write IO errors and then infer where the IO failure was, or require kernels on every supported platform to jump through hoops to try to do exactly the right thing in error conditions for everyone in all circumstances at all times....

Most of that sounds like a good thing to do, but you got to recognize that that's a lot of linux specific code.


From:   Andres Freund <[email protected]>
Date:   Wed, 11 Apr 2018 19:51:13 -0700

Hi,

On 2018-04-11 19:32:21 -0700, Andres Freund wrote:

And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea. Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).

And before somebody argues that that's too small a window to realistically trigger the problem: restoring large databases happens pretty commonly (for new replicas, testcases, or actual fatal issues), takes time, and it's where a lot of storage is actually written to for the first time in a while, so it's far from unlikely to trigger bad block errors or such.


From:   Matthew Wilcox <[email protected]>
Date:   Wed, 11 Apr 2018 20:02:48 -0700

On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:

While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.

I think there are two issues here - "fsync() on an fd that was just opened" and "persistent error state (without keeping dirty pages in memory)".

If there is background data writeback without an open file descriptor, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.

And that's horrible. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.

At the moment, when we open a file, we sample the current state of the writeback error and only report new errors. We could set it to zero instead, and report the most recent error as soon as anything happens which would report an error. That way err = close(open("file")); would report the most recent error.

That's not going to be persistent across the data structure for that inode being removed from memory; we'd need filesystem support for persisting that. But maybe it's "good enough" to only support it for recent files.

Jeff, what do you think?
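To spell out the probe Wilcox describes, here is an editor's sketch of how the err = close(open("file")) pattern would be used under his proposed semantics. This is not what kernels contemporary with the thread guaranteed; the function name is hypothetical.

    #include <fcntl.h>
    #include <unistd.h>

    /* Under the proposed "reset the error cursor on open" behaviour,
     * opening and immediately closing a file would report the most
     * recent writeback error for that inode. Illustration only. */
    int probe_writeback_error(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        return close(fd);   /* would surface the sampled writeback error */
    }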


From:   "Theodore Y. Ts'o" <[email protected]>
Date:   Thu, 12 Apr 2018 01:09:24 -0400

On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:

Most of that sounds like a good thing to do, but you got to recognize that that's a lot of linux specific code.

I know it's not what PG has chosen, but realistically all of the other major databases and userspace based storage systems have used DIO precisely because it's the way to avoid OS-specific behavior or require OS-specific code. DIO is simple, and pretty much the same everywhere.

In contrast, the exact details of how buffered I/O works can be quite different on different OS's. This is especially true if you take into account performance-related details (e.g., the cleaning algorithm, how pages get chosen for eviction, etc.).

As I read the PG-hackers thread, I thought I saw acknowledgement that some of the behaviors you don't like with Linux also show up on other Unix or Unix-like systems?


From:   "Theodore Y. Ts'o" <[email protected]>
Date:   Thu, 12 Apr 2018 01:34:45 -0400

On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:

If there is background data writeback without an open file descriptor, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.

And that's horrible. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.

If there is no open file descriptor, and in many cases, no process (because it has already exited), it may be horrible, but what the h*ll else do you expect the OS to do?

The solution we use at Google is that we watch for I/O errors using a completely different process that is responsible for monitoring machine health. It used to scrape dmesg, but we now arrange to have I/O errors get sent via a netlink channel to the machine health monitoring daemon. If it detects errors on a particular hard drive, it tells the cluster file system to stop using that disk, and to reconstruct from erasure code all of the data chunks on that disk onto other disks in the cluster. We then run a series of disk diagnostics to make sure we find all of the bad sectors (very often, where there is one bad sector, there are several more waiting to be found), and then afterwards, put the disk back into service.

By making it be a separate health monitoring process, we can have HDD experts write much more sophisticated code that can ask the disk firmware for more information (e.g., SMART, the grown defect list), do much more careful scrubbing of the disk media, etc., before returning the disk back to service.

Everyone already has too much work to do, so you need to find someone who has an interest in fixing this (IMHO very peculiar) use case. If PG developers want to add a tunable "keep dirty pages in RAM on IO failure", I don't think that it would be too hard for someone to do. It might be harder to convince some of the kernel maintainers to accept it, and I've been on the losing side of that battle more than once. However, like everything you don't pay for, you can't require someone else to do this for you. It wouldn't hurt to see if Jeff Layton, who wrote the errseq patches, would be interested to work on something like this.

I don't think this is that PG specific, as explained above.

The reality is that recovering from disk errors is tricky business, and I very much doubt most userspace applications, including distro package managers, are going to want to engineer for trying to detect and recover from disk errors. If that were true, then Red Hat and/or SuSE have kernel engineers, and they would have implemented everything on your wish list. They haven't, and that should tell you something.

The other reality is that once a disk starts developing errors, in reality you will probably need to take the disk off-line, scrub it to find any other media errors, and there's a good chance you'll need to rewrite bad sectors (including some which are on top of file system metadata, so you probably will have to run fsck or reformat the whole file system). I certainly don't think it's realistic to assume adding lots of sophistication to each and every userspace program.

If you have tens or hundreds of thousands of disk drives, then you will need to do something automated, but I claim that you really don't want to smush all of that detailed exception handling and HDD repair technology into each database or cluster file system component. It really needs to be done in a separate health-monitor and machine-level management system.


From:   Dave Chinner <[email protected]>
Date:   Thu, 12 Apr 2018 15:45:36 +1000

On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:

Hi,

On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:

To pound the broken record: there are many good reasons why Linux filesystem developers have said "you should use direct IO" to the PG devs each time we have this "the kernel doesn't do [complex things PG needs]" discussion.

I personally am on board with doing that. But you also gotta recognize that efficient DIO usage is a metric ton of work, and you need a large amount of differing logic for different platforms. It's just not realistic to do so for every platform. Postgres is developed by a small number of people, isn't VC backed etc. The amount of resources we can throw at something is fairly limited. I'm hoping to work on adding linux DIO support to pg, but I'm sure as hell not going to be able to do the same on windows (solaris, hpux, aix, ...) etc.

And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea.

Yes it is.

This is what syncfs() is for - making sure a large amount of data and metadata spread across many files and subdirectories in a single filesystem is pushed to stable storage in the most efficient manner possible.

Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).

No. Just saying that fsyncing individual files and directories is about the most inefficient way you could possibly go about doing this.


From:   Lukas Czerner <[email protected]>
Date:   Thu, 12 Apr 2018 12:19:26 +0200

On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:

And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea. Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).

Does not seem like a problem to me, just checksum the thing if you really need to be extra safe. You should probably be doing it anyway if you backup / archive / timetravel / whatnot.


From:   Jeff Layton <[email protected]>
Date:   Thu, 12 Apr 2018 07:09:14 -0400

On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:

On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:

While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.

I think there are two issues here - "fsync() on an fd that was just opened" and "persistent error state (without keeping dirty pages in memory)".

If there is background data writeback without an open file descriptor, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.

And that's horrible. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.

What are you expecting to happen in this case? Are you expecting a read error due to a writeback failure? Or are you just saying that we should be invalidating pages that failed to be written back, so that they can be re-read?

At the moment, when we open a file, we sample the current state of the writeback error and only report new errors. We could set it to zero instead, and report the most recent error as soon as anything happens which would report an error. That way err = close(open("file")); would report the most recent error.

That's not going to be persistent across the data structure for that inode being removed from memory; we'd need filesystem support for persisting that. But maybe it's "good enough" to only support it for recent files.

Jeff, what do you think?

I hate it :). We could do that, but....yecchhhh.

Reporting errors only in the case where the inode happened to stick around in the cache seems too unreliable for real-world usage, and might be problematic for some use cases. I'm also not sure it would really be helpful.

I think the crux of the matter here is not really about error reporting, per-se. I asked this at LSF last year, and got no real answer:

When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?

One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.

Maybe that's ok in the face of a writeback error though? IDK.


From:   Matthew Wilcox <[email protected]>
Date:   Thu, 12 Apr 2018 04:19:48 -0700

On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:

On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:

At the moment, when we open a file, we sample the current state of the writeback error and only report new errors. We could set it to zero instead, and report the most recent error as soon as anything happens which would report an error. That way err = close(open("file")); would report the most recent error.

That's not going to be persistent across the data structure for that inode being removed from memory; we'd need filesystem support for persisting that. But maybe it's "good enough" to only support it for recent files.

Jeff, what do you think?

I hate it :). We could do that, but....yecchhhh.

Reporting errors only in the case where the inode happened to stick around in the cache seems too unreliable for real-world usage, and might be problematic for some use cases. I'm also not sure it would really be helpful.

Yeah, it's definitely half-arsed. We could make further changes to improve the situation, but they'd have wider impact. For example, we can tell if the error has been sampled by any existing fd, so we could bias our inode reaping to have inodes with unreported errors stick around in the cache for longer.

I think the crux of the matter here is not really about error reporting, per-se. I asked this at LSF last year, and got no real answer:

When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?

I suspect it isn't. If there's a transient error then we should reattempt the write. OTOH if the error is permanent then reattempting the write isn't going to do any good and it's just going to cause the drive to go through the whole error handling dance again. And what do we do if we're low on memory and need these pages back to avoid going OOM? There's a lot of options here, all of them bad in one situation or another.

One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.

Maybe that's ok in the face of a writeback error though? IDK.

I don't know either. It'd force the application to face up to the fact that the data is gone immediately rather than only finding it out after a reboot. Again though that might cause more problems than it solves. It's hard to know what the right thing to do is.


From:   Jeff Layton <[email protected]>
Date:   Thu, 12 Apr 2018 07:24:12 -0400

On Thu, 2018-04-12 at 15:45 +1000, Dave Chinner wrote:

On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:

Hi,

On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:

To pound the broken record: there are many good reasons why Linux filesystem developers have said "you should use direct IO" to the PG devs each time we have this "the kernel doesn't do [complex things PG needs]" discussion.

I personally am on board with doing that. But you also gotta recognize that efficient DIO usage is a metric ton of work, and you need a large amount of differing logic for different platforms. It's just not realistic to do so for every platform. Postgres is developed by a small number of people, isn't VC backed etc. The amount of resources we can throw at something is fairly limited. I'm hoping to work on adding linux DIO support to pg, but I'm sure as hell not going to be able to do the same on windows (solaris, hpux, aix, ...) etc.

And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea.

Yes it is.

This is what syncfs() is for - making sure a large amount of data and metadata spread across many files and subdirectories in a single filesystem is pushed to stable storage in the most efficient manner possible.

Just note that the error return from syncfs is somewhat iffy. It doesn't necessarily return an error when one inode fails to be written back. I think it mainly returns errors when you get a metadata writeback error.

Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).

No. Just saying that fsyncing individual files and directories is about the most inefficient way you could possibly go about doing this.

You can still use syncfs but what you'd probably have to do is call syncfs while you still hold all of the fd's open, and then fsync each one afterward to ensure that they all got written back properly. That should work as you'd expect.
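A rough editor's sketch of the pattern Layton describes: batch the writeback with one syncfs() call, then fsync() each still-open fd to collect per-file errors. syncfs(2) is Linux-specific, and the function name sync_batch is hypothetical.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>

    /* fds: still-open descriptors for every file we copied/extracted. */
    int sync_batch(int *fds, int nfds)
    {
        int err = 0;

        if (nfds == 0)
            return 0;

        /* One filesystem-wide writeback pass, using any fd on that fs. */
        if (syncfs(fds[0]) < 0)
            err = -1;

        /* Now collect per-file writeback errors. Because the fds were
         * open before the writes, each fsync() sees any failure that
         * happened since its fd was opened. */
        for (int i = 0; i < nfds; i++) {
            if (fsync(fds[i]) < 0) {
                perror("fsync");
                err = -1;
            }
        }
        return err;
    }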


From:   Dave Chinner <[email protected]>
Date:   Thu, 12 Apr 2018 22:01:22 +1000

On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:

When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?

There isn't a right thing. Whatever we do will be wrong for someone.

One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.

Not to mention a POSIX IO ordering violation. Seeing stale data after a "successful" write is simply not allowed.

Maybe that's ok in the face of a writeback error though? IDK.

No matter what we do for async writeback error handling, it will be slightly different from filesystem to filesystem, not to mention OS to OS. There is no magic bullet here, so I'm not sure we should worry too much. There's direct IO for anyone who needs to know about the completion status of every single write IO....


From:   "Theodore Y. Ts'o" <[email protected]>
Date:   Thu, 12 Apr 2018 11:16:46 -0400

On Thu, Apr 12, 2018 at 10:01:22PM +1000, Dave Chinner wrote:

On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:

When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?

There isn't a right thing. Whatever we do will be wrong for someone.

That's the problem. The best that could be done (and it's not enough) would be to have a mode which does what the PG folks want (or what they think they want). It seems what they want is to not have an error result in the page being marked clean. When they discover the outcome (OOM-city and the inability to unmount a file system on a failed drive), then they will complain to us again, at which point we can tell them that what they really want is another variation on O_PONIES, and welcome to the real world and real life.

Which is why, even if they were to pay someone to implement what they want, I'm not sure we would want to accept it upstream --- or distros might consider it a support nightmare, and refuse to allow that mode to be enabled on enterprise distros. But at least, it will have been some PG-based company who will have implemented it, so they're not wasting other people's time or other people's resources...

We could try to get something like what Google is doing upstream, which is to have the I/O errors sent to userspace via a netlink channel (without changing anything else about how buffered writeback is handled in the face of errors). Then userspace applications could switch to Direct I/O like all of the other really serious userspace storage solutions I'm aware of, and then someone could try to write some kind of HDD health monitoring system that tries to do the right thing when a disk is discovered to have developed some media errors or something more serious (e.g., a head failure). That plus some kind of RAID solution is I think the only thing which is really realistic for a typical PG site.

That's certainly what I would do if I didn't decide to use a hosted cloud solution, such as Cloud SQL for Postgres, and let someone else solve the really hard problems of dealing with real-world HDD failures. :-)


From:   Jeff Layton <[email protected]>
Date:   Thu, 12 Apr 2018 11:08:50 -0400

On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote:

On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:

When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?

There isn't a right thing. Whatever we do will be wrong for someone.

One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.

Not to mention a POSIX IO ordering violation. Seeing stale data after a "successful" write is simply not allowed.

I'm not so sure here, given that we're dealing with an error condition. Are we really obligated not to allow any changes to pages that we can't write back?

Given that the pages are clean after these failures, we aren't doing this even today:

Suppose we're unable to do writes but can do reads vs. the backing store. After a wb failure, the page has the dirty bit cleared. If it gets kicked out of the cache before the read occurs, it'll have to be faulted back in. Poof -- your write just disappeared.

That can even happen before you get the chance to call fsync, so even a write()+read()+fsync() is not guaranteed to be safe in this regard today, given sufficient memory pressure.

I think the current situation is fine from a "let's not OOM at all costs" standpoint, but not so good for application predictability. We should really consider ways to do better here.

Maybe that's ok in the face of a writeback error though? IDK.

No matter what we do for async writeback error handling, it will be slightly different from filesystem to filesystem, not to mention OS to OS. There is no magic bullet here, so I'm not sure we should worry too much. There's direct IO for anyone who needs to know about the completion status of every single write IO....

I think we have an opportunity here to come up with better defined and hopefully more useful behavior for buffered I/O in the face of writeback errors. The first step would be to hash out what we'd want it to look like.

Maybe we need a plenary session at LSF/MM?


From:   Andres Freund <[email protected]>
Date:   Thu, 12 Apr 2018 12:46:27 -0700

Hi,

On 2018-04-12 12:19:26 +0200, Lukas Czerner wrote:

On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:

And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea. Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).

Does not seem like a problem to me, just checksum the thing if you really need to be extra safe. You should probably be doing it anyway if you backup / archive / timetravel / whatnot.

That doesn't really help, unless you want to sync() and then re-read all the data to make sure it's the same. Rereading multi-TB backups just to know whether there was an error that the OS knew about isn't particularly fun. Without verifying after sync it's not going to improve the situation measurably, you're still only going to discover that $data isn't available when it's needed.

What you're saying here is that there's no way to use standard linux tools to manipulate files and know whether it failed, without filtering kernel logs for IO errors. Or am I missing something?


From:   Andres Freund <[email protected]>
Date:   Thu, 12 Apr 2018 12:55:36 -0700

Hi,

On 2018-04-12 01:34:45 -0400, Theodore Y. Ts'o wrote:

The solution we use at Google is that we watch for I/O errors using a completely different process that is responsible for monitoring machine health. It used to scrape dmesg, but we now arrange to have I/O errors get sent via a netlink channel to the machine health monitoring daemon.

Any pointers to the underlying netlink mechanism? If we can force postgres to kill itself when such an error is detected (via a dedicated monitoring process), I'd personally be happy enough. It'd be nicer if we could associate that knowledge with particular filesystems etc (which'd possibly be hard through dm etc), but this'd be much better than nothing.

The reality is that recovering from disk errors is tricky business, and I very much doubt most userspace applications, including distro package managers, are going to want to engineer for trying to detect and recover from disk errors. If that were true, then Red Hat and/or SuSE have kernel engineers, and they would have implemented everything on your wish list. They haven't, and that should tell you something.

The problem really isn't about recovering from disk errors. Knowing about them is the crucial part. We do not want to give back clients the information that an operation succeeded, when it actually didn't. There could be improvements above that, but as long as it's guaranteed that "we" get the error (rather than just some kernel log we don't have access to, which looks different due to config etc), it's ok. We can throw our hands up in the air and give up.

The other reality is that once a disk starts developing errors, in reality you will probably need to take the disk off-line, scrub it to find any other media errors, and there's a good chance you'll need to rewrite bad sectors (including some which are on top of file system metadata, so you probably will have to run fsck or reformat the whole file system). I certainly don't think it's realistic to assume adding lots of sophistication to each and every userspace program.

If you have tens or hundreds of thousands of disk drives, then you will need to do something automated, but I claim that you really don't want to smush all of that detailed exception handling and HDD repair technology into each database or cluster file system component. It really needs to be done in a separate health-monitor and machine-level management system.

Yea, agreed on all that. I don't think anybody actually involved in postgres wants to do anything like that. Seems far outside of postgres' remit.


From:   Andres Freund <[email protected]>
Date:   Thu, 12 Apr 2018 13:13:22 -0700

Hi,

On 2018-04-12 11:16:46 -0400, Theodore Y. Ts'o wrote:

That's the problem. The best that could be done (and it's not enough) would be to have a mode which does what the PG folks want (or what they think they want). It seems what they want is to not have an error result in the page being marked clean. When they discover the outcome (OOM-city and the inability to unmount a file system on a failed drive), then they will complain to us again, at which point we can tell them that what they really want is another variation on O_PONIES, and welcome to the real world and real life.

I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient. I don't see how that'd realistically trigger OOM or the inability to unmount a filesystem. If the drive is entirely gone there's obviously no point in keeping per-file information around, so per-blockdev/fs information suffices entirely to return an error on fsync (which at least on ext4 appears to happen if the underlying blockdev is gone).

Have fun making up things we want, but I'm not sure it's particularly productive.

Which is why, even if they were to pay someone to implement what they want, I'm not sure we would want to accept it upstream --- or distros might consider it a support nightmare, and refuse to allow that mode to be enabled on enterprise distros. But at least, it will have been some PG-based company who will have implemented it, so they're not wasting other people's time or other people's resources...

Well, that's why I'm discussing here so we can figure out what's acceptable before considering wasting money and review cycles doing or paying somebody to do some crazy useless shit.

We could try to get something like what Google is doing upstream, which is to have the I/O errors sent to userspace via a netlink channel (without changing anything else about how buffered writeback is handled in the face of errors).

Ah, darn. After you'd mentioned that in an earlier mail I'd hoped that'd be upstream. And yes, that'd be perfect.

Then userspace applications could switch to Direct I/O like all of the other really serious userspace storage solutions I'm aware of, and then someone could try to write some kind of HDD health monitoring system that tries to do the right thing when a disk is discovered to have developed some media errors or something more serious (e.g., a head failure). That plus some kind of RAID solution is I think the only thing which is really realistic for a typical PG site.

As I said earlier, I think there's good reason to move to DIO for postgres. But to keep that performant is going to need some serious work.

But afaict such a solution wouldn't really depend on applications using DIO or not. Before finishing a checkpoint (logging it persistently and allowing to throw older data away), we could check if any errors have been reported and give up if there have been any. And after starting postgres on a directory restored from backup using $tool, we can fsync the directory recursively, check for such errors, and give up if there've been any.


From:   Andres Freund <[email protected]>
Date:   Thu, 12 Apr 2018 13:24:57 -0700

On 2018-04-12 07:09:14 -0400, Jeff Layton wrote:

On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:

On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:

While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.

I think there are two issues here - "fsync() on an fd that was just opened" and "persistent error state (without keeping dirty pages in memory)".

If there is background data writeback without an open file descriptor, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.

And that's horrible. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.

What are you expecting to happen in this case? Are you expecting a read error due to a writeback failure? Or are you just saying that we should be invalidating pages that failed to be written back, so that they can be re-read?

Yes, I'd hope for a read error after a writeback failure. I think that's sane behaviour. But I don't really care that much.

At the very least some way to know that such a failure occurred from userland without having to parse the kernel log. As far as I understand, neither sync(2) (and thus sync(1)) nor syncfs(2) is guaranteed to report an error if it was encountered by writeback in the background.

If that's indeed true for syncfs(2), even if the fd has been opened before (which I can see how it could happen from an implementation POV, nothing would associate a random FD with failures on different files), it's really impossible to detect this stuff from userland without text parsing.

Even if it were just a per-fs /sys/$something file that'd return the current count of unreported errors in a filesystem-independent way, it'd be better than what we have right now.

1) figure out /sys/$whatnot $directory belongs to
2) oldcount=$(cat /sys/$whatnot/unreported_errors)
3) filesystem operations in $directory
4) sync;sync;
5) newcount=$(cat /sys/$whatnot/unreported_errors)
6) test "$oldcount" -eq "$newcount" || die-with-horrible-message

Isn't beautiful to script, but it's also not absolutely terrible.


From:   Matthew Wilcox <[email protected]>
Date:   Thu, 12 Apr 2018 13:28:30 -0700

On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:

I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.

Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.

I don't see how that'd realistically trigger OOM or the inability to unmount a filesystem.

Ted's referring to the current state of affairs where the writeback error is held in the inode; if we can't evict the inode because it's holding the error indicator, that can send us OOM. If instead we transfer the error indicator to the superblock, then there's no problem.
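If the per-superblock wb_err idea sketched above were adopted, the userspace side could look roughly like the following. This is an editor's sketch of the *proposed* behaviour, not something guaranteed by kernels contemporary with this thread (a change along these lines was merged into Linux later); the file names and usage are assumptions.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical usage of the proposed per-superblock error cursor:
     * open a directory fd before the bulk operation, run arbitrary
     * non-fsync-aware tools, then syncfs() and treat -EIO/-ENOSPC as
     * "some writeback on this filesystem failed since the open". */
    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s <dir>\n", argv[0]); return 2; }

        /* Sample the superblock's error cursor by opening an fd up front. */
        int dirfd = open(argv[1], O_RDONLY | O_DIRECTORY);
        if (dirfd < 0) { perror("open"); return 2; }

        /* ... untar / cp -r / run whatever tool here ... */

        /* Under the proposal, this reports any writeback error on the
         * filesystem that occurred after the open() above. */
        if (syncfs(dirfd) < 0) {
            perror("syncfs");
            return 1;
        }
        close(dirfd);
        return 0;
    }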


From:   Andres Freund <[email protected]>
Date:   Thu, 12 Apr 2018 14:11:45 -0700

On 2018-04-12 07:24:12 -0400, Jeff Layton wrote:

On Thu, 2018-04-12 at 15:45 +1000, Dave Chinner wrote:

On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:

Hi,

On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:

To pound the broken record: there are many good reasons why Linux filesystem developers have said "you should use direct IO" to the PG devs each time we have this "the kernel doesn't do [complex things PG needs]" discussion.

I personally am on board with doing that. But you also gotta recognize that efficient DIO usage is a metric ton of work, and you need a large amount of differing logic for different platforms. It's just not realistic to do so for every platform. Postgres is developed by a small number of people, isn't VC backed etc. The amount of resources we can throw at something is fairly limited. I'm hoping to work on adding linux DIO support to pg, but I'm sure as hell not going to be able to do the same on windows (solaris, hpux, aix, ...) etc.

And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea.

Yes it is.

This is what syncfs() is for - making sure a large amount of data and metadata spread across many files and subdirectories in a single filesystem is pushed to stable storage in the most efficient manner possible.

syncfs isn't standardized, it operates on an entire filesystem (thus writing out unnecessary stuff), and it has no meaningful documentation of its return codes. Yes, using syncfs() might be better performance-wise, but it doesn't seem like it actually solves anything, performance aside:

Just note that the error return from syncfs is somewhat iffy. It doesn't necessarily return an error when one inode fails to be written back. I think it mainly returns errors when you get a metadata writeback error.

You can still use syncfs but what you'd probably have to do is call syncfs while you still hold all of the fd's open, and then fsync each one afterward to ensure that they all got written back properly. That should work as you'd expect.

Which again doesn't allow one to use any non-bespoke tooling (like tar or whatnot). And it means you'll have to call syncfs() every few hundred files, because you'll obviously run into filehandle limitations.


From:   Jeff Layton <[email protected]>
Date:   Thu, 12 Apr 2018 17:14:54 -0400

On Thu, 2018-04-12 at 13:28 -0700, Matthew Wilcox wrote:

On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:

I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.

Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.

Not a bad idea and shouldn't be too costly. mapping_set_error could flag the superblock one before or after the one in the mapping.

We'd need to define what happens if you interleave fsync and syncfs calls on the same inode though. How do we handle file->f_wb_err in that case? Would we need a second field in struct file to act as the per-sb error cursor?

I don't see how that'd realistically trigger OOM or the inability to unmount a filesystem.

Ted's referring to the current state of affairs where the writeback error is held in the inode; if we can't evict the inode because it's holding the error indicator, that can send us OOM. If instead we transfer the error indicator to the superblock, then there's no problem.


From:   "Theodore Y. Ts'o" <[email protected]>
Date:   Thu, 12 Apr 2018 17:21:44 -0400

On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote:

On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:

I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.

Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.

When or how would the per-superblock wb_err flag get cleared?

Would all subsequent fsync() calls on that file system now return EIO? Or would only all subsequent syncfs() calls return EIO?

I don't see how that'd realistically trigger OOM or the inability to unmount a filesystem.

Ted's referring to the current state of affairs where the writeback error is held in the inode; if we can't evict the inode because it's holding the error indicator, that can send us OOM. If instead we transfer the error indicator to the superblock, then there's no problem.

Actually, I was referring to the pg-hackers original ask, which was that after an error, all of the dirty pages that couldn't be written out would stay dirty.

If it's only a single inode which is pinned in memory with the dirty flag, that's bad, but it's not as bad as pinning all of the memory pages for which there was a failed write. We would still need to invent some mechanism or define some semantic for when it would be OK to clear the per-inode flag and let the memory associated with that pinned inode get released, though.


From:   Matthew Wilcox <[email protected]>
Date:   Thu, 12 Apr 2018 14:24:32 -0700

On Thu, Apr 12, 2018 at 05:21:44PM -0400, Theodore Y. Ts'o wrote:

On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote:

On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:

I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.

Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.

When or how would the per-superblock wb_err flag get cleared?

That's not how errseq works, Ted ;-)

Would all subsequent fsync() calls on that file system now return EIO? Or would only all subsequent syncfs() calls return EIO?

Only ones which occur after the last sampling get reported through this particular file descriptor.


From:   Jeff Layton <[email protected]>
Date:   Thu, 12 Apr 2018 17:27:54 -0400

On Thu, 2018-04-12 at 13:24 -0700, Andres Freund wrote:

On 2018-04-12 07:09:14 -0400, Jeff Layton wrote:

On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:

On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:

While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.

I think there are two issues here - "fsync() on an fd that was just opened" and "persistent error state (without keeping dirty pages in memory)".

If there is background data writeback without an open file descriptor, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.

And that's horrible. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.

What are you expecting to happen in this case? Are you expecting a read error due to a writeback failure? Or are you just saying that we should be invalidating pages that failed to be written back, so that they can be re-read?

Yes, I'd hope for a read error after a writeback failure. I think that's sane behaviour. But I don't really care that much.

I'll have to respectfully disagree. Why should I interpret an error on a read() syscall to mean that writeback failed? Note that the data is still potentially intact.

What might make sense, IMO, is to just invalidate the pages that failed to be written back. Then you could potentially do a read to fault them in again (i.e. sync the pagecache and the backing store) and possibly redirty them for another try.

Note that you can detect this situation by checking the return code from fsync. It should report the latest error once per file description.
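
As a minimal illustration of the check Jeff describes (path and data are made up; short writes are treated as failures for brevity):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Write a buffer and surface any writeback error via fsync(). Per the
     * discussion above, the error is reported once per file description,
     * so a failure here has to be treated as final for this data. */
    static int write_durably(const char *path, const char *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd < 0)
            return -errno;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
            int err = errno;
            fprintf(stderr, "%s: %s\n", path, strerror(err));
            close(fd);
            return -err;
        }
        return close(fd) < 0 ? -errno : 0;
    }

    int main(void)
    {
        return write_durably("/tmp/fsync-demo", "hello\n", 6) ? 1 : 0;
    }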

At the very least some way to know that such a failure occurred from userland without having to parse the kernel log. As far as I understand, neither sync(2) (and thus sync(1)) nor syncfs(2) is guaranteed to report an error if it was encountered by writeback in the background.

If that's indeed true for syncfs(2), even if the fd has been opened before (which I can see how it could happen from an implementation POV, nothing would associate a random FD with failures on different files), it's really impossible to detect this stuff from userland without text parsing.

syncfs could use some work.

I'm warming to willy's idea to add a per-sb errseq_t. I think that might be a simple way to get better semantics here. Not sure how we want to handle the reporting end yet though...

We probably also need to consider how to better track metadata writeback errors (on e.g. ext2). We don't really do that properly quite yet either.

Even if it were just a per-fs /sys/$something file that'd return the current count of unreported errors in a filesystem-independent way, it'd be better than what we have right now.

1) figure out which /sys/$whatnot $directory belongs to
2) oldcount=$(cat /sys/$whatnot/unreported_errors)
3) filesystem operations in $directory
4) sync;sync;
5) newcount=$(cat /sys/$whatnot/unreported_errors)
6) test "$oldcount" -eq "$newcount" || die-with-horrible-message

Isn't beautiful to script, but it's also not absolutely terrible.


From:   Matthew Wilcox <[email protected]>
Date:   Thu, 12 Apr 2018 14:31:10 -0700

On Thu, Apr 12, 2018 at 05:14:54PM -0400, Jeff Layton wrote:

On Thu, 2018-04-12 at 13:28 -0700, Matthew Wilcox wrote:

On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:

I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.

Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.

Not a bad idea and shouldn't be too costly. mapping_set_error could flag the superblock one before or after the one in the mapping.

We'd need to define what happens if you interleave fsync and syncfs calls on the same inode though. How do we handle file->f_wb_err in that case? Would we need a second field in struct file to act as the per-sb error cursor?

Ooh. I hadn't thought that through. Bleh. I don't want to add a field to struct file for this uncommon case.

Maybe O_PATH could be used for this? It gets you a file descriptor on a particular filesystem, so syncfs() is defined, but it can't report a writeback error. So if you open something O_PATH, you can use the file's f_wb_err for the mapping's error cursor.


From:   Andres Freund <[email protected]>
Date:   Thu, 12 Apr 2018 14:37:56 -0700

On 2018-04-12 17:21:44 -0400, Theodore Y. Ts'o wrote:

On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote:

On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:

I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.

Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.

When or how would the per-superblock wb_err flag get cleared?

I don't think unmount + resettable via /sys would be an insane approach. Requiring explicit action to acknowledge data loss isn't a crazy concept. But I think that's something reasonable minds could disagree with.

Would all subsequent fsync() calls on that file system now return EIO? Or would only all subsequent syncfs() calls return EIO?

If it were tied to syncfs, I wonder if there's a way to have some errseq type logic. Store a per superblock (or whatever equivalent thing) errseq value of errors. For each fd calling syncfs() report the error once, but then store the current value in a separate per-fd field. And if that's considered too weird, only report the errors to fds that have been opened from before the error occurred.

I can see writing a tool 'pg_run_and_sync /directo /ries -- command' which opens an fd for each of the filesystems the directories reside on, and calls syncfs() after. That'd allow us to use backup/restore tools at least semi-safely.

I don't see that that'd realistically trigger OOM or the inability to unmount a filesystem.

Ted's referring to the current state of affairs where the writeback error is held in the inode; if we can't evict the inode because it's holding the error indicator, that can send us OOM. If instead we transfer the error indicator to the superblock, then there's no problem.

Actually, I was referring to the pg-hackers original ask, which was that after an error, all of the dirty pages that couldn't be written out would stay dirty.

Well, it's an open list, everyone can argue. And initially people didn't know the OOM explanation, and then it takes some time to revise one's priors :). I think it's a design question that reasonable people can disagree upon (if "hot" removed devices are handled by throwing data away regardless, at least). But as it's clearly not something viable, we can move on to something that can solve the problem.

If it's only a single inode which is pinned in memory with the dirty flag, that's bad, but it's not as bad as pinning all of the memory pages for which there was a failed write. We would still need to invent some mechanism or define some semantics for when it would be OK to clear the per-inode flag and let the memory associated with that pinned inode get released, though.

Yea, I agree that that's not obvious. One way would be to say that it's only automatically cleared when you unlink the file. A bit heavyhanded, but not too crazy.
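
A rough sketch of the kind of wrapper tool Andres mentions, under the assumption that syncfs() gains the proposed error reporting; the tool behaviour, the RESTORE_CMD environment variable, and the fixed fd limit are all invented for illustration:

    /* run_and_syncfs dir...: open an fd on each directory up front, run the
     * command given via $RESTORE_CMD, then syncfs() each fd and fail loudly
     * if any filesystem reports an error. */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int fds[64];
        int n = 0, rc = 0;

        for (int i = 1; i < argc && n < 64; i++, n++) {
            fds[n] = open(argv[i], O_RDONLY | O_DIRECTORY);
            if (fds[n] < 0) {
                fprintf(stderr, "open %s: %s\n", argv[i], strerror(errno));
                return 1;
            }
        }

        const char *cmd = getenv("RESTORE_CMD");  /* hypothetical */
        if (cmd && system(cmd) != 0)
            rc = 1;

        for (int i = 0; i < n; i++) {
            if (syncfs(fds[i]) < 0) {
                fprintf(stderr, "syncfs %s: %s\n", argv[i + 1], strerror(errno));
                rc = 1;
            }
            close(fds[i]);
        }
        return rc;
    }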


From:   "Theodore Y. Ts'o" <[email protected]>
Date:   Thu, 12 Apr 2018 17:52:52 -0400

On Thu, Apr 12, 2018 at 12:55:36PM -0700, Andres Freund wrote:

Any pointers to the underlying netlink mechanism? If we can force postgres to kill itself when such an error is detected (via a dedicated monitoring process), I'd personally be happy enough. It'd be nicer if we could associate that knowledge with particular filesystems etc (which'd possibly be hard through dm etc?), but this'd be much better than nothing.

Yeah, sorry, it never got upstreamed. It's not really all that complicated, it was just that there were some other folks who wanted to do something similar, and there was a round of bike-shedding several years ago, and nothing ever went upstream. Part of the problem was that our original scheme sent up information about file system-level corruption reports --- e.g., those stemming from calls to ext4_error() --- and lots of people had different ideas about how to get all of the possible information up in some structured format. (Think something like uerf from Digital's OSF/1.)

We did something really simple/stupid. We just sent essentially an ASCII text string out the netlink socket. That's because what we were doing before was essentially scraping the output of dmesg (e.g. /dev/kmsg).

That's actually probably the simplest thing to do, and it has the advantage that it will work even on ancient enterprise kernels that PG users are likely to want to use. So you will need to implement the dmesg text scraper anyway, and that's probably good enough for most use cases.

The problem really isn't about recovering from disk errors. Knowing about them is the crucial part. We do not want to give back clients the information that an operation succeeded, when it actually didn't. There could be improvements above that, but as long as it's guaranteed that "we" get the error (rather than just some kernel log we don't have access to, which looks different due to config etc), it's ok. We can throw our hands up in the air and give up.

Right, it's a little challenging because the actual regexp's you would need to use do vary from device driver to device driver. Fortunately nearly everything is a SCSI/SATA device these days, so there isn't that much variability.

Yea, agreed on all that. I don't think anybody actually involved in postgres wants to do anything like that. Seems far outside of postgres' remit.

Some people on the pg-hackers list were talking about wanting to retry the fsync() and hoping that would cause the write to somehow succeed. It's possible that might help, but it's not likely to be helpful in my experience.
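
For reference, a crude sketch of what the dmesg-scraping approach looks like from userspace; the substrings matched here are examples and, as Ted notes, the exact messages vary by driver:

    /* Crude /dev/kmsg scraper: print any record that looks like a block
     * I/O error. Reading /dev/kmsg starts at the beginning of the ring
     * buffer, returns one record per line, blocks waiting for new records,
     * and typically requires root (or kernel.dmesg_restrict=0). */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[8192];
        FILE *f = fopen("/dev/kmsg", "r");
        if (!f) {
            perror("/dev/kmsg");
            return 1;
        }
        while (fgets(line, sizeof(line), f)) {
            if (strstr(line, "I/O error") ||
                strstr(line, "Buffer I/O error"))
                fputs(line, stdout);
        }
        fclose(f);
        return 0;
    }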


From:   Andres Freund <[email protected]>
Date:   Thu, 12 Apr 2018 14:53:19 -0700

On 2018-04-12 17:27:54 -0400, Jeff Layton wrote:

On Thu, 2018-04-12 at 13:24 -0700, Andres Freund wrote:

At the very least some way to know that such a failure occurred from userland without having to parse the kernel log. As far as I understand, neither sync(2) (and thus sync(1)) nor syncfs(2) is guaranteed to report an error if it was encountered by writeback in the background.

If that's indeed true for syncfs(2), even if the fd has been opened before (which I can see how it could happen from an implementation POV, nothing would associate a random FD with failures on different files), it's really impossible to detect this stuff from userland without text parsing.

syncfs could use some work.

It's really too bad that it doesn't have a flags argument.

We probably also need to consider how to better track metadata writeback errors (on e.g. ext2). We don't really do that properly quite yet either.

Even if it were just a per-fs /sys/$something file that'd return the current count of unreported errors in a filesystem-independent way, it'd be better than what we have right now.

1) figure out which /sys/$whatnot $directory belongs to
2) oldcount=$(cat /sys/$whatnot/unreported_errors)
3) filesystem operations in $directory
4) sync;sync;
5) newcount=$(cat /sys/$whatnot/unreported_errors)
6) test "$oldcount" -eq "$newcount" || die-with-horrible-message

Isn't beautiful to script, but it's also not absolutely terrible.

ext4 seems to have something roughly like that (/sys/fs/ext4/$dev/errors_count), and by my reading it already seems to be incremented from the necessary places. By my reading XFS doesn't seem to have something similar.

Wouldn't be bad to standardize...
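
A minimal C version of the counter check sketched above, using the ext4 errors_count file Andres mentions; the dm-0 device name is just an example, and note Ted's caveat in the next message that this counter only covers filesystem inconsistencies, not data I/O errors:

    #include <stdio.h>
    #include <unistd.h>

    /* Read ext4's error counter for one device; "dm-0" is an example. */
    static long read_errors_count(void)
    {
        long count = -1;
        FILE *f = fopen("/sys/fs/ext4/dm-0/errors_count", "r");
        if (!f)
            return -1;
        if (fscanf(f, "%ld", &count) != 1)
            count = -1;
        fclose(f);
        return count;
    }

    int main(void)
    {
        long before = read_errors_count();
        /* ... filesystem operations in the directory of interest ... */
        sync();
        long after = read_errors_count();
        if (before < 0 || after < 0 || after != before)
            fprintf(stderr, "possible unreported filesystem errors\n");
        return 0;
    }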


From:   "Theodore Y. Ts'o" <[email protected]>
Date:   Thu, 12 Apr 2018 17:57:56 -0400

On Thu, Apr 12, 2018 at 02:53:19PM -0700, Andres Freund wrote:

Isn't beautiful to script, but it's also not absolutely terrible.

ext4 seems to have something roughly like that (/sys/fs/ext4/$dev/errors_count), and by my reading it already seems to be incremented from the necessary places.

This is only for file system inconsistencies noticed by the kernel. We don't bump that count for data block I/O errors.

The same idea could be used on a block device level. It would be pretty simple to maintain a counter for I/O errors, and when the last error was detected on a particular device. You could even break out and track read errors and write errors separately if that would be useful.

If you don't care what block was bad, but just that some I/O error had happened, a counter is definitely the simplest approach, and less hair to implement and use than something like a netlink channel or scraping dmesg....


From:   Andres Freund <[email protected]>
Date:   Thu, 12 Apr 2018 15:03:59 -0700

Hi,

On 2018-04-12 17:52:52 -0400, Theodore Y. Ts'o wrote:

We did something really simple/stupid. We just sent essentially an ASCII text string out the netlink socket. That's because what we were doing before was essentially scraping the output of dmesg (e.g. /dev/kmsg).

That's actually probably the simplest thing to do, and it has the advantage that it will work even on ancient enterprise kernels that PG users are likely to want to use. So you will need to implement the dmesg text scraper anyway, and that's probably good enough for most use cases.

The worst part of that is, as you mention below, needing to handle a lot of different error message formats. I guess it's reasonable enough if you control your hardware, but no such luck.

Aren't there quite realistic scenarios where one could miss kmsg style messages due to it being a ringbuffer?

Right, it's a little challenging because the actual regexp's you would need to use do vary from device driver to device driver. Fortunately nearly everything is a SCSI/SATA device these days, so there isn't that much variability.

There's also SAN / NAS type stuff - not all of that presents as a SCSI/SATA device, right?

Yea, agreed on all that. I don't think anybody actually involved in postgres wants to do anything like that. Seems far outside of postgres' remit.

Some people on the pg-hackers list were talking about wanting to retry the fsync() and hoping that would cause the write to somehow succeed. It's possible that might help, but it's not likely to be helpful in my experience.

Depends on the type of error and storage. ENOSPC, especially over NFS, has some reasonable chances of being cleared up. And for networked block storage it's also not impossible to think of scenarios where that'd work for EIO.

But I think besides hope of clearing up itself, it has the advantage that it trivially can give some feedback to the user. The user'll get back strerror(ENOSPC) with some decent SQL error code, which'll hopefully cause them to investigate (well, once monitoring detects high error rates). It's much nicer for the user to type COMMIT; get an appropriate error back etc, than if the database just commits suicide.


From:   Dave Chinner <[email protected]>
Date:   Fri, 13 Apr 2018 08:44:04 +1000

On Thu, Apr 12, 2018 at 11:08:50AM -0400, Jeff Layton wrote:

On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote:

On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:

When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?

There isn't a right thing. Whatever we do will be wrong for someone.

One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.

Not to mention a POSIX IO ordering violation. Seeing stale data after a "successful" write is simply not allowed.

I'm not so sure here, given that we're dealing with an error condition. Are we really obligated not to allow any changes to pages that we can't write back?

Posix says this about write():

  After a write() to a regular file has successfully returned:

     Any successful read() from each byte position in the file that
     was modified by that write shall return the data specified by
     the write() for that position until such byte positions are
     again modified.

IOWs, even if there is a later error, we told the user the write was successful, and so according to POSIX we are not allowed to wind back the data to what it was before the write() occurred.

Given that the pages are clean after these failures, we aren't doing this even today:

Suppose we're unable to do writes but can do reads vs. the backing store. After a wb failure, the page has the dirty bit cleared. If it gets kicked out of the cache before the read occurs, it'll have to be faulted back in. Poof -- your write just disappeared.

Yes - I was pointing out what the specification we supposedly conform to says about this behaviour, not that our current behaviour conforms to the spec. Indeed, have you even noticed xfs_aops_discard_page() and its surrounding context on page writeback submission errors?

To save you looking, XFS will trash the page contents completely on a filesystem level ->writepage error. It doesn't mark them "clean", doesn't attempt to redirty and rewrite them - it clears the uptodate state and may invalidate it completely. IOWs, the data written "successfully" to the cached page is now gone. It will be re-read from disk on the next read() call, in direct violation of the above POSIX requirements.

This is my point: we've done that in XFS knowing that we violate POSIX specifications in this specific corner case - it's the lesser of many evils we have to choose between. Hence if we choose to encode that behaviour as the general writeback IO error handling algorithm, then it needs to be done with the knowledge it is a specification violation. Not to mention be documented as a POSIX violation in the various relevant man pages and that this is how all filesystems will behave on async writeback error.....
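
For concreteness, the quoted requirement is the ordinary write-then-read expectation sketched below; the thread is about the corner case where a later asynchronous writeback failure causes the cached copy to be discarded, so a subsequent read can return the old on-disk data instead (path and data are made up):

    #include <assert.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* POSIX: once write() has returned success, a read() of those bytes
     * must return the written data until they are modified again. */
    int main(void)
    {
        char buf[5];
        int fd = open("/tmp/posix-demo", O_RDWR | O_CREAT | O_TRUNC, 0644);
        assert(fd >= 0);
        assert(write(fd, "hello", 5) == 5);
        assert(pread(fd, buf, 5, 0) == 5);
        assert(memcmp(buf, "hello", 5) == 0);  /* must hold per POSIX */
        close(fd);
        return 0;
    }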


From:   Jeff Layton <[email protected]>
Date:   Fri, 13 Apr 2018 08:56:38 -0400

On Thu, 2018-04-12 at 14:31 -0700, Matthew Wilcox wrote:

On Thu, Apr 12, 2018 at 05:14:54PM -0400, Jeff Layton wrote:

On Thu, 2018-04-12 at 13:28 -0700, Matthew Wilcox wrote:

On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:

I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.

Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.

Not a bad idea and shouldn't be too costly. mapping_set_error could flag the superblock one before or after the one in the mapping.

We'd need to define what happens if you interleave fsync and syncfs calls on the same inode though. How do we handle file->f_wb_err in that case? Would we need a second field in struct file to act as the per-sb error cursor?

Ooh. I hadn't thought that through. Bleh. I don't want to add a field to struct file for this uncommon case.

Maybe O_PATH could be used for this? It gets you a file descriptor on a particular filesystem, so syncfs() is defined, but it can't report a writeback error. So if you open something O_PATH, you can use the file's f_wb_err for the mapping's error cursor.

That might work.

It'd be a syscall behavioral change so we'd need to document that well. It's probably innocuous though -- I doubt we have a lot of callers in the field opening files with O_PATH and calling syncfs on them.


From:   Jeff Layton <[email protected]>
Date:   Fri, 13 Apr 2018 09:18:56 -0400

On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote:

On Thu, Apr 12, 2018 at 11:08:50AM -0400, Jeff Layton wrote:

On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote:

On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:

When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?

There isn't a right thing. Whatever we do will be wrong for someone.

One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.

Not to mention a POSIX IO ordering violation. Seeing stale data after a "successful" write is simply not allowed.

I'm not so sure here, given that we're dealing with an error condition. Are we really obligated not to allow any changes to pages that we can't write back?

Posix says this about write():

After a write() to a regular file has successfully returned:

 Any successful read() from each byte position in the file that
 was modified by that write shall return the data specified by
 the write() for that position until such byte positions are
 again modified.

IOWs, even if there is a later error, we told the user the write was successful, and so according to POSIX we are not allowed to wind back the data to what it was before the write() occurred.

Given that the pages are clean after these failures, we aren't doing this even today:

Suppose we're unable to do writes but can do reads vs. the backing store. After a wb failure, the page has the dirty bit cleared. If it gets kicked out of the cache before the read occurs, it'll have to be faulted back in. Poof -- your write just disappeared.

Yes - I was pointing out what the specification we supposedly conform to says about this behaviour, not that our current behaviour conforms to the spec. Indeed, have you even noticed xfs_aops_discard_page() and its surrounding context on page writeback submission errors?

To save you looking, XFS will trash the page contents completely on a filesystem level ->writepage error. It doesn't mark them "clean", doesn't attempt to redirty and rewrite them - it clears the uptodate state and may invalidate it completely. IOWs, the data written "successfully" to the cached page is now gone. It will be re-read from disk on the next read() call, in direct violation of the above POSIX requirements.

This is my point: we've done that in XFS knowing that we violate POSIX specifications in this specific corner case - it's the lesser of many evils we have to choose between. Hence if we choose to encode that behaviour as the general writeback IO error handling algorithm, then it needs to be done with the knowledge it is a specification violation. Not to mention be documented as a POSIX violation in the various relevant man pages and that this is how all filesystems will behave on async writeback error.....

Got it, thanks.

Yes, I think we ought to probably do the same thing globally. It's nice to know that xfs has already been doing this. That makes me feel better about making this behavior the gold standard for Linux filesystems.

So to summarize, at this point in the discussion, I think we want to consider doing the following:

  • better reporting from syncfs (report an error when even one inode failed to be written back since last syncfs call). We'll probably implement this via a per-sb errseq_t in some fashion, though there are some implementation issues to work out.
  • invalidate or clear uptodate flag on pages that experience writeback errors, across filesystems. Encourage this as standard behavior for filesystems and maybe add helpers to make it easier to do this.

Did I miss anything? Would that be enough to help the Pg usecase?

I don't see us ever being able to reasonably support its current expectation that writeback errors will be seen on fd's that were opened after the error occurred. That's a really thorny problem from an object lifetime perspective.


From:   Andres Freund <[email protected]>
Date:   Fri, 13 Apr 2018 06:25:35 -0700

Hi,

On 2018-04-13 09:18:56 -0400, Jeff Layton wrote:

Yes, I think we ought to probably do the same thing globally. It's nice to know that xfs has already been doing this. That makes me feel better about making this behavior the gold standard for Linux filesystems.

So to summarize, at this point in the discussion, I think we want to consider doing the following:

  • better reporting from syncfs (report an error when even one inode failed to be written back since last syncfs call). We'll probably implement this via a per-sb errseq_t in some fashion, though there are some implementation issues to work out.
  • invalidate or clear uptodate flag on pages that experience writeback errors, across filesystems. Encourage this as standard behavior for filesystems and maybe add helpers to make it easier to do this.

Did I miss anything? Would that be enough to help the Pg usecase?

I don't see us ever being able to reasonably support its current expectation that writeback errors will be seen on fd's that were opened after the error occurred. That's a really thorny problem from an object lifetime perspective.

It's not perfect, but I think the amount of hacky OS specific code should be acceptable. And it does allow for a wrapper tool that can be used around backup restores etc to syncfs all the necessary filesystems. Let me mull with others for a bit.


From:   Matthew Wilcox <[email protected]>
Date:   Fri, 13 Apr 2018 07:02:32 -0700

On Fri, Apr 13, 2018 at 09:18:56AM -0400, Jeff Layton wrote:

On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote:

To save you looking, XFS will trash the page contents completely on a filesystem level ->writepage error. It doesn't mark them "clean", doesn't attempt to redirty and rewrite them - it clears the uptodate state and may invalidate it completely. IOWs, the data written "successfully" to the cached page is now gone. It will be re-read from disk on the next read() call, in direct violation of the above POSIX requirements.

This is my point: we've done that in XFS knowing that we violate POSIX specifications in this specific corner case - it's the lesser of many evils we have to choose between. Hence if we choose to encode that behaviour as the general writeback IO error handling algorithm, then it needs to be done with the knowledge it is a specification violation. Not to mention be documented as a POSIX violation in the various relevant man pages and that this is how all filesystems will behave on async writeback error.....

Got it, thanks.

Yes, I think we ought to probably do the same thing globally. It's nice to know that xfs has already been doing this. That makes me feel better about making this behavior the gold standard for Linux filesystems.

So to summarize, at this point in the discussion, I think we want to consider doing the following:

  • better reporting from syncfs (report an error when even one inode failed to be written back since last syncfs call). We'll probably implement this via a per-sb errseq_t in some fashion, though there are some implementation issues to work out.
  • invalidate or clear uptodate flag on pages that experience writeback errors, across filesystems. Encourage this as standard behavior for filesystems and maybe add helpers to make it easier to do this.

Did I miss anything? Would that be enough to help the Pg usecase?

I don't see us ever being able to reasonably support its current expectation that writeback errors will be seen on fd's that were opened after the error occurred. That's a really thorny problem from an object lifetime perspective.

I think we can do better than XFS is currently doing (but I agree that we should have the same behaviour across all Linux filesystems!)

  1. If we get an error while wbc->for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.
  2. Background writebacks should skip pages which are PageError.
  3. for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.

I think kupdate writes are the same as for_background writes. for_reclaim is tougher. I don't want to see us getting into OOM because we're hanging onto stale data, but we don't necessarily have an open fd to report the error on. I think I'm leaning towards behaving the same for for_reclaim as for_sync, but this is probably a subject on which reasonable people can disagree.

And this logic all needs to be in one place, although invoked from each filesystem.


From:   Matthew Wilcox <[email protected]>
Date:   Fri, 13 Apr 2018 07:48:07 -0700

On Tue, Apr 10, 2018 at 03:07:26PM -0700, Andres Freund wrote:

I don't think that's the full issue. We can deal with the fact that an fsync failure is edge-triggered if there's a guarantee that every process doing so would get it. The fact that one needs to have an FD open from before any failing writes occurred to get a failure, THAT'S the big issue.

Beyond postgres, it's a pretty common approach to do work on a lot of files without fsyncing, then iterate over the directory, fsync everything, and then assume you're safe. But unless I severely misunderstand something, that'd only be safe if you kept an FD for every file open, which isn't realistic for pretty obvious reasons.

While accepting that under memory pressure we can still evict the error indicators, we can do a better job than we do today. The current design of error reporting says that all errors which occurred before you opened the file descriptor are of no interest to you. I don't think that's necessarily true, and it's actually a change of behaviour from before the errseq work.

Consider Stupid Task A which calls open(), write(), close(), and Smart Task B which calls open(), write(), fsync(), close() operating on the same file. If A goes entirely before B and encounters an error, before errseq_t, B would see the error from A's write.

If A and B overlap, even a little bit, then B still gets to see A's error today. But if writeback happens for A's write before B opens the file then B will never see the error.

B doesn't want to see historical errors that a previous invocation of B has already handled, but we know whether anyone has seen the error or not. So here's a patch which restores the historical behaviour of seeing old unhandled errors on a fresh file descriptor:

Signed-off-by: Matthew Wilcox [email protected]

diff --git a/lib/errseq.c b/lib/errseq.c
index df782418b333..093f1fba4ee0 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set);
 errseq_t errseq_sample(errseq_t *eseq)
 {
 	errseq_t old = READ_ONCE(*eseq);
-	errseq_t new = old;
 
-	/*
-	 * For the common case of no errors ever having been set, we can skip
-	 * marking the SEEN bit. Once an error has been set, the value will
-	 * never go back to zero.
-	 */
-	if (old != 0) {
-		new |= ERRSEQ_SEEN;
-		if (old != new)
-			cmpxchg(eseq, old, new);
-	}
-	return new;
+	/* If nobody has seen this error yet, then we can be the first. */
+	if (!(old & ERRSEQ_SEEN))
+		old = 0;
+	return old;
 }
 EXPORT_SYMBOL(errseq_sample);
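
To make the before/after behaviour concrete, here is a simplified userspace model of the reporting semantics described above; it is not the kernel's packed 32-bit errseq_t implementation, just an illustration of how sampling interacts with an error that nobody has seen yet:

    #include <errno.h>
    #include <stdio.h>

    struct errseq_model {
        int err;       /* last error recorded, e.g. -EIO */
        unsigned id;   /* bumped for each new error "event" */
        int seen;      /* has any sampler observed the current event? */
    };

    static void record_error(struct errseq_model *e, int err)
    {
        e->err = err;
        if (e->seen) {          /* previous event already reported: new event */
            e->id++;
            e->seen = 0;
        } else if (e->id == 0) {
            e->id = 1;          /* very first error */
        }
    }

    static unsigned sample_old(struct errseq_model *e)
    {
        if (e->id)
            e->seen = 1;        /* old rule: sampling marks the error seen */
        return e->id;           /* cursor starts at the current event */
    }

    static unsigned sample_new(struct errseq_model *e)
    {
        return e->seen ? e->id : 0;  /* patched rule: unseen errors stay reportable */
    }

    static int check_and_advance(struct errseq_model *e, unsigned *cursor)
    {
        if (e->id != *cursor) {
            *cursor = e->id;
            e->seen = 1;
            return e->err;      /* report the error once per cursor */
        }
        return 0;
    }

    int main(void)
    {
        struct errseq_model m_old = {0, 0, 0}, m_new = {0, 0, 0};

        /* Task A's write fails in background writeback before B opens the file. */
        record_error(&m_old, -EIO);
        record_error(&m_new, -EIO);

        unsigned b_old = sample_old(&m_old);  /* B opens the file, pre-patch rule */
        unsigned b_new = sample_new(&m_new);  /* B opens the file, patched rule */

        printf("pre-patch fsync reports: %d\n", check_and_advance(&m_old, &b_old)); /* 0 */
        printf("patched fsync reports:   %d\n", check_and_advance(&m_new, &b_new)); /* -EIO */
        return 0;
    }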

From:   Dave Chinner <[email protected]>
Date:   Sat, 14 Apr 2018 11:47:52 +1000

On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:

On Fri, Apr 13, 2018 at 09:18:56AM -0400, Jeff Layton wrote:

On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote:

To save you looking, XFS will trash the page contents completely on a filesystem level ->writepage error. It doesn't mark them "clean", doesn't attempt to redirty and rewrite them - it clears the uptodate state and may invalidate it completely. IOWs, the data written "successfully" to the cached page is now gone. It will be re-read from disk on the next read() call, in direct violation of the above POSIX requirements.

This is my point: we've done that in XFS knowing that we violate POSIX specifications in this specific corner case - it's the lesser of many evils we have to choose between. Hence if we choose to encode that behaviour as the general writeback IO error handling algorithm, then it needs to be done with the knowledge it is a specification violation. Not to mention be documented as a POSIX violation in the various relevant man pages and that this is how all filesystems will behave on async writeback error.....

Got it, thanks.

Yes, I think we ought to probably do the same thing globally. It's nice to know that xfs has already been doing this. That makes me feel better about making this behavior the gold standard for Linux filesystems.

So to summarize, at this point in the discussion, I think we want to consider doing the following:

  • better reporting from syncfs (report an error when even one inode failed to be written back since last syncfs call). We'll probably implement this via a per-sb errseq_t in some fashion, though there are some implementation issues to work out.

  • invalidate or clear uptodate flag on pages that experience writeback errors, across filesystems. Encourage this as standard behavior for filesystems and maybe add helpers to make it easier to do this.

Did I miss anything? Would that be enough to help the Pg usecase?

I don't see us ever being able to reasonably support its current expectation that writeback errors will be seen on fd's that were opened after the error occurred. That's a really thorny problem from an object lifetime perspective.

I think we can do better than XFS is currently doing (but I agree that we should have the same behaviour across all Linux filesystems!)

  1. If we get an error while wbc->for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.

So you're saying we should treat it as a transient error rather than a permanent error.

  2. Background writebacks should skip pages which are PageError.

That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?

e.g. XFS gets to enospc, runs out of reserve pool blocks so can't allocate space to write back the page, then space is freed up a few seconds later and so the next write will work just fine.

This is a recipe for "I lost data that I wrote /days/ before the system crashed" bug reports.

  3. for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.

Which may well be unmount. Are we really going to wait until unmount to report fatal errors?

We used to do this with XFS metadata. We'd just keep trying to write metadata and keep the filesystem running (because it's consistent in memory and it might be a transient error) rather than shutting down the filesystem after a couple of retries. The result was that users wouldn't notice there were problems until unmount, and the most common symptom of that was "why is system shutdown hanging?".

We now don't hang at unmount by default:

$ cat /sys/fs/xfs/dm-0/error/fail_at_unmount 
1
$

And we treat different errors according to their seriousness. EIO and device ENOSPC we default to retry forever because they are often transient, but for ENODEV we fail and shutdown immediately (someone pulled the USB stick out). metadata failure behaviour is configured via changing fields in /sys/fs/xfs//error/metadata//...

We've planned to extend this failure configuration to data IO, too, but never quite got around to it yet. this is a clear example of "one size doesn't fit all" and I think we'll end up doing the same sort of error behaviour configuration in XFS for these cases. (i.e. /sys/fs/xfs//error/writeback//....)

And this logic all needs to be in one place, although invoked from each filesystem.

Perhaps so, but as there's no "one-size-fits-all" behaviour, I really want to extend the XFS error config infrastructure to control what the filesystem does on error here.


From:   Andres Freund <[email protected]>
Date:   Fri, 13 Apr 2018 19:04:33 -0700

Hi,

On 2018-04-14 11:47:52 +1000, Dave Chinner wrote:

And we treat different errors according to their seriousness. EIO and device ENOSPC we default to retry forever because they are often transient, but for ENODEV we fail and shutdown immediately (someone pulled the USB stick out). metadata failure behaviour is configured via changing fields in /sys/fs/xfs//error/metadata//...

We've planned to extend this failure configuration to data IO, too, but never quite got around to it yet. this is a clear example of "one size doesn't fit all" and I think we'll end up doing the same sort of error behaviour configuration in XFS for these cases. (i.e. /sys/fs/xfs//error/writeback//....)

Have you considered adding an ext/fat/jfs errors=remount-ro/panic/continue style mount parameter?


From:   Matthew Wilcox <[email protected]>
Date:   Fri, 13 Apr 2018 19:38:14 -0700

On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:

On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:

  1. If we get an error while wbc->for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.

So you're saying we should treat it as a transient error rather than a permanent error.

Yes, I'm proposing leaving the data in memory in case the user wants to try writing it somewhere else.

  2. Background writebacks should skip pages which are PageError.

That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?

That's fair. What I want to avoid is triggering the same error every 30 seconds (or whatever the periodic writeback threshold is set to).

e.g. XFS gets to enospc, runs out of reserve pool blocks so can't allocate space to write back the page, then space is freed up a few seconds later and so the next write will work just fine.

This is a recipe for "I lost data that I wrote /days/ before the system crashed" bug reports.

So ... exponential backoff on retries?

  3. for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.

Which may well be unmount. Are we really going to wait until unmount to report fatal errors?

Goodness, no. The errors would be immediately reportable using the wb_err mechanism, as soon as the first error was encountered.



From:   [email protected] (J. Bruce Fields)
Date:   Wed, 18 Apr 2018 12:52:19 -0400

Theodore Y. Ts'o - 10.04.18, 20:43:

First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentially be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.

Pointers to documentation or papers or anything? The only google results I can find for "power fail certified" are your posts.

I've always been confused by SSD power-loss protection, as nobody seems completely clear whether it's a safety or a performance feature.


From:   [email protected] (J. Bruce Fields)
Date:   Wed, 18 Apr 2018 14:09:03 -0400

On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:

Hi,

On 2018-04-11 15:52:44 -0600, Andreas Dilger wrote:

On Apr 10, 2018, at 4:07 PM, Andres Freund [email protected] wrote:

2018-04-10 18:43:56 Ted wrote:

So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.

That's a bit of a cop out. It's not just databases that care. Even more basic tools like SCM, package managers and editors care whether they can get proper responses back from fsync that imply things actually were synced.

Sure, but it is mostly PG that is doing (IMHO) crazy things like writing to thousands(?) of files, closing the file descriptors, then expecting fsync() on a newly-opened fd to return a historical error.

It's not just postgres. dpkg (underlying apt, on Debian-derived distros), to take an example I just randomly guessed, does too:

    /* We want to guarantee the extracted files are on the disk, so that the
     * subsequent renames to the info database do not end up with old or zero
     * length files in case of a system crash. As neither dpkg-deb nor tar do
     * explicit fsync()s, we have to do them here.
     * XXX: This could be avoided by switching to an internal tar extractor. */
    dir_sync_contents(cidir);

(a bunch of other places too)

Especially on ext3, but also on newer filesystems, it's performance-wise entirely infeasible to fsync() every single file individually - the performance becomes entirely atrocious if you do that.

Is that still true if you're able to use some kind of parallelism? (async io, or fsync from multiple processes?)
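
For reference, the "work on many files, then fsync everything under the directory" pattern described earlier in the thread looks roughly like the sketch below when done sequentially; Bruce's question is whether issuing these fsyncs in parallel makes the cost bearable:

    #define _DEFAULT_SOURCE
    #include <dirent.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* fsync every regular file directly inside dirpath, then the directory
     * itself. Errors are printed and counted, not retried. d_type may be
     * DT_UNKNOWN on some filesystems; this sketch ignores that case. */
    static int fsync_dir_contents(const char *dirpath)
    {
        int failures = 0;
        DIR *d = opendir(dirpath);
        if (!d)
            return -1;
        int dfd = dirfd(d);
        struct dirent *de;
        while ((de = readdir(d)) != NULL) {
            if (de->d_type != DT_REG)
                continue;
            int fd = openat(dfd, de->d_name, O_RDONLY);
            if (fd < 0)
                continue;
            if (fsync(fd) < 0) {
                perror(de->d_name);
                failures++;
            }
            close(fd);
        }
        if (fsync(dfd) < 0)   /* persist the directory entries too */
            failures++;
        closedir(d);
        return failures;
    }

    int main(int argc, char **argv)
    {
        return argc > 1 ? (fsync_dir_contents(argv[1]) != 0) : 0;
    }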


From:   Dave Chinner <[email protected]>
Date:   Thu, 19 Apr 2018 09:59:50 +1000

On Fri, Apr 13, 2018 at 07:04:33PM -0700, Andres Freund wrote:

Hi,

On 2018-04-14 11:47:52 +1000, Dave Chinner wrote:

And we treat different errors according to their seriousness. EIO and device ENOSPC we default to retry forever because they are often transient, but for ENODEV we fail and shutdown immediately (someone pulled the USB stick out). metadata failure behaviour is configured via changing fields in /sys/fs/xfs//error/metadata//...

We've planned to extend this failure configuration to data IO, too, but never quite got around to it yet. this is a clear example of "one size doesn't fit all" and I think we'll end up doing the same sort of error behaviour configuration in XFS for these cases. (i.e. /sys/fs/xfs//error/writeback//....)

Have you considered adding an ext/fat/jfs errors=remount-ro/panic/continue style mount parameter?

That's for metadata writeback error behaviour, not data writeback IO errors.

We are definitely not planning to add mount options to configure IO error behaviors. Mount options are a horrible way to configure filesystem behaviour and we've already got other, fine-grained configuration infrastructure for configuring IO error behaviour. Which, as I just pointed out, was designed to be extended to data writeback and other operational error handling in the filesystem (e.g. dealing with ENOMEM in different ways).


From:   Dave Chinner <[email protected]>
Date:   Thu, 19 Apr 2018 10:13:43 +1000

On Fri, Apr 13, 2018 at 07:38:14PM -0700, Matthew Wilcox wrote:

On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:

On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:

  1. If we get an error while wbc->for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.

So you're saying we should treat it as a transient error rather than a permanent error.

Yes, I'm proposing leaving the data in memory in case the user wants to try writing it somewhere else.

And if it's getting IO errors because of USB stick pull? What then?

  2. Background writebacks should skip pages which are PageError.

That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?

That's fair. What I want to avoid is triggering the same error every 30 seconds (or whatever the periodic writeback threshold is set to).

So if kernel ring buffer overflows and so users miss the first error report, they'll have no idea that the data writeback is still failing?

e.g. XFS gets to enospc, runs out of reserve pool blocks so can't allocate space to write back the page, then space is freed up a few seconds later and so the next write will work just fine.

This is a recipe for "I lost data that I wrote /days/ before the system crashed" bug reports.

So ... exponential backoff on retries?

Maybe, but I don't think that actually helps anything and adds yet more "when should we write this" complication to inode writeback....

  3. for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.

Which may well be unmount. Are we really going to wait until unmount to report fatal errors?

Goodness, no. The errors would be immediately reportable using the wb_err mechanism, as soon as the first error was encountered.

But if there are no open files when the error occurs, that error won't get reported to anyone. Which means the next time anyone accesses that inode from a user context could very well be unmount or a third party sync/syncfs()....


From:   Eric Sandeen <[email protected]>
Date:   Wed, 18 Apr 2018 19:23:46 -0500

On 4/18/18 6:59 PM, Dave Chinner wrote:

On Fri, Apr 13, 2018 at 07:04:33PM -0700, Andres Freund wrote:

Hi,

On 2018-04-14 11:47:52 +1000, Dave Chinner wrote:

And we treat different errors according to their seriousness. EIO and device ENOSPC we default to retry forever because they are often transient, but for ENODEV we fail and shutdown immediately (someone pulled the USB stick out). metadata failure behaviour is configured via changing fields in /sys/fs/xfs//error/metadata//...

We've planned to extend this failure configuration to data IO, too, but never quite got around to it yet. this is a clear example of "one size doesn't fit all" and I think we'll end up doing the same sort of error behaviour configuration in XFS for these cases. (i.e. /sys/fs/xfs//error/writeback//....)

Have you considered adding an ext/fat/jfs errors=remount-ro/panic/continue style mount parameter?

That's for metadata writeback error behaviour, not data writeback IO errors.

/me points casually at data_err=abort & data_err=ignore in ext4...

       data_err=ignore
              Just print an error message if an error occurs in a file data buffer in ordered mode.

       data_err=abort
              Abort the journal if an error occurs in a file data buffer in ordered mode.

Just sayin'

We are definitely not planning to add mount options to configure IO error behaviors. Mount options are a horrible way to configure filesystem behaviour and we've already got other, fine-grained configuration infrastructure for configuring IO error behaviour. Which, as I just pointed out, was designed to be extended to data writeback and other operational error handling in the filesystem (e.g. dealing with ENOMEM in different ways).

I don't disagree, but there are already mount-option knobs in ext4, FWIW.


From:   Matthew Wilcox <[email protected]>
Date:   Wed, 18 Apr 2018 17:40:37 -0700

On Thu, Apr 19, 2018 at 10:13:43AM +1000, Dave Chinner wrote:

On Fri, Apr 13, 2018 at 07:38:14PM -0700, Matthew Wilcox wrote:

On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:

On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:

  1. If we get an error while wbc->for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.

So you're saying we should treat it as a transient error rather than a permanent error.

Yes, I'm proposing leaving the data in memory in case the user wants to try writing it somewhere else.

And if it's getting IO errors because of USB stick pull? What then?

I've been thinking about this. Ideally we want to pass some kind of notification all the way up to the desktop and tell the user to plug the damn stick back in. Then have the USB stick become the same blockdev that it used to be, and complete the writeback. We are so far from being able to do that right now that it's not even funny.

  2. Background writebacks should skip pages which are PageError.

That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?

That's fair. What I want to avoid is triggering the same error every 30 seconds (or whatever the periodic writeback threshold is set to).

So if kernel ring buffer overflows and so users miss the first error report, they'll have no idea that the data writeback is still failing?

I wasn't thinking about kernel ringbuffer based reporting; I was thinking about errseq_t based reporting, so the application can tell the fsync failed and maybe does something application-level to recover like send the transactions across to another node in the cluster (or whatever this hypothetical application is).

  3. for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.

Which may well be unmount. Are we really going to wait until unmount to report fatal errors?

Goodness, no. The errors would be immediately reportable using the wb_err mechanism, as soon as the first error was encountered.

But if there are no open files when the error occurs, that error won't get reported to anyone. Which means the next time anyone accesses that inode from a user context could very well be unmount or a third party sync/syncfs()....

Right. But then that's on the application.


From:   "Theodore Y. Ts'o" <[email protected]>
Date:   Wed, 18 Apr 2018 21:08:19 -0400

On Wed, Apr 18, 2018 at 05:40:37PM -0700, Matthew Wilcox wrote:

I've been thinking about this. Ideally we want to pass some kind of notification all the way up to the desktop and tell the user to plug the damn stick back in. Then have the USB stick become the same blockdev that it used to be, and complete the writeback. We are so far from being able to do that right now that it's not even funny.

Maybe we shouldn't be trying to do any of this in the kernel, or at least as little as possible in the kernel? Perhaps it would be better to do most of this as a device mapper hack; I suspect we'll need userspace help to figure out whether the user has plugged the same USB stick in, or a different USB stick, anyway.



From:   Christoph Hellwig <[email protected]>
Date:   Thu, 19 Apr 2018 01:39:04 -0700

On Wed, Apr 18, 2018 at 12:52:19PM -0400, J. Bruce Fields wrote:

Theodore Y. Ts'o - 10.04.18, 20:43:

First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentially be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.

Pointers to documentation or papers or anything? The only google results I can find for "power fail certified" are your posts.

I've always been confused by SSD power-loss protection, as nobody seems completely clear whether it's a safety or a performance feature.

Devices from reputable vendors should always be power fail safe, bugs notwithstanding. What power-loss protection in marketing slides usually means is that an SSD has a non-volatile write cache. That is once a write is ACKed data is persisted and no additional cache flush needs to be sent. This is a feature only available in expensive enterprise SSDs as the required capacitors are expensive. Cheaper consumer or boot drive SSDs have a volatile write cache, that is we need to do a separate cache flush to persist data (REQ_OP_FLUSH in Linux). But a reasonable implementation of those still won't corrupt previously written data, they will just lose the volatile write cache that hasn't been flushed. Occasional bugs, bad actors or other issues might still happen.


From:   "J. Bruce Fields" <[email protected]>
Date:   Thu, 19 Apr 2018 10:10:16 -0400

On Thu, Apr 19, 2018 at 01:39:04AM -0700, Christoph Hellwig wrote:

On Wed, Apr 18, 2018 at 12:52:19PM -0400, J. Bruce Fields wrote:

Theodore Y. Ts'o - 10.04.18, 20:43:

First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentially be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.

Pointers to documentation or papers or anything? The only google results I can find for "power fail certified" are your posts.

I've always been confused by SSD power-loss protection, as nobody seems completely clear whether it's a safety or a performance feature.

Devices from reputable vendors should always be power fail safe, bugs notwithstanding. What power-loss protection in marketing slides usually means is that an SSD has a non-volatile write cache. That is once a write is ACKed data is persisted and no additional cache flush needs to be sent. This is a feature only available in expensive enterprise SSDs as the required capacitors are expensive. Cheaper consumer or boot drive SSDs have a volatile write cache, that is we need to do a separate cache flush to persist data (REQ_OP_FLUSH in Linux). But a reasonable implementation of those still won't corrupt previously written data, they will just lose the volatile write cache that hasn't been flushed. Occasional bugs, bad actors or other issues might still happen.

Thanks! That was my understanding too. But then the name is terrible. As is all the vendor documentation I can find:

https://insights.samsung.com/2016/03/22/power-loss-protection-how-ssds-are-protecting-data-integrity-white-paper/

"Power loss protection is a critical aspect of ensuring data integrity, especially in servers or data centers."

https://www.intel.com/content/.../ssd-320-series-power-loss-data-protection-brief.pdf

"Data safety features prepare for unexpected power-loss and protect system and user data."

Why do they all neglect to mention that their consumer drives are also perfectly capable of well-defined behavior after power loss, just at the expense of flush performance? It's ridiculously confusing.


From:   Matthew Wilcox <[email protected]>
Date:   Thu, 19 Apr 2018 10:40:10 -0700

On Wed, Apr 18, 2018 at 09:08:19PM -0400, Theodore Y. Ts'o wrote:

On Wed, Apr 18, 2018 at 05:40:37PM -0700, Matthew Wilcox wrote:

I've been thinking about this. Ideally we want to pass some kind of notification all the way up to the desktop and tell the user to plug the damn stick back in. Then have the USB stick become the same blockdev that it used to be, and complete the writeback. We are so far from being able to do that right now that it's not even funny.

Maybe we shouldn't be trying to do any of this in the kernel, or at least as little as possible in the kernel? Perhaps it would be better to do most of this as a device mapper hack; I suspect we'll need userspace help to figure out whether the user has plugged the same USB stick in, or a different USB stick, anyway.

The device mapper target (dm-removable?) was my first idea too, but I kept thinking through use cases and I think we end up wanting this functionality in the block layer. Let's try a story.

Stephen the PFY goes into the data centre looking to hotswap a failed drive. Due to the eight pints of lager he had for lunch, he pulls out the root drive instead of the failed drive. The air raid siren warbles and he realises his mistake, shoving the drive back in.

CYOA:

Currently: All writes are lost, calamities ensue. The PFY is fired.

With dm-removable: Nobody thought to set up dm-removable on the root drive. Calamities still ensue, but now it's the BOFH's fault instead of the PFY's fault.

Built into the block layer: After a brief hiccup while we reattach the drive to its block_device, the writes resume and nobody loses their job.


From:   "Theodore Y. Ts'o" <[email protected]>
Date:   Thu, 19 Apr 2018 19:27:15 -0400

On Thu, Apr 19, 2018 at 10:40:10AM -0700, Matthew Wilcox wrote:

With dm-removable: Nobody thought to set up dm-removable on the root drive. Calamities still ensue, but now it's the BOFH's fault instead of the PFY's fault.

Built into the block layer: After a brief hiccup while we reattach the drive to its block_device, the writes resume and nobody loses their job.

What you're talking about is a deployment issue, though. Ultimately the distribution will set up dm-removable automatically if the user requests it, much like it sets up dm-crypt automatically for laptop users upon request.

My concern is that not all removable devices have a globally unique id number available in hardware so the kernel can tell whether or not it's the same device that has been plugged in. There are heuristics you could use -- for example, you could look at the file system uuid plus the last fsck time. But they tend to be very file system specific, and not things we would want to have in the kernel.


From:   Dave Chinner <[email protected]>
Date:   Fri, 20 Apr 2018 09:28:59 +1000

On Wed, Apr 18, 2018 at 05:40:37PM -0700, Matthew Wilcox wrote:

On Thu, Apr 19, 2018 at 10:13:43AM +1000, Dave Chinner wrote:

On Fri, Apr 13, 2018 at 07:38:14PM -0700, Matthew Wilcox wrote:

On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:

On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:

  1. If we get an error while wbc->for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.

So you're saying we should treat it as a transient error rather than a permanent error.

Yes, I'm proposing leaving the data in memory in case the user wants to try writing it somewhere else.

And if it's getting IO errors because of USB stick pull? What then?

I've been thinking about this. Ideally we want to pass some kind of notification all the way up to the desktop and tell the user to plug the damn stick back in. Then have the USB stick become the same blockdev that it used to be, and complete the writeback. We are so far from being able to do that right now that it's not even funny.

nod

But in the meantime, device unplug (should give ENODEV, not EIO) is a fatal error and we need to toss away the data.

  1. Background writebacks should skip pages which are PageError.

That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?

That's fair. What I want to avoid is triggering the same error every 30 seconds (or whatever the periodic writeback threshold is set to).

So if kernel ring buffer overflows and so users miss the first error report, they'll have no idea that the data writeback is still failing?

I wasn't thinking about kernel ringbuffer based reporting; I was thinking about errseq_t based reporting, so the application can tell the fsync failed and maybe does something application-level to recover like send the transactions across to another node in the cluster (or whatever this hypothetical application is).

But if it's still failing, then we should be still trying to report the error. i.e. if fsync fails and the page remains dirty, then the next attempt to write it is a new error and fsync should report that. IOWs, I think we should be returning errors at every occasion errors need to be reported if we have a persistent writeback failure...

  1. for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.

Which may well be unmount. Are we really going to wait until unmount to report fatal errors?

Goodness, no. The errors would be immediately reportable using the wb_err mechanism, as soon as the first error was encountered.

But if there are no open files when the error occurs, that error won't get reported to anyone. Which means the next time anyone accesses that inode from a user context could very well be unmount or a third party sync/syncfs()....

Right. But then that's on the application.

Which we know don't do the right thing. Seems like a lot of hoops to jump through given it still won't work if the application isn't changed to support linux specific error handling requirements...


From:   Jan Kara <[email protected]>
Date:   Sat, 21 Apr 2018 18:59:54 +0200

On Fri 13-04-18 07:48:07, Matthew Wilcox wrote:

On Tue, Apr 10, 2018 at 03:07:26PM -0700, Andres Freund wrote:

I don't think that's the full issue. We can deal with the fact that an fsync failure is edge-triggered if there's a guarantee that every process doing so would get it. The fact that one needs to have an FD open from before any failing writes occurred to get a failure, THAT'S the big issue.

Beyond postgres, it's a pretty common approach to do work on a lot of files without fsyncing, then iterate over the directory, fsync everything, and then assume you're safe. But unless I severely misunderstand something that'd only be safe if you kept an FD for every file open, which isn't realistic for pretty obvious reasons.

While accepting that under memory pressure we can still evict the error indicators, we can do a better job than we do today. The current design of error reporting says that all errors which occurred before you opened the file descriptor are of no interest to you. I don't think that's necessarily true, and it's actually a change of behaviour from before the errseq work.

Consider Stupid Task A which calls open(), write(), close(), and Smart Task B which calls open(), write(), fsync(), close() operating on the same file. If A goes entirely before B and encounters an error, before errseq_t, B would see the error from A's write.

If A and B overlap, even a little bit, then B still gets to see A's error today. But if writeback happens for A's write before B opens the file then B will never see the error.

B doesn't want to see historical errors that a previous invocation of B has already handled, but we know whether anyone has seen the error or not. So here's a patch which restores the historical behaviour of seeing old unhandled errors on a fresh file descriptor:

Signed-off-by: Matthew Wilcox [email protected]

So I agree with going to the old semantics of reporting errors from before a file was open at least once to someone. As the PG case shows apps are indeed relying on the old behavior. As much as it is unreliable, it ends up doing the right thing for these apps in 99% of cases and we shouldn't break them (BTW IMO the changelog should contain a note that this fixes a regression of PostgreSQL, a reference to this thread and CC to stable). Anyway feel free to add:

Reviewed-by: Jan Kara [email protected]

Oh, and to make myself clear I do think we need to find a better way of reporting IO errors. I consider this just an immediate band-aid to avoid userspace regressions.

diff --git a/lib/errseq.c b/lib/errseq.c
index df782418b333..093f1fba4ee0 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set);
 errseq_t errseq_sample(errseq_t *eseq)
 {
         errseq_t old = READ_ONCE(*eseq);
-        errseq_t new = old;
 
-        /*
-         * For the common case of no errors ever having been set, we can skip
-         * marking the SEEN bit. Once an error has been set, the value will
-         * never go back to zero.
-         */
-        if (old != 0) {
-                new |= ERRSEQ_SEEN;
-                if (old != new)
-                        cmpxchg(eseq, old, new);
-        }
-        return new;
+        /* If nobody has seen this error yet, then we can be the first. */
+        if (!(old & ERRSEQ_SEEN))
+                old = 0;
+        return old;
 }
 
 EXPORT_SYMBOL(errseq_sample);
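
(Not part of the quoted thread: a toy Python model of the semantics this patch changes, using a simplified (counter, seen, error) triple rather than the real packed errseq_t layout; the class and method names are mine. The point is that with the old sampling behavior an error that occurred before anyone had the file open is silently marked seen, while with the new behavior the first opener after an unseen error still gets it reported.)

    class ErrSeq:
        """Toy model of errseq_t; not the real bit layout."""
        def __init__(self):
            self.counter = 0   # bumped when a new error is recorded
            self.error = 0     # most recent error code (0 = none)
            self.seen = True   # has anyone observed the current error?

        def set(self, err):
            """Models errseq_set(): record a writeback error."""
            self.error = err
            self.counter += 1
            self.seen = False

        def sample_old(self):
            """Pre-patch open(): snapshot the current state; an existing
            error gets marked seen even though nobody reported it."""
            if self.error:
                self.seen = True
            return (self.counter, self.error)

        def sample_new(self):
            """Post-patch open(): if nobody has seen the error yet, return
            a zero sample so a later fsync on this fd still reports it."""
            return (0, 0) if not self.seen else (self.counter, self.error)

        def check(self, sample):
            """Models fsync()/errseq_check(): report an error if anything
            has changed since the sample was taken."""
            if self.error and (self.counter, self.error) != sample:
                self.seen = True
                return self.error
            return 0

    es = ErrSeq()
    es.set(5)                    # EIO during writeback; no fd was open at the time
    fd_sample = es.sample_new()  # Smart Task B opens the file afterwards
    print(es.check(fd_sample))   # 5 -- with sample_old() this would print 0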

From:   Jan Kara <[email protected]>
Date:   Sat, 21 Apr 2018 20:14:29 +0200

On Thu 12-04-18 07:09:14, Jeff Layton wrote:

On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:

At the moment, when we open a file, we sample the current state of the writeback error and only report new errors. We could set it to zero instead, and report the most recent error as soon as anything happens which would report an error. That way err = close(open("file")); would report the most recent error.

That's not going to be persistent across the data structure for that inode being removed from memory; we'd need filesystem support for persisting that. But maybe it's "good enough" to only support it for recent files.

Jeff, what do you think?

I hate it :). We could do that, but....yecchhhh.

Reporting errors only in the case where the inode happened to stick around in the cache seems too unreliable for real-world usage, and might be problematic for some use cases. I'm also not sure it would really be helpful.

So this is never going to be perfect but I think we could do good enough by: 1) Mark inodes that hit IO error. 2) If the inode gets evicted from memory we store the fact that we hit an error for this IO in a more space efficient data structure (sparse bitmap, radix tree, extent tree, whatever). 3) If the underlying device gets destroyed, we can just switch the whole SB to an error state and forget per inode info. 4) If there's too much of per-inode error info (probably per-fs configurable limit in terms of number of inodes), we would yell in the kernel log, switch the whole fs to the error state and forget per inode info.

This way there won't be silent loss of IO errors. Memory usage would be reasonably limited. It could happen the whole fs would switch to error state "prematurely" but if that's a problem for the machine, admin could tune the limit for number of inodes to keep IO errors for...

I think the crux of the matter here is not really about error reporting, per-se.

I think this is related but a different question.

I asked this at LSF last year, and got no real answer:

When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?

One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.

Maybe that's ok in the face of a writeback error though? IDK.

I can see the admin wanting to rather kill the machine with OOM than having to deal with data loss due to IO errors (e.g. if he has HA server fail over set up). Or retry for some time before dropping the dirty data. Or do what we do now (possibly with invalidating pages as you say). As Dave said elsewhere there's not one strategy that's going to please everybody. So it might be beneficial to have this configurable like XFS has it for metadata.

OTOH if I look at the problem from application developer POV, most apps will just declare game over at the face of IO errors (if they take care to check for them at all). And the sophisticated apps that will try some kind of error recovery have to be prepared that the data is just gone (as depending on what exactly the kernel does is rather fragile) so I'm not sure how much practical value the configurable behavior on writeback errors would bring.


Computer latency: 1977-2017

2017-12-24 08:00:00

I've had this nagging feeling that the computers I use today feel slower than the computers I used as a kid. As a rule, I don’t trust this kind of feeling because human perception has been shown to be unreliable in empirical studies, so I carried around a high-speed camera and measured the response latency of devices I’ve run into in the past few months. Here are the results:

computer | latency (ms) | year | clock | # transistors
apple 2e | 30 | 1983 | 1 MHz | 3.5k
ti 99/4a | 40 | 1981 | 3 MHz | 8k
custom haswell-e 165Hz | 50 | 2014 | 3.5 GHz | 2G
commodore pet 4016 | 60 | 1977 | 1 MHz | 3.5k
sgi indy | 60 | 1993 | .1 GHz | 1.2M
custom haswell-e 120Hz | 60 | 2014 | 3.5 GHz | 2G
thinkpad 13 chromeos | 70 | 2017 | 2.3 GHz | 1G
imac g4 os 9 | 70 | 2002 | .8 GHz | 11M
custom haswell-e 60Hz | 80 | 2014 | 3.5 GHz | 2G
mac color classic | 90 | 1993 | 16 MHz | 273k
powerspec g405 linux 60Hz | 90 | 2017 | 4.2 GHz | 2G
macbook pro 2014 | 100 | 2014 | 2.6 GHz | 700M
thinkpad 13 linux chroot | 100 | 2017 | 2.3 GHz | 1G
lenovo x1 carbon 4g linux | 110 | 2016 | 2.6 GHz | 1G
imac g4 os x | 120 | 2002 | .8 GHz | 11M
custom haswell-e 24Hz | 140 | 2014 | 3.5 GHz | 2G
lenovo x1 carbon 4g win | 150 | 2016 | 2.6 GHz | 1G
next cube | 150 | 1988 | 25 MHz | 1.2M
powerspec g405 linux | 170 | 2017 | 4.2 GHz | 2G
packet around the world | 190 | | |
powerspec g405 win | 200 | 2017 | 4.2 GHz | 2G
symbolics 3620 | 300 | 1986 | 5 MHz | 390k

These are tests of the latency between a keypress and the display of a character in a terminal (see appendix for more details). The results are sorted from quickest to slowest. In the latency column, the background goes from green to yellow to red to black as devices get slower. No devices are green. When multiple OSes were tested on the same machine, the OS is in bold. When multiple refresh rates were tested on the same machine, the refresh rate is in italics.

In the year column, the background gets darker and purple-er as devices get older. If older devices were slower, we’d see the year column get darker as we read down the chart.

The next two columns show the clock speed and number of transistors in the processor. Smaller numbers are darker and blue-er. As above, if slower-clocked and smaller chips correlated with longer latency, the columns would get darker as we go down the table, but, if anything, it seems to be the other way around.

For reference, the latency of a packet going around the world through fiber from NYC back to NYC via Tokyo and London is inserted in the table.

If we look at overall results, the fastest machines are ancient. Newer machines are all over the place. Fancy gaming rigs with unusually high refresh-rate displays are almost competitive with machines from the late 70s and early 80s, but “normal” modern computers can’t compete with thirty to forty year old machines.

We can also look at mobile devices. In this case, we’ll look at scroll latency in the browser:

device | latency (ms) | year
ipad pro 10.5" pencil | 30 | 2017
ipad pro 10.5" | 70 | 2017
iphone 4s | 70 | 2011
iphone 6s | 70 | 2015
iphone 3gs | 70 | 2009
iphone x | 80 | 2017
iphone 8 | 80 | 2017
iphone 7 | 80 | 2016
iphone 6 | 80 | 2014
gameboy color | 80 | 1998
iphone 5 | 90 | 2012
blackberry q10 | 100 | 2013
huawei honor 8 | 110 | 2016
google pixel 2 xl | 110 | 2017
galaxy s7 | 120 | 2016
galaxy note 3 | 120 | 2016
moto x | 120 | 2013
nexus 5x | 120 | 2015
oneplus 3t | 130 | 2016
blackberry key one | 130 | 2017
moto e (2g) | 140 | 2015
moto g4 play | 140 | 2017
moto g4 plus | 140 | 2016
google pixel | 140 | 2016
samsung galaxy avant | 150 | 2014
asus zenfone3 max | 150 | 2016
sony xperia z5 compact | 150 | 2015
htc one m4 | 160 | 2013
galaxy s4 mini | 170 | 2013
lg k4 | 180 | 2016
packet | 190 |
htc rezound | 240 | 2011
palm pilot 1000 | 490 | 1996
kindle oasis 2 | 570 | 2017
kindle paperwhite 3 | 630 | 2015
kindle 4 | 860 | 2011

As above, the results are sorted by latency and color-coded from green to yellow to red to black as devices get slower. Also as above, the year gets purple-er (and darker) as the device gets older.

If we exclude the game boy color, which is a different class of device than the rest, all of the quickest devices are Apple phones or tablets. The next quickest device is the blackberry q10. Although we don’t have enough data to really tell why the blackberry q10 is unusually quick for a non-Apple device, one plausible guess is that it’s helped by having actual buttons, which are easier to implement with low latency than a touchscreen. The other two devices with actual buttons are the gameboy color and the kindle 4.

After the iphones and non-kindle button devices, we have a variety of Android devices of various ages. At the bottom, we have the ancient palm pilot 1000 followed by the kindles. The palm is hamstrung by a touchscreen and display created in an era with much slower touchscreen technology and the kindles use e-ink displays, which are much slower than the displays used on modern phones, so it’s not surprising to see those devices at the bottom.

Why is the apple 2e so fast?

Compared to a modern computer that’s not the latest ipad pro, the apple 2 has significant advantages on both the input and the output, and it also has an advantage between the input and the output for all but the most carefully written code since the apple 2 doesn’t have to deal with context switches, buffers involved in handoffs between different processes, etc.

On the input, if we look at modern keyboards, it’s common to see them scan their inputs at 100 Hz to 200 Hz (e.g., the ergodox claims to scan at 167 Hz). By comparison, the apple 2e effectively scans at 556 Hz. See appendix for details.

If we look at the other end of the pipeline, the display, we can also find latency bloat there. I have a display that advertises 1 ms switching on the box, but if we look at how long it takes for the display to actually show a character from when you can first see the trace of it on the screen until the character is solid, it can easily be 10 ms. You can even see this effect with some high-refresh-rate displays that are sold on their allegedly good latency.

At 144 Hz, each frame takes 7 ms. A change to the screen will have 0 ms to 7 ms of extra latency as it waits for the next frame boundary before getting rendered (on average, we expect half of the maximum latency, or 3.5 ms). On top of that, even though my display at home advertises a 1 ms switching time, it actually appears to take 10 ms to fully change color once the display has started changing color. When we add up the latency from waiting for the next frame to the latency of an actual color change, we get an expected latency of 7/2 + 10 = 13.5 ms.

With the old CRT in the apple 2e, we’d expect half of a 60 Hz refresh (16.7 ms / 2) plus a negligible delay, or 8.3 ms. That’s hard to beat today: a state of the art “gaming monitor” can get the total display latency down into the same range, but in terms of marketshare, very few people have such displays, and even displays that are advertised as being fast aren’t always actually fast.
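
The display-side arithmetic above fits in a few lines; this is just the article's numbers restated under the assumption that display latency is, on average, half a refresh interval plus the panel's actual switching time (the function name is mine):

    def expected_display_latency_ms(refresh_hz, switch_ms):
        # half a frame waiting for the next refresh, plus the time it takes
        # the panel to finish changing color
        return (1000.0 / refresh_hz) / 2 + switch_ms

    print(expected_display_latency_ms(144, 10))  # ~13.5 ms, the 144 Hz LCD above
    print(expected_display_latency_ms(60, 0))    # ~8.3 ms, the apple 2e's 60 Hz CRT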

iOS rendering pipeline

If we look at what’s happening between the input and the output, the differences between a modern system and an apple 2e are too many to describe without writing an entire book. To get a sense of the situation in modern machines, here’s former iOS/UIKit engineer Andy Matuschak’s high-level sketch of what happens on iOS, which he says should be presented with the disclaimer that “this is my out of date memory of out of date information”:

  • hardware has its own scanrate (e.g. 120 Hz for recent touch panels), so that can introduce up to 8 ms latency
  • events are delivered to the kernel through firmware; this is relatively quick but system scheduling concerns may introduce a couple ms here
  • the kernel delivers those events to privileged subscribers (here, backboardd) over a mach port; more scheduling loss possible
  • backboardd must determine which process should receive the event; this requires taking a lock against the window server, which shares that information (a trip back into the kernel, more scheduling delay)
  • backboardd sends that event to the process in question; more scheduling delay possible before it is processed
  • those events are only dequeued on the main thread; something else may be happening on the main thread (e.g. as result of a timer or network activity), so some more latency may result, depending on that work
  • UIKit introduced 1-2 ms event processing overhead, CPU-bound
  • application decides what to do with the event; apps are poorly written, so usually this takes many ms. the consequences are batched up in a data-driven update which is sent to the render server over IPC
    • If the app needs a new shared-memory video buffer as a consequence of the event, which will happen anytime something non-trivial is happening, that will require round-trip IPC to the render server; more scheduling delays
    • (trivial changes are things which the render server can incorporate itself, like affine transformation changes or color changes to layers; non-trivial changes include anything that has to do with text, most raster and vector operations)
    • These kinds of updates often end up being triple-buffered: the GPU might be using one buffer to render right now; the render server might have another buffer queued up for its next frame; and you want to draw into another. More (cross-process) locking here; more trips into kernel-land.
  • the render server applies those updates to its render tree (a few ms)
  • every N Hz, the render tree is flushed to the GPU, which is asked to fill a video buffer
    • Actually, though, there’s often triple-buffering for the screen buffer, for the same reason I described above: the GPU’s drawing into one now; another might be being read from in preparation for another frame
  • every N Hz, that video buffer is swapped with another video buffer, and the display is driven directly from that memory
    • (this N Hz isn’t necessarily ideally aligned with the preceding step’s N Hz)

Andy says “the actual amount of work happening here is typically quite small. A few ms of CPU time. Key overhead comes from:”

  • periodic scanrates (input device, render server, display) imperfectly aligned
  • many handoffs across process boundaries, each an opportunity for something else to get scheduled instead of the consequences of the input event
  • lots of locking, especially across process boundaries, necessitating trips into kernel-land

By comparison, on the Apple 2e, there basically aren’t handoffs, locks, or process boundaries. Some very simple code runs and writes the result to the display memory, which causes the display to get updated on the next scan.

Refresh rate vs. latency

One thing that’s curious about the computer results is the impact of refresh rate. We get a 90 ms improvement from going from 24 Hz to 165 Hz. At 24 Hz each frame takes 41.67 ms and at 165 Hz each frame takes 6.061 ms. As we saw above, if there weren’t any buffering, we’d expect the average latency added by frame refreshes to be 20.8 ms in the former case and 3.03 ms in the latter case (because we’d expect to arrive at a uniform random point in the frame and have to wait between 0 ms and the full frame time), which is a difference of about 18 ms. But the difference is actually 90 ms, implying we have latency equivalent to (90 - 18) / (41.67 - 6.061) = 2 buffered frames.

If we plot the results from the other refresh rates on the same machine (not shown), we can see that they’re roughly in line with a “best fit” curve that we get if we assume that, for that machine running powershell, we get 2.5 frames worth of latency regardless of refresh rate. This lets us estimate what the latency would be if we equipped this low latency gaming machine with an infinity Hz display -- we’d expect latency to be 140 - 2.5 * 41.67 = 36 ms, almost as fast as quick but standard machines from the 70s and 80s.
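
As a sanity check, here's the same back-of-the-envelope calculation in a few lines of Python, using the 24 Hz (140 ms) and 165 Hz (50 ms) measurements from the table; the variable names are mine:

    frame_24 = 1000 / 24     # ~41.67 ms per frame at 24 Hz
    frame_165 = 1000 / 165   # ~6.06 ms per frame at 165 Hz

    measured_delta = 140 - 50                      # 90 ms difference in measured latency
    vsync_only_delta = (frame_24 - frame_165) / 2  # ~17.8 ms if we only waited for vsync
    extra_frames = (measured_delta - vsync_only_delta) / (frame_24 - frame_165)
    print(round(extra_frames, 1))                  # ~2 extra buffered frames

    # With ~2.5 frames of total frame-related latency, an "infinity Hz" display
    # would leave roughly 140 - 2.5 * 41.67 ~= 36 ms.
    print(round(140 - 2.5 * frame_24))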

Complexity

Almost every computer and mobile device that people buy today is slower than common models of computers from the 70s and 80s. Low-latency gaming desktops and the ipad pro can get into the same range as quick machines from thirty to forty years ago, but most off-the-shelf devices aren’t even close.

If we had to pick one root cause of latency bloat, we might say that it’s because of “complexity”. Of course, we all know that complexity is bad. If you’ve been to a non-academic non-enterprise tech conference in the past decade, there’s a good chance that there was at least one talk on how complexity is the root of all evil and we should aspire to reduce complexity.

Unfortunately, it's a lot harder to remove complexity than to give a talk saying that we should remove complexity. A lot of the complexity buys us something, either directly or indirectly. When we looked at the input of a fancy modern keyboard vs. the apple 2 keyboard, we saw that using a relatively powerful and expensive general purpose processor to handle keyboard inputs can be slower than dedicated logic for the keyboard, which would both be simpler and cheaper. However, using the processor gives people the ability to easily customize the keyboard, and also pushes the problem of “programming” the keyboard from hardware into software, which reduces the cost of making the keyboard. The more expensive chip increases the manufacturing cost, but considering how much of the cost of these small-batch artisanal keyboards is the design cost, it seems like a net win to trade manufacturing cost for ease of programming.

We see this kind of tradeoff in every part of the pipeline. One of the biggest examples of this is the OS you might run on a modern desktop vs. the loop that’s running on the apple 2. Modern OSes let programmers write generic code that can deal with having other programs simultaneously running on the same machine, and do so with pretty reasonable general performance, but we pay a huge complexity cost for this and the handoffs involved in making this easy result in a significant latency penalty.

A lot of the complexity might be called accidental complexity, but most of that accidental complexity is there because it’s so convenient. At every level from the hardware architecture to the syscall interface to the I/O framework we use, we take on complexity, much of which could be eliminated if we could sit down and re-write all of the systems and their interfaces today, but it’s too inconvenient to re-invent the universe to reduce complexity and we get benefits from economies of scale, so we live with what we have.

For those reasons and more, in practice, the solution to poor performance caused by “excess” complexity is often to add more complexity. In particular, the gains we’ve seen that get us back to the quickness of the quickest machines from thirty to forty years ago have come not from listening to exhortations to reduce complexity, but from piling on more complexity.

The ipad pro is a feat of modern engineering; the engineering that went into increasing the refresh rate on both the input and the output as well as making sure the software pipeline doesn’t have unnecessary buffering is complex! The design and manufacture of high-refresh-rate displays that can push system latency down is also non-trivially complex in ways that aren’t necessary for bog standard 60 Hz displays.

This is actually a common theme when working on latency reduction. A common trick to reduce latency is to add a cache, but adding a cache to a system makes it more complex. For systems that generate new data and can’t tolerate a cache, the solutions are often even more complex. An example of this might be large scale RoCE deployments. These can push remote data access latency from the millisecond range down to the microsecond range, which enables new classes of applications. However, this has come at a large cost in complexity. Early large-scale RoCE deployments easily took tens of person years of effort to get right and also came with a tremendous operational burden.

Conclusion

It’s a bit absurd that a modern gaming machine running at 4,000x the speed of an apple 2, with a CPU that has 500,000x as many transistors (with a GPU that has 2,000,000x as many transistors) can maybe manage the same latency as an apple 2 in very carefully coded applications if we have a monitor with nearly 3x the refresh rate. It’s perhaps even more absurd that the default configuration of the powerspec g405, which had the fastest single-threaded performance you could get until October 2017, had more latency from keyboard-to-screen (approximately 3 feet, maybe 10 feet of actual cabling) than sending a packet around the world (16187 mi from NYC to Tokyo to London and back to NYC as the crow flies; the actual fiber route is longer because it isn't cost effective to run fiber along the shortest possible path).

On the bright side, we’re arguably emerging from the latency dark ages and it’s now possible to assemble a computer or buy a tablet with latency that’s in the same range as you could get off-the-shelf in the 70s and 80s. This reminds me a bit of the screen resolution & density dark ages, where CRTs from the 90s offered better resolution and higher pixel density than affordable non-laptop LCDs until relatively recently. 4k displays have now become normal and affordable 8k displays are on the horizon, blowing past anything we saw on consumer CRTs. I don’t know that we’ll see the same kind of improvement with respect to latency, but one can hope. There are individual developers improving the experience for people who use certain, very carefully coded, applications, but it's not clear what force could cause a significant improvement in the default experience most users see.

Appendix: why measure latency?

Latency matters! For very simple tasks, people can perceive latencies down to 2 ms or less. Moreover, increasing latency is not only noticeable to users, it causes users to execute simple tasks less accurately. If you want a visual demonstration of what latency looks like and you don’t have a super-fast old computer lying around, check out this MSR demo on touchscreen latency.

The most commonly cited document on response time is the nielsen group article on response times, which claims that latencies below 100 ms feel equivalent and are perceived as instantaneous. One easy way to see that this is false is to go into your terminal and try sleep 0; echo "pong" vs. sleep 0.1; echo "test" (or for that matter, try playing an old game that doesn't have latency compensation, like quake 1, with 100 ms ping, or even 30 ms ping, or try typing in a terminal with 30 ms ping). For more info on this and other latency fallacies, see this document on common misconceptions about latency.

Throughput also matters, but this is widely understood and measured. If you go to pretty much any mainstream review or benchmarking site, you can find a wide variety of throughput measurements, so there’s less value in writing up additional throughput measurements.

Appendix: apple 2 keyboard

The apple 2e, instead of using a programmed microcontroller to read the keyboard, uses a much simpler custom chip designed for reading keyboard input, the AY 3600. If we look at the AY 3600 datasheet, we can see that the scan time is (90 * 1/f) and the debounce time is listed as strobe_delay. These quantities are determined by some capacitors and a resistor, which appear to be 47 pF, 100k ohms, and 0.022 uF for the Apple 2e. Plugging these numbers into the AY3600 datasheet, we can see that f = 50 kHz, giving us a 1.8 ms scan delay and a 6.8 ms debounce delay (assuming the values are accurate -- capacitors can degrade over time, so we should expect the real delays to be shorter on our old Apple 2e), giving us less than 8.6 ms for the internal keyboard logic.

Comparing to a keyboard with a 167 Hz scan rate that scans two extra times to debounce, the equivalent figure is 3 * 6 ms = 18 ms. With a 100Hz scan rate, that becomes 3 * 10 ms = 30 ms. 18 ms to 30 ms of keyboard scan plus debounce latency is in line with what we saw when we did some preliminary keyboard latency measurements.
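
A minimal sketch of the keyboard-side arithmetic, using the numbers above (the 6.8 ms debounce figure is taken as given rather than re-derived from the RC values; function names are mine):

    def apple2e_keyboard_ms(f_hz=50_000, debounce_ms=6.8):
        scan_ms = 90 * 1000 / f_hz    # AY 3600 scan time is 90 clock periods: 1.8 ms at 50 kHz
        return scan_ms + debounce_ms  # ~8.6 ms total

    def modern_keyboard_ms(scan_hz, scans_to_debounce=3):
        # one scan to see the keypress plus two more to debounce
        return scans_to_debounce * 1000 / scan_hz

    print(apple2e_keyboard_ms())    # ~8.6 ms
    print(modern_keyboard_ms(167))  # ~18 ms
    print(modern_keyboard_ms(100))  # 30 ms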

For reference, the ergodox uses a 16 MHz microcontroller with ~80k transistors and the apple 2e CPU is a 1 MHz chip with 3.5k transistors.

Appendix: why should android phones have higher latency than old apple phones?

As we've seen, raw processing power doesn't help much with many of the causes of latency in the pipeline, like handoffs between different processes, so an android phone with a 10x more powerful processor than an ancient iphone isn't guaranteed to be quicker to respond, even if it can render javascript-heavy pages faster.

If you talk to people who work on non-Apple mobile CPUs, you'll find that they run benchmarks like dhrystone (a synthetic benchmark that was irrelevant even when it was created, in 1984) and SPEC2006 (an updated version of a workstation benchmark that was relevant in the 90s and perhaps even as late as the early 2000s if you care about workstation workloads, which are completely different from mobile workloads). This is a problem where the vendor who makes the component has an intermediate target that's only weakly correlated with the actual user experience. I've heard that there are people working on the pixel phones who care about end-to-end latency, but it's difficult to get good latency when you have to use components that are optimized for things like dhrystone and SPEC2006.

If you talk to people at Apple, you'll find that they're quite cagey, but that they've been targeting the end-to-end user experience for quite a long time and that they can do "full stack" optimizations that are difficult for android vendors to pull off. Such optimizations aren't literally impossible for anyone else, but making a change to a chip that has to be threaded up through the OS is something you're very unlikely to see unless google is doing the optimization, and google hasn't really been serious about the end-to-end experience until recently.

Having relatively poor performance in aspects that aren't measured is a common theme and one we saw when we looked at terminal latency. Prior to examining terminal latency, public benchmarks were all throughput oriented and the terminals that prioritized performance worked on increasing throughput, even though increasing terminal throughput isn't really useful. After those terminal latency benchmarks, some terminal authors looked into their latency and found places they could trim down buffering and remove latency. You get what you measure.

Appendix: experimental setup

Most measurements were taken with the 240fps camera (4.167 ms resolution) in the iPhone SE. Devices with response times below 40 ms were re-measured with a 1000fps camera (1 ms resolution), the Sony RX100 V in PAL mode. Results in the tables are the results of multiple runs and are rounded to the nearest 10 ms to avoid the impression of false precision. For desktop results, results are measured from when the key started moving until the screen finished updating. Note that this is different from most key-to-screen-update measurements you can find online, which typically use a setup that effectively removes much or all of the keyboard latency, which, as an end-to-end measurement, is only realistic if you have a psychic link to your computer (this isn't to say the measurements aren't useful -- if, as a programmer, you want a reproducible benchmark, it's nice to reduce measurement noise from sources that are beyond your control, but that's not relevant to end users). People often advocate measuring from one of: {the key bottoming out, the tactile feel of the switch}. Other than for measurement convenience, there appears to be no reason to do either of these, but people often claim that's when the user expects the keyboard to "really" work. But these are independent of when the switch actually fires. Both the distance between the key bottoming out and activation as well as the distance between feeling feedback and activation are arbitrary and can be tuned. See this post on keyboard latency measurements for more info on keyboard fallacies.

Another significant difference is that measurements were done with settings as close to the default OS settings as possible since approximately 0% of users will futz around with display settings to reduce buffering, disable the compositor, etc. Waiting until the screen has finished updating is also different from what most end-to-end measurements do -- most consider the update "done" when any movement has been detected on the screen. Waiting until the screen is finished changing is analogous to webpagetest's "visually complete" time.

Computer results were taken using the “default” terminal for the system (e.g., powershell on windows, lxterminal on lubuntu), which could easily cause a 20 ms to 30 ms difference between a fast terminal and a slow terminal. Between measuring time in a terminal and measuring the full end-to-end time, measurements in this article should be slower than measurements in other, similar, articles (which tend to measure time to first change in games).

The powerspec g405 baseline result is using integrated graphics (the machine doesn’t come with a graphics card) and the 60 Hz result is with a cheap video card. The baseline result was at 30 Hz because the integrated graphics only supports hdmi output and the display it was attached to only runs at 30 Hz over hdmi.

Mobile results were done by using the default browser, browsing to https://danluu.com, and measuring the latency from finger movement until the screen first updates to indicate that scrolling has occurred. In the cases where this didn’t make sense, (kindles, gameboy color, etc.), some action that makes sense for the platform was taken (changing pages on the kindle, pressing the joypad on the gameboy color in a game, etc.). Unlike with the desktop/laptop measurements, this end-time for the measurement was on the first visual change to avoid including many frames of scrolling. To make the measurement easy, the measurement was taken with a finger on the touchscreen and the timer was started when the finger started moving (to avoid having to determine when the finger first contacted the screen).

In the case of “ties”, results are ordered by the unrounded latency as a tiebreaker, but this shouldn’t be considered significant. Differences of 10 ms should probably also not be considered significant.

The custom haswell-e was tested with gsync on and there was no observable difference. The year for that box is somewhat arbitrary, since the CPU is from 2014, but the display is newer (I believe you couldn’t get a 165 Hz display until 2015).

The number of transistors for some modern machines is a rough estimate because exact numbers aren’t public. Feel free to ping me if you have a better estimate!

The color scales for latency and year are linear and the color scales for clock speed and number of transistors are log scale.

All Linux results were done with a pre-KPTI kernel. It's possible that KPTI will impact user perceivable latency.

Measurements were done as cleanly as possible (without other things running on the machine/device when possible, with a device that was nearly full on battery for devices with batteries). Latencies when other software is running on the device or when devices are low on battery might be much higher.

If you want a reference to compare the kindle against, a moderately quick page turn in a physical book appears to be about 200 ms.

This is a work in progress. I expect to get benchmarks from a lot more old computers the next time I visit Seattle. If you know of old computers I can test in the NYC area (that have their original displays or something like them), let me know! If you have a device you’d like to donate for testing, feel free to mail it to

Dan Luu
Recurse Center
455 Broadway, 2nd Floor
New York, NY 10013

Thanks to RC, David Albert, Bert Muthalaly, Christian Ternus, Kate Murphy, Ikhwan Lee, Peter Bhat Harkins, Leah Hanson, Alicia Thilani Singham Goodwin, Amy Huang, Dan Bentley, Jacquin Mininger, Rob, Susan Steinman, Raph Levien, Max McCrea, Peter Town, Jon Cinque, Anonymous, and Jonathan Dahan for donating devices to test and thanks to Leah Hanson, Andy Matuschak, Milosz Danczak, amos (@fasterthanlime), @emitter_coupled, Josh Jordan, mrob, and David Albert for comments/corrections/discussion.

How good are decisions? Evaluating decision quality in domains where evaluation is easy

2017-11-21 08:00:00

A statement I commonly hear in tech-utopian circles is that some seeming inefficiency can’t actually be inefficient because the market is efficient and inefficiencies will quickly be eliminated. A contentious example of this is the claim that companies can’t be discriminating because the market is too competitive to tolerate discrimination. A less contentious example is that when you see a big company doing something that seems bizarrely inefficient, maybe it’s not inefficient and you just lack the information necessary to understand why the decision was efficient. These kinds of statements are often accompanied by statements about how "incentives matter" or the CEO has "skin in the game" whereas the commentator does not.

Unfortunately, arguments like this are difficult to settle because, even in retrospect, it’s usually not possible to get enough information to determine the precise “value” of a decision. Even in cases where the decision led to an unambiguous success or failure, there are so many factors that led to the result that it’s difficult to figure out precisely why something happened.

In this post, we'll look at two classes of examples where we can see how good people's decisions are and how they respond to easy-to-obtain data showing that people are making bad decisions. Both classes of examples are from domains where the people making or discussing the decision seem to care a lot about the decision and the data clearly show that the decisions are very poor.

The first class of example comes from sports and the second comes from board games. One nice thing about sports is that they often have detailed play-by-play data and well-defined win criteria, which lets us tell, on average, what the expected value of a decision is. In this post, we’ll look at the cost of bad decision making in one sport and then briefly discuss why decision quality in sports might be the same as or better than decision quality in other fields. Sports are fertile ground because decision making was non-data-driven and generally terrible until fairly recently, so we have over a century of information for major U.S. sports and, for a decent fraction of that time period, fans would write analyses about how poor decision making was and how much it cost teams, which teams would ignore (this has since changed and basically every team has a staff of stats-PhDs or the equivalent looking at data).

Baseball

In another post, we looked at how "hiring" decisions in sports were total nonsense. In this post, because one of the top "rationality community" thought leaders gave the common excuse that in-game baseball decision making by coaches isn't that costly ("Do bad in-game decisions cost games? Absolutely. But not that many games. Maybe they lose you 4 a year out of 162."; the entire post implies this isn't a big deal and it's fine to throw away 4 games), we'll look at how costly bad decision making is and how much teams spend to buy an equivalent number of wins in other ways. Here we'll focus on baseball, but you could do the same kind of analysis for football, hockey, basketball, etc., and my understanding is that you’d get a roughly similar result in all of those cases.

We’re going to model baseball as a state machine, both because that makes it easy to understand the expected value of particular decisions and because this lets us talk about the value of decisions without having to go over most of the rules of baseball.

We can treat each baseball game as an independent event. In each game, two teams play against each other and the team that scores more runs (points) wins. Each game is split into 9 “innings” and in each inning each team will get one set of chances on offense. In each inning, each team will play until it gets 3 “outs”. Any given play may or may not result in an out.

One chunk of state in our state machine is the number of outs and the inning. The other chunks of state we’re going to track are who’s “on base” and which player is “at bat”. Each team defines some order of batters for its active players and, after each player bats once, this repeats in a loop until the team collects 3 outs and the inning is over. The state of who is at bat is saved between innings. Just for example, you might see batters 1-5 bat in the first inning, 6-9 and then 1 again in the second inning, batters 2 onward in the third inning, etc.

When a player is at bat, the player may advance to a base and players who are on base may also advance, depending on what happens. When a player advances 4 bases (that is, through 1B, 2B, 3B, to what would be 4B except that it isn’t called that) a run is scored and the player is removed from the base. As mentioned above, various events may cause a player to be out, in which case they also stop being on base.

An example state from our state machine is:

{1B, 3B; 2 outs}

This says that there’s a player on 1B, a player on 3B, and two outs. Note that this is independent of the score, who’s actually playing, and the inning.

Another state is:

{--; 0 outs}

With a model like this, if we want to determine the expected value of the above state, we just need to look up the total number of runs across all innings played in a season divided by the number of innings to find the expected number of runs from the state above (ignoring the 9th inning because a quirk of baseball rules distorts statistics from the 9th inning). If we do this, we find that, from the above state, a team will score .555 runs in expectation.
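
As a sketch of how that lookup could be computed from play-by-play data (the data format here is hypothetical: each inning is a list of (state, runs scored on that play) events, and the function name is mine), one could do something like:

    from collections import defaultdict

    def run_expectancy(innings):
        """Average runs scored from each state to the end of its inning."""
        totals = defaultdict(lambda: [0.0, 0])  # state -> [total runs, count]
        for inning in innings:
            # runs scored from each point to the end of the inning (suffix sums)
            runs_to_end = 0
            seen = []
            for state, runs_on_play in reversed(inning):
                runs_to_end += runs_on_play
                seen.append((state, runs_to_end))
            for state, runs in seen:
                totals[state][0] += runs
                totals[state][1] += 1
        return {state: total / n for state, (total, n) in totals.items()}

    # e.g. run_expectancy(play_by_play)[("--", 0)] would come out around .555
    # for 1999-2002 data (excluding 9th innings).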

We can then compute the expected number of runs for all of the other states:

bases \ outs | 0 | 1 | 2
-- | .555 | .297 | .117
1B | .953 | .573 | .251
2B | 1.189 | .725 | .344
3B | 1.482 | .983 | .387
1B,2B | 1.573 | .971 | .466
1B,3B | 1.904 | 1.243 | .538
2B,3B | 2.052 | 1.467 | .634
1B,2B,3B | 2.417 | 1.650 | .815

In this table, each entry is the expected number of runs from the remainder of the inning from some particular state. Each column shows the number of outs and each row shows the state of the bases. The color coding scheme is: the starting state (.555 runs) has a white background. States with higher run expectation are more blue and states with lower run expectation are more red.

This table and the other stats in this post come from The Book by Tango et al., which mostly discussed baseball between 1999 and 2002. See the appendix if you're curious about how things change if we use a more detailed model.

The state we’re tracking for an inning here is who’s on base and the number of outs. Innings start with nobody on base and no outs.

As above, we see that we start the inning with .555 runs in expectation. If a play puts someone on 1B without getting an out, we now have .953 runs in expectation, i.e., putting someone on first without an out is worth .953 - .555 = .398 runs.

This immediately gives us the value of some decisions, e.g., trying to “steal” 2B with no outs and someone on first. If we look at cases where the batter’s state doesn’t change, a successful steal moves us to the {2B, 0 outs} state, i.e., it gives us 1.189 - .953 = .236 runs. A failed steal moves us to the {--, 1 out} state, i.e., it gives us .953 - .297 = -.656 runs. To break even, we need to succeed .656 / .236 = 2.78x more often than we fail, i.e., we need a .735 success rate to break even. If we want to compute the average value of a stolen base, we can compute the weighted sum over all states, but for now, let’s just say that it’s possible to do so and that you need something like a .735 success rate for stolen bases to make sense.

We can then look at the stolen base success rate of teams to see that, in any given season, maybe 5-10 teams are doing better than breakeven, leaving 20-25 teams at breakeven or below (mostly below). If we look at a bad but not historically bad stolen-base team of that era, they might have a .6 success rate. It wouldn’t be unusual for a team from that era to make between 100 and 200 attempts. Just so we can compute an approximation, if we assume they were all attempts from the {1B, 0 outs} state, the average run value per attempt would be .4 * (-.656) + .6 * .236 = -0.12 runs per attempt. Another first-order approximation is that a delta of 10 runs is worth 1 win, so at 100 attempts we have -1.2 wins and at 200 attempts we have -2.4 wins.
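
The steal arithmetic above is easy to reproduce directly from the run expectancy table (values in runs; the 10-runs-per-win conversion is the same first-order approximation used in the text):

    RE = {("--", 1): .297, ("1B", 0): .953, ("2B", 0): 1.189}

    steal_success = RE[("2B", 0)] - RE[("1B", 0)]   # +.236 runs
    steal_failure = RE[("--", 1)] - RE[("1B", 0)]   # -.656 runs

    # break-even success rate: p * .236 + (1 - p) * (-.656) = 0
    p_breakeven = -steal_failure / (steal_success - steal_failure)
    print(round(p_breakeven, 3))                    # ~.735

    # a team with a .6 success rate, approximating every attempt as coming
    # from {1B, 0 outs}, at 100 and 200 attempts, in wins (10 runs ~ 1 win)
    value_per_attempt = .6 * steal_success + .4 * steal_failure  # ~-0.12 runs
    for attempts in (100, 200):
        print(attempts, round(attempts * value_per_attempt / 10, 1), "wins")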

If we run the math across actual states instead of using the first order approximation, we see that the average failed steal is worth -.467 runs and the average successful steal is worth .175 runs. In that case, a steal attempt with a .6 success rate is worth .4 * (-.467) + .6 * .175 = -0.082 runs. With this new approximation, our estimate for the approximate cost in wins of stealing “as normal” vs. having a “no stealing” rule for a team that steals badly and often is .82 to 1.64 wins per season. Note that this underestimates the cost of stealing since getting into position to steal increases the odds of a successful “pickoff”, which we haven’t accounted for. From our state-machine standpoint, a pickoff is almost equivalent to a failed steal, but the analysis necessary to compute the difference in pickoff probability is beyond the scope of this post.

We can also do this for other plays coaches can cause (or prevent). For the “intentional walk”, we see that an intentional walk appears to be worth .102 runs for the opposing team. In 2002, a team that issued “a lot” of intentional walks might have issued 50, resulting in 50 * .102 runs for the opposing team, giving a loss of roughly 5 runs or .5 wins.

If we optimistically assume a “sac bunt” never fails, the cost of a sac bunt is .027 runs per attempt. If we look at the league where pitchers don’t bat, a team that was heavy on sac bunts might’ve done 49 sac bunts (we do this to avoid “pitcher” bunts, which add complexity to the approximation), costing a total of 49 * .027 = 1.32 runs or .132 wins.

Another decision that’s made by a coach is setting the batting order. Players bat (take a turn) in order, 1-9, mod 9. That is, when the 10th “player” is up, we actually go back around and the 1st player bats. At some point the game ends, so not everyone on the team ends up with the same number of “at bats”.

There’s a just-so story that justifies putting the fastest player first, someone with a high “batting average” second, someone pretty good third, your best batter fourth, etc. This story, or something like it, has been standard for over 100 years.

I’m not going to walk through the math for computing a better batting order because I don’t think there’s a short, easy to describe, approximation. It turns out that if we compute the difference between an “optimal” order and a “typical” order justified by the story in the previous paragraph, using an optimal order appears to be worth between 1 and 2 wins per season.

These approximations all leave out important information. In three out of the four cases, we assumed an average player at all times and didn't look at who was at bat. The information above actually takes this into account to some extent, but not fully. How exactly this differs from a better approximation is a long story and probably too much detail for a post that's using baseball to talk about decisions outside of baseball, so let's just say that we have a pretty decent but not amazing approximation that says that a coach who makes bad decisions following conventional wisdom, in the normal range of bad decisions for a baseball season, might cost their team something like 1 + 1.2 + .5 + .132 = 2.83 wins on these four decisions alone vs. a decision rule that says "never do these actions that, on average, have negative value". If we compare to a better decision rule such as "do these actions when they have positive value and not when they have negative value" or a manager that generally makes good decisions, let's conservatively estimate that's maybe worth 3 wins.

We’ve looked at four decisions (sac bunt, steal, intentional walk, and batting order). But there are a lot of other decisions! Let’s arbitrarily say that if we look at all decisions and not just these four decisions, having a better heuristic for all decisions might be worth 4 or 5 wins per season.

What does 4 or 5 wins per season really mean? One way to look at it is that baseball teams play 162 games, so an “average” team wins 81 games. If we look at the seasons covered, the number of wins that teams that made the playoffs had was {103, 94, 103, 99, 101, 97, 98, 95, 95, 91, 116, 102, 88, 93, 93, 92, 95, 97, 95, 94, 87, 91, 91, 95, 103, 100, 97, 97, 98, 95, 97, 94}. Because of the structure of the system, we can’t name a single number for a season and say that N wins are necessary to make the playoffs and that teams with fewer than N wins won’t make the playoffs, but we can say that 95 wins gives a team decent odds of making the playoffs. 95 - 81 = 14. 5 wins is more than a third of the difference between an average team and a team that makes the playoffs. This is a huge deal both in terms of prestige and direct economic value.

If we want to look at it at the margin instead of on average, the smallest delta in wins between teams that made the playoffs and teams that didn’t in each league was {1, 7, 8, 1, 6, 2, 6, 3}. For teams that are on the edge, a delta of 5 wins wouldn’t always be the difference between a successful season (making playoffs) and an unsuccessful season (not making playoffs), but there are teams within a 5 win delta of making the playoffs in most seasons. If we were actually running a baseball team, we’d want to use a much more fine-grained model, but as a first approximation we can say that in-game decisions are a significant factor in team performance and that, using some kind of computation, we can determine the expected cost of non-optimal decisions.

Another way to look at what 5 wins is worth is to look at what it costs to get a player who’s not a pitcher that’s 5 wins above average (WAA) (we look at non-pitchers because non-pitchers tend to play in every game and pitchers tend to play in parts of some games, making a comparison between pitchers and non-pitchers more complicated). Of the 8 non-pitcher positions (we look at non-pitcher positions because it makes comparisons simpler), there are 30 teams, so we have 240 team-position pairs. In 2002, of these 240 team-position pairs, there were two that were >= 5 WAA, Texas-SS (Alex Rodriguez, paid $22m) and SF-LF (Barry Bonds, paid $15m). If we look at the other seasons in the range of dates we’re looking at, there are either 2 or 3 team-position pairs where a team is able to get >= 5 WAA in a season. These aren't stable across seasons because player performance is volatile, so it's not as easy as finding someone great and paying them $15m. For example, in 2002, there were 7 non-pitchers paid $14m or more and only two of them were worth 5 WAA or more. For reference, the average total team payroll (teams have 26 players each) in 2002 was $67m, with a minimum of $34m and a max of $126m. At the time, a $1m salary for a manager would've been considered generous, making a 5 WAA manager an incredible deal.

5 WAA assumes typical decision making lining up with events in a bad, but not worst-case way. A more typical case might be that a manager costs a team 3 wins. In that case, in 2002, there were 25 team-position pairs out of 240 where a single player could make up for the loss caused by managing according to conventional wisdom. Players who provide that much value and who aren't locked up in artificially cheap deals with particular teams due to the mechanics of player transfers are still much more expensive than managers.

If we look at how teams have adopted data analysis in order to improve both in-game decision making and team-composition decisions, it’s been a slow, multi-decade, process. Moneyball describes part of the shift from using intuition and observation to select players to incorporating statistics into the process. Stats nerds were talking about how you could do this at least since 1971 and no team really took it seriously until the 90s and the ideas didn’t really become mainstream until the mid 2000s, after a bestseller had been published.

If we examine how much teams have improved at the in-game decisions we looked at here, the process has been even slower. It’s still true today that statistics-driven decisions aren’t mainstream. Things are getting better, and if we look at the aggregate cost of the non-optimal decisions mentioned here, the aggregate cost has been getting lower over the past couple decades as intuition-driven decisions slowly converge to more closely match what stats nerds have been saying for decades. For example, if we look at the total number of sac bunts recorded across all teams from 1999 until now, we see:

year | sac bunts
1999 | 1604
2000 | 1628
2001 | 1607
2002 | 1633
2003 | 1626
2004 | 1731
2005 | 1620
2006 | 1651
2007 | 1540
2008 | 1526
2009 | 1635
2010 | 1544
2011 | 1667
2012 | 1479
2013 | 1383
2014 | 1343
2015 | 1200
2016 | 1025
2017 | 925

Despite decades of statistical evidence that sac bunts are overused, we didn't really see a decline across all teams until 2012 or so. Why this is varies on a team-by-team and case-by-case basis, but the fundamental story that's been repeated over and over again, both for statistically-driven team composition and statistically-driven in-game decisions, is that the people who have the power to make decisions often stick to conventional wisdom instead of using "radical" statistically-driven ideas. There are a number of reasons as to why this happens. One high-level reason is that the change we're talking about was a cultural change and cultural change is slow. Even as this change was happening and teams that were more data-driven were outperforming relative to their budgets, anti-data folks ridiculed anyone who was using data. If you were one of the early data folks, you'd have to be willing to tolerate a lot of the biggest names in the game calling you stupid, as well as fans, friends, etc. It doesn't surprise people when it takes a generation for scientific consensus to shift in the face of this kind of opposition, so why should baseball be any different?

One specific lower-level reason “obviously” non-optimal decisions can persist for so long is that there’s a lot of noise in team results. You sometimes see a manager make some radical decisions (not necessarily statistics-driven), followed by some poor results, causing management to fire the manager. There’s so much volatility that you can’t really judge players or managers based on small samples, but this doesn’t stop people from doing so. The combination of volatility and skepticism of radical ideas heavily disincentivizes going against conventional wisdom.

Among the many consequences of this noise is the fact that the winner of the "world series" (the baseball championship) is heavily determined by randomness. Whether or not a team makes the playoffs is determined over 162 games, which isn't enough to remove all randomness, but is enough that the result isn't mostly determined by randomness. This isn't true of the playoffs, which are too short for the outcome to be primarily determined by the difference in the quality of teams. Once a team wins the world series, people come up with all kinds of just-so stories to justify why the team should've won, but if we look across all games, we can see that the stories are just stories. This is, perhaps, not so different from listening to people tell you why their startup was successful.

There are metrics we can use that are better predictors of future wins and losses (i.e., are less volatile than wins and losses), but, until recently, convincing people that those metrics were meaningful was also a radical idea.

Board games

That's the baseball example. Now on to the board game example. In this example, we'll look at people who make comments on "modern" board game strategy, by which I mean they comment on strategy for games like Catan, Puerto Rico, Ark Nova, etc.

People often vehemently disagree about what works and what doesn't work. Today, most online discussions of this sort happen on boardgamegeek (BGG), which is, by far, the largest forum for discussing board games. A quirk of these discussions is that people often use the same username on BGG as on boardgamearena (BGA), an online board game site where players' Elo ratings are tracked and publicly visible.
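
For readers who haven't run into Elo-style ratings before, here's a minimal sketch of how such a rating responds to results. I haven't verified BGA's exact formula or K-factor, so the constants below are assumptions; the point is just that repeated losses mechanically pull the rating down toward a player's true strength against the field:

# A minimal sketch of an Elo-style rating. The 400 scale factor is the
# standard Elo convention; the K-factor below is an assumed value, not
# necessarily what BGA uses.
def expected_score(rating_a, rating_b):
    # Probability that player A beats player B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a, rating_b, a_won, k=24):
    # Return A's new rating after one game.
    actual = 1.0 if a_won else 0.0
    return rating_a + k * (actual - expected_score(rating_a, rating_b))

# A 1300-rated player is expected to beat a 1700-rated player about 9% of
# the time, so someone who keeps losing with strategy Y will see their
# rating dragged down game after game, no matter how they explain the losses.
print(round(expected_score(1300, 1700), 3))  # 0.091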

So, in these discussions, you'll see someone saying that strategy X is dominant. Then someone else will come in and say, no, strategy Y beats strategy X, I win with strategy Y all the time when people do strategy X, etc. If you understand the game, you'll see that the person arguing for X is correct and the person arguing for Y is wrong, and then you'll look up these people's Elos and find that the X-player is a high-ranked player and the Y-player is a low-ranked player.

The thing that's odd about this is: how come the low-ranked players so confidently argue that their position is correct? Not only do they get per-game information indicating that they're wrong (because they often lose), they have a rating that aggregates all of their gameplay and tells them, roughly, how good they are. Despite this rating telling them that they don't know what they're doing in the game, they're utterly convinced that they're strong players who are playing well, and that not only do they have good strategies, their strategies are good enough that they should be advising much higher rated players on how to play.

When people correct these folks, they often get offended because they're sure that they're good. They'll say things like "I'm a good [game name] player. I win a lot of games", followed by some indignation that their advice isn't taken seriously and/or huffy comments about how people who think strategy X works are all engaging in group think. This is despite these people playing in the same pool of competitive online players where, if it were true that strategy X players were engaging in incorrect group think, strategy Y players would beat them and have higher ratings. And, as we noted when we looked at video game skill, players often express great frustration and anger at losing and not being better at the game, so it's clear that they want to do better and win. But even having a rating that pretty accurately sums up your skill displayed on your screen at all times doesn't seem to be enough to get people to realize that they're, on average, making poor decisions and could easily make better decisions by taking advice from higher-rated players instead of insisting that their losing strategies work.

When looking at the video game Overwatch, we noted that players often overestimated their own skill and blamed teammates for losses. But in these kinds of board games, people are generally not playing on teams, so there's no one else to blame. And not only is there no teammate to blame, in most games the most serious rated format is 1v1 rather than some kind of multi-player FFA, so you can't even blame a random person who's not on your team. In general, someone's rating in a 1v1 game is about as accurate a metric as you're going to get for someone's domain-specific decision making skill in any domain.

And yet, people are extremely confident about their own skills despite their low ratings. If you look at board game strategy commentary today, almost all of it is wrong and, when you look up people's ratings, almost all of it comes from people who are low rated in every game they play, who don't appear to understand how to play any game well. Of course there's nothing inherently wrong with playing a game poorly if that's what someone enjoys. The incongruity here comes from people playing poorly, having a well-defined rating that shows that they're playing poorly, being convinced that they're playing well, and taking offence when people note that the strategies they advocate for don't work.

Life outside of games

In the world, it's rare to get evidence of the quality of our decision making that's as clear as we see in sports and board games. When making an engineering decision, you almost never have data that's as clean as you do in baseball, nor do you ever have an Elo rating that can basically accurately sum up how good your past decision making is. This makes it much easier to adjust to feedback and make good decisions in sports and board games, and yet we can observe that most decision making in sports and board games is poor. This was true basically forever in sports despite a huge amount of money being on the line, and is true in board games despite people getting quite worked up over them and seeming to care a lot.

If we think about the general version of the baseball decision we examined, what’s happening is that decisions have probabilistic payoffs. There’s very high variance in actual outcomes (wins and losses), so it’s possible to make good decisions and not see the direct effect of them for a long time. Even if there are metrics that give us a better idea of what the “true” value of a decision is, if you’re operating in an environment where your management doesn’t believe in those metrics, you’re going to have a hard time keeping your job (or getting a job in the first place) if you want to do something radical whose value is only demonstrated by some obscure-sounding metric, unless management takes a chance on you for a year or two. There have been some major phase changes in what metrics are accepted, but they’ve taken decades.

If we look at business or engineering decisions, the situation is much messier. If we look at product or infrastructure success as a “win”, there seems to be much more noise in whether or not a team gets a “win”. Moreover, unlike in baseball, the sort of play-by-play or even game data that would let someone analyze “wins” and “losses” to determine the underlying cause isn’t recorded, so it’s impossible to determine the true value of decisions. And even if the data were available, there are so many more factors that determine whether or not something is a “win” that it’s not clear if we’d be able to determine the expected value of decisions even if we had the data.

We’ve seen that in a field where one can sit down and determine the expected value of decisions, it can take decades for this kind of analysis to influence some important decisions. If we look at fields where it’s more difficult to determine the true value of decisions, how long should we expect it to take for “good” decision making to surface? It seems like it would be a while, perhaps forever, unless there’s something about the structure of baseball and other sports that makes it particularly difficult to remove a poor decision maker and insert a better decision maker.

One might argue that baseball is different because there are a fixed number of teams and it’s quite unusual for a new team to enter the market, but if you look at things like public clouds, operating systems, search engines, car manufacturers, etc., the situation doesn’t look that different. If anything, it appears to be much cheaper to take over a baseball team and replace management (you sometimes see baseball teams sell for roughly a billion dollars) and there are more baseball teams than there are competitive products in the markets we just discussed, at least in the U.S. One might also argue that, if you look at the structure of baseball teams, it’s clear that positions are typically not handed out based on decision-making merit and that other factors tend to dominate, but this doesn’t seem obviously more true in baseball than in engineering fields.

This isn’t to say that we expect obviously bad decisions everywhere. You might get that idea if you hung out on baseball stats nerd forums before Moneyball was published (and for quite some time after), but if you looked at formula 1 (F1) around the same time, you’d see teams employing PhDs who are experts in economics and game theory to make sure they were making reasonable decisions. This doesn’t mean that F1 teams always make perfect decisions, but they at least avoided making decisions that interested amateurs could identify as inefficient for decades. There are some fields where competition is cutthroat and you have to do rigorous analysis to survive and there are some fields where competition is more sedate. In living memory, there was a time when training for sports was considered ungentlemanly and someone who trained with anything resembling modern training techniques would’ve had a huge advantage. Over the past decade or so, we’re seeing the same kind of shift but for statistical techniques in baseball instead of training in various sports.

If we want to look at the quality of decision making, it's too simplistic to say that we expect a firm to make good decisions because they're exposed to markets and there's economic value in making good decisions and people within the firm will probably be rewarded greatly if they make good decisions. You can't even tell if this is happening by asking people if they're making rigorous, data-driven decisions. If you'd asked people in baseball whether they were using data in their decisions, they would've said yes throughout the 70s and 80s. Baseball has long been known as a sport where people track all kinds of numbers and then use those numbers. It's just that people didn't backtest their predictions, let alone backtest their predictions with holdouts.

The paradigm shift of using data effectively to drive decisions has been hitting different fields at different rates over the past few decades, both inside and outside of sports. Why this change happened in F1 before it happened in baseball is due to a combination of the difference in incentive structure in F1 teams vs. baseball teams and the difference in institutional culture. We may take a look at this in a future post, but this turns out to be a fairly complicated issue that requires a lot more background.

Looking at the overall picture, we could view this glass as being half empty (wow, people suck at making easy decisions that they consider very important, so they must be absolutely horrible at making non-easy decisions) or as being half full (wow, you can find good opportunities for improvement in many places, even in areas where econ 101 reasoning like "they must be making the right call because they're highly incentivized" could trick one into thinking that there aren't easy opportunities available).

Appendix: non-idealities in our baseball analysis

In order to make this a short blog post and not a book, there are a lot of simplifications in the approximation we discussed. One major simplification is the idea that all runs are equivalent. This is close enough to true that it’s a decent approximation most of the time, but there are situations where the approximation isn’t very good, such as when it’s the 9th inning and the game is tied. In that case, a decision that increases the probability of scoring 1 run but decreases the probability of scoring multiple runs is actually the right choice.

This is often given as a justification for a relatively late-game sac bunt. But if we look at the probability of a successful sac bunt, we see that it goes down in later innings. We didn’t talk about how the defense is set up, but defenses can set up in ways that reduce the probability of a successful sac bunt but increase the probability of success of non-bunts and vice versa. Before the last inning, this actually makes the sac bunt worse late in the game, not better! If we take all of that into account in the last inning of a tie game, the probability that a sac bunt is a good idea then depends on something else we haven’t discussed: the batter at the plate.

In our simplified model, we computed the expected value in runs across all batters. But at any given time, a particular player is batting. A successful sac bunt advances runners and increases the number of outs by one. The alternative is to let the batter “swing away”, which will result in some random outcome. The better the batter, the higher the probability of an outcome that’s better than the outcome of a sac bunt. To determine the optimal decision, we not only need to know how good the current batter is but how good the subsequent batters are. One common justification for the sac bunt is that pitchers are terrible hitters and they’re not bad at sac bunting because they have so much practice doing it (because they’re terrible hitters), but it turns out that pitchers are also below average sac bunters and that the argument that we should expect pitchers to sac because they’re bad hitters doesn’t hold up if we look at the data in detail.
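
To make the comparison concrete, here’s a toy version of the first-order calculation. The run-expectancy numbers are illustrative placeholders, not measured values; a real analysis would use the empirical run-expectancy matrix for the seasons in question:

# Toy first-order comparison of "sac bunt" vs. "swing away" with a runner on
# first and nobody out. The run-expectancy values below are placeholders.
RUN_EXP = {
    ("1B", 0): 0.90,  # runner on 1B, 0 out (placeholder)
    ("2B", 1): 0.70,  # runner on 2B, 1 out (placeholder)
    ("1B", 1): 0.55,  # runner on 1B, 1 out (placeholder)
}

def sac_bunt_ev(p_success=0.75):
    # A successful bunt trades an out for advancing the runner; a failed bunt
    # is modeled here (simplistically) as losing the lead runner.
    return p_success * RUN_EXP[("2B", 1)] + (1 - p_success) * RUN_EXP[("1B", 1)]

def swing_away_ev():
    # The run expectancy of the current state is already averaged over all
    # batting outcomes, so it's the EV of just letting the batter hit.
    return RUN_EXP[("1B", 0)]

print(sac_bunt_ev(), swing_away_ev())  # with these numbers, bunting loses EV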

Another reason to sac bunt (or bunt in general) is that the tendency to sometimes do this induces changes in defense which make non-bunt plays work better.

A full computation should also take into account the number of balls and strikes the current batter has (a piece of state we haven’t discussed at all), the speed of the batter and the runners on base, the particular stadium the game is being played in, the opposing pitcher, and the quality of the defense. All of this can be done, even on a laptop -- this is all “small data” as far as computers are concerned -- but walking through the analysis even for one particular decision would be substantially longer than everything in this post combined, including this disclaimer. It’s perhaps a little surprising that taking all of these non-idealities into account doesn’t overturn the general result, but it turns out that it doesn’t (it finds that there are many situations in which sac bunts have positive expected value, but that sac bunts were still heavily overused for decades).

There’s a similar situation for intentional walks, where the non-idealities in our analysis appear to support issuing intentional walks. In particular, the two main conventional justifications for an intentional walk are

  1. By walking the current batter, we can set up a “force” or a “double play” (increase the probability of getting one out or two outs in one play). If the game is tied in the last inning, putting another player on base has little downside and has the upside of increasing the probability of allowing zero runs and continuing the tie.
  2. By walking the current batter, we can get to the next, worse batter.

An example situation where people apply the justification in (1) is in the {1B, 3B; 2 out} state. The team that’s on defense will lose if the player at 3B advances one base. The reasoning goes, walking a player and changing the state to {1B, 2B, 3B; 2 out} won’t increase the probability that the player at 3B will score and end the game if the current batter “puts the ball into play”, and putting another player on base increases the probability that the defense will be able to get an out.

The hole in this reasoning is that the batter won’t necessarily put the ball into play. After the state is {1B, 2B, 3B; 2 out}, the pitcher may issue an unintentional walk, causing each runner to advance and losing the game. It turns out that being in this state doesn’t affect the probability of an unintentional walk very much. The pitcher tries very hard to avoid a walk but, at the same time, the batter tries very hard to induce a walk!

On (2), the two situations where the justification tends to be applied are when the current player at bat is good or great, or the current player is batting just before the pitcher. Let’s look at these two separately.

Barry Bonds’s 2001, 2002, and 2004 seasons were some of the statistically best seasons of all time and are as extreme a case as one can find in modern baseball. If we run our same analysis and account for the quality of the players batting after Bonds, we find that it’s sometimes the correct decision for the opposing team to intentionally walk Bonds, but it was still the case that most situations did not warrant an intentional walk and that Bonds was often intentionally walked in situations that didn’t warrant one. In the case of a batter who is not having one of the statistically best seasons on record in modern baseball, intentional walks are even harder to justify.

In the case of the pitcher batting, doing the same kind of analysis as above also reveals that there are situations where an intentional walk is appropriate (not late in the game, {1B, 2B; 2 out}, when the pitcher is not a significantly above average batter for a pitcher). Even though it’s not always the wrong decision to issue an intentional walk, the intentional walk is still grossly overused.

One might argue that the fact that our simple analysis has all of these non-idealities, any of which could have invalidated the analysis, is a sign that decision making in baseball wasn’t so bad after all, but I don’t think that holds. A first-order approximation that someone could do in an hour or two finds that decision making seems quite bad, on average. If a team was interested in looking at data, that ought to lead them into doing a more detailed analysis that takes into account the conventional-wisdom based critiques of the obvious one-hour analysis. It appears that this wasn’t done, at least not for decades.

The problem is that before people started running the numbers, all we had to go by were stories. Someone would say "with 2 outs, you should walk the batter before the pitcher [in some situations] to get to the pitcher and get the guaranteed out". Someone else might respond "we obviously shouldn't do that late game because the pitcher will get subbed out for a pinch hitter and early game, we shouldn't do it because even if it works and we get the easy out, it sets the other team up to lead off the next inning with their #1 hitter instead of an easy out". Which of these stories is the right story turns out to be an empirical question. The thing that I find most unfortunate is that, after people started running the numbers and the argument became one of stories vs. data, people persisted in sticking with the story-based argument for decades. We see the same thing in business and engineering, but it's arguably more excusable there because decisions in those areas tend to be harder to quantify. Even if you can reduce something to a simple engineering equation, someone can always argue that the engineering decision isn't what really matters and this other business concern that's hard to quantify is the most important thing.

Appendix: possession

Something I find interesting is that statistical analysis in football, baseball, and basketball has found that teams have overwhelmingly undervalued possessions for decades. Baseball doesn't have the concept of possession per se, but if you look at being on offense as "having possession" and getting 3 outs as "losing possession", it's quite similar.

In football, we see that maintaining possession is such a big deal that it is usually an error to punt on 4th down, but this hasn't stopped teams from punting by default basically forever. And in basketball, players who shoot a lot with a low shooting percentage were (and arguably still are) overrated.

I don't think this is fundamental -- that possessions are as valuable as they are comes out of the rules of each game. It's arbitrary. I still find it interesting, though.

Appendix: other analysis of management decisions

Bloom et al., Does management matter? Evidence from India looks at the impact of management interventions on productivity.

Other work by Bloom.

DellaVigna et al., Uniform pricing in US retail chains allegedly finds a significant amount of money left on the table by retail chains (seven percent of profits) and explores why that might happen and what the impacts are.

The upside of work like this vs. sports work is that it attempts to quantify the impact of things outside of a contrived game. The downside is that the studies are on things that are quite messy and it's hard to tell what the study actually means. Just for example, if you look at studies on innovation, economists often use patents as a proxy for innovation and then come to some conclusion based on some variable vs. number of patents. But if you're familiar with engineering patents, you'll know that number of patents is an incredibly poor proxy for innovation. In the hardware world, IBM is known for cranking out a very large number of useless patents (both in the sense of useless for innovation and also in the narrow sense of being useless as a counter-attack in patent lawsuits) and there are some companies that get much more mileage out of filing many fewer patents.

AFAICT, our options here are to know a lot about decisions in a context that's arguably completely irrelevant, or to have ambiguous information and probably know very little about a context that seems relevant to the real world. I'd love to hear about more studies in either camp (or even better, studies that don't have either problem).

Thanks to Leah Hanson, David Turner, Milosz Dan, Andrew Nichols, Justin Blank, @hoverbikes, Kate Murphy, Ben Kuhn, Patrick Collison, and an anonymous commenter for comments/corrections/discussion.

How out of date are Android devices?

2017-11-12 08:00:00

It's common knowledge that Android devices tend to be more out of date than iOS devices, but what does this actually mean? Let’s look at Android marketshare data to see how old devices in the wild are. The x axis of the plot below is date, and the y axis is Android marketshare. The share of all devices sums to 100% (with some artifacts because the public data Google provides is low precision).

Color indicates age:

  • blue: current (API major version)
  • yellow: 6 months
  • orange: 1 year
  • dark red: 2 years
  • bright red/white: 3 years
  • light grey: 4 years
  • grey: 5 years
  • black: 6 years or more

If we look at the graph, we see a number of reverse-S shaped contours; between each pair of contours, devices get older as we go from left to right. Each contour corresponds to the release of a new Android version and the associated devices running that Android version. As time passes, devices on that version get older. When a device is upgraded, it effectively moves from one contour into a new one and its color changes to a less outdated color.

Marketshare of outdated Android devices is increasing

There are three major ways in which this graph understates the number of outdated devices:

First, we’re using API version data for this and don’t have access to the marketshare of point releases and minor updates, so we assume that all devices on the same API version are up to date until the moment a new API version is released, but many (and perhaps most) devices won’t receive updates within an API version.

Second, this graph shows marketshare, but the number of Android devices has dramatically increased over time. For example, if we look at the 80%-ile most outdated devices (i.e., draw a line 20% up from the bottom), the 80%-ile device today is a few months more outdated than it was in 2014. The huge growth of Android means that there are many, many more outdated devices now than there were in 2014.

Third, this data comes from scraping Google Play Store marketshare info. That data shows marketshare of devices that have visited the Play Store in the last 7 days. In general, it seems reasonable to believe that devices that visit the Play Store are more up to date than devices that don’t, so we should expect an unknown amount of bias in this data that causes the graph to show that devices are newer than they actually are. This seems plausible both for devices that are used as conventional mobile devices as well as for mobile devices that have replaced things like traditionally embedded devices, PoS boxes, etc.

If we're looking at this from a security standpoint, some devices will receive security updates without updating their major version, which makes them look more out of date in this data than they actually are. However, when researchers have used more fine-grained data to see which devices are taking updates, they found that this was not a large effect.

One thing we can see from that graph is that, as time goes on, the world accumulates a larger fraction of old devices. This makes sense and we could have figured this out without looking at the data. After all, back at the beginning of 2010, Android phones couldn’t be much more than a year old, and now it’s possible to have Android devices that are nearly a decade old.

Something that wouldn’t have been obvious without looking at the data is that the uptake of new versions seems to be slowing down -- we can see this by looking at the last few contour lines at the top right of the graph, corresponding to the most recent Android releases. These lines have a shallower slope than the contour lines for previous releases. Unfortunately, with this data alone, we can’t tell why the slope is shallower. Some possible reasons might be:

  • Android growth is slowing down
  • Android device turnover (device upgrade rate) is slowing down
  • Fewer devices are receiving updates

Without more data, it’s impossible to tell how much each of these is contributing to the problem. BTW, let me know if you know of a reasonable source for the active number of Android devices going back to 2010! I’d love to produce a companion graph of the total number of outdated devices.

But even with the data we have, we can take a guess at how many outdated devices are in use. In May 2017, Google announced that there are over two billion active Android devices. If we look at the latest stats (the far right edge), we can see that nearly half of these devices are two years out of date. At this point, we should expect that there are more than one billion devices that are two years out of date! Given Android's update model, we should expect approximately 0% of those devices to ever get updated to a modern version of Android.

Percentiles

Since there’s a lot going on in the graph, we might be able to see something if we look at some subparts of the graph. If we look at a single horizontal line across the graph, that corresponds to the device age at a certain percentile:

Over time, the Nth percentile out of date device is getting more out of date

In this graph, the date is on the x axis and the age in months is on the y axis. Each line corresponds to a different percentile (higher percentile is older), which corresponds to a horizontal slice of the top graph at that percentile.

Each individual line seems to have two large phases (with some other stuff, too). There’s one phase where devices for that percentile get older as quickly as time is passing, followed by a phase where, on average, devices only get slightly older. In the second phase, devices sometimes get younger as new releases push younger versions into a certain percentile, but this doesn’t happen often enough to counteract the general aging of devices. Taken as a whole, this graph indicates that, if current trends continue, we should expect to see proportionally more old Android devices as time goes on, which is exactly what we’d expect from the first, busier, graph.
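
As a concrete sketch, here’s roughly how one could compute the age of the Nth-percentile most-outdated device from a single marketshare snapshot. The data format, shares, and release dates below are made up for illustration, and real data would be at finer granularity than API level:

from datetime import date

# Hypothetical marketshare snapshot for one date: each entry is an API
# level's release date and its share of devices. Every number here is a
# made-up placeholder.
snapshot_date = date(2017, 10, 1)
versions = [
    (date(2017, 8, 21), 0.002),
    (date(2016, 8, 22), 0.180),
    (date(2015, 10, 5), 0.300),
    (date(2014, 11, 12), 0.260),
    (date(2013, 10, 31), 0.140),
    (date(2012, 11, 13), 0.080),
    (date(2011, 10, 18), 0.038),
]

def age_at_percentile(versions, snapshot_date, pct):
    # Age, in months, of the device at the pct-th cumulative share counting
    # from the most outdated end (pct=0.2 is the "80%-ile most outdated"
    # device: roughly 20% of devices are even older).
    oldest_first = sorted(versions, key=lambda v: v[0])
    cumulative = 0.0
    for release_date, share in oldest_first:
        cumulative += share
        if cumulative >= pct:
            return (snapshot_date - release_date).days / 30.4
    return 0.0

print(age_at_percentile(versions, snapshot_date, 0.2))  # age in months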

Dates

Another way to look at the graph is to look at a vertical slice instead of a horizontal slice. In that case, each slice corresponds to looking at the ages of devices at one particular date:

In this plot, the x axis indicates the age percentile and the y axis indicates the raw age in months. Each line is one particular date, with older dates being lighter / yellower and newer dates being darker / greener.

As with the other views of the same data, we can see that Android devices appear to be getting more out of date as time goes on. This graph would be too busy to read if we plotted data for all of the dates that are available, but we can see it as an animation:

iOS

For reference, iOS 11 was released two months ago and it now has just under 50% iOS marketshare despite November’s numbers coming before the release of the iPhone X (this is compared to < 1% marketshare for the latest Android version, which was released in August). It’s overwhelmingly likely that, by the start of next year, iOS 11 will have more than 50% marketshare and there’s an outside chance that it will have 75% marketshare, i.e., it’s likely that the corresponding plot for iOS would have the 50%-ile (red) line in the second plot at age = 0 and it’s not implausible that the 75%-ile (orange) line would sometimes dip down to 0. As is the case with Android, there are some older devices that stubbornly refuse to update; iOS 9.3, released a bit over two years ago, sits at just a bit above 5% marketshare. This means that, in the iOS version of the plot, it’s plausible that we’d see the corresponding 99%-ile (green) line in the second plot at a bit over two years (half of what we see for the Android plot).

Windows XP

People sometimes compare Android to Windows XP because there are a large number of both in the wild and in both cases, most devices will not get security updates. However, this is tremendously unfair to Windows XP, which was released on 10/2001 and got security updates until 4/2014, twelve and a half years later. Additionally, Microsoft has released at least one security update after the official support period (there was an update in 5/2017 in response to the WannaCry ransomware). It's unfortunate that Microsoft decided to end support for XP while there are still so many XP boxes in the wild, but supporting an old OS for over twelve years and then issuing an emergency security patch after more than fifteen years puts Microsoft into a completely different league than Google and Apple when it comes to device support.

Another difference between Android and Windows is that Android's scale is unprecedented in the desktop world. There were roughly 200 million PCs sold in 2017. Samsung alone has been selling that many mobile devices per year since 2008. Of course, those weren't Android devices in 2008, but Android's dominance in the non-iOS mobile space means that, overall, those have mostly been Android devices. Today, we still see nearly 50 year old PDP-11 devices in use. There are few enough PDPs around that running into one is a cute, quaint surprise (0.6 million PDP-11s were sold). Desktop boxes age out of service more quickly than PDPs and mobile devices age out of service even more quickly, but the sheer difference in number of devices caused by the ubiquity of modern computing devices means that we're going to see many more XP-era PCs in use 50 years after the release of XP and it's plausible we'll see even more mobile devices around 50 years from now. Many of these ancient PDP, VAX, DOS, etc. boxes are basically safe because they're run in non-networked configurations, but it looks like the same thing is not going to be true for many of these old XP and Android boxes that are going to stay in service for decades.

Conclusion

We’ve seen that Android devices appear to be getting more out of date over time. This makes it difficult for developers to target “new” Android API features, where new means anything introduced in the past few years. It also means that there are a lot of Android devices out there that are behind in terms of security. This is true both in absolute terms and also relative to iOS.

Until recently, Android was directly tied to the hardware it ran on, making it very painful to keep old devices up to date because that requires a custom Android build with phone-specific (or at least SoC-specific) work. Google claims that this problem is fixed in the latest Android version (8.0, Oreo). People who remember Google's "Android update alliance" announcement in 2011 may be a bit skeptical of the more recent announcement. In 2011, Google and U.S. carriers announced that they'd keep devices up to date for 18 months, which mostly didn't happen. However, even if the current announcement isn't smoke and mirrors and the latest version of Android solves the update problem, we've seen that it takes years for Android releases to get adopted and we've also seen that the last few Android releases have significantly slower uptake than previous releases. Additionally, even though this is supposed to make updates easier, it looks like Android is still likely to stay behind iOS in terms of updates for a while. Google has promised that its latest phone (Pixel 2, 10/2017) will get updates for three years. That seems like a step in the right direction, but as we’ve seen from the graphs above, extending support by a year isn’t nearly enough to keep most Android devices up to date. By contrast, if you have an iPhone, the latest version of iOS (released 9/2017) works on devices back to the iPhone 5S (released 9/2013).

If we look at the newest Android release (8.0, 8/2017), it looks like you’re quite lucky if you have a two year old device that will get the latest update. The oldest “Google” phone supported is the Nexus 6P (9/2015), giving it just under two years of support.

If you look back at devices that were released around when the iPhone 5S was, the situation looks even worse. Back then, I got a free Moto X for working at Google; the Moto X was about as close to an official Google phone as you could get at the time (this was back when Google owned Moto). The Moto X was released on 8/2013 (a month before the iPhone 5S) and the latest version of Android it supports is 5.1, which was released on 2/2015, a little more than a year and a half later. For an Android phone of its era, the Moto X was supported for an unusually long time. It's a good sign that things look worse as we look further back in time, but at the rate things are improving, it will be years before there's a decently supported Android device released and then years beyond those years before that Android version is in widespread use. It's possible that Fuchsia will fix this, but Fuchsia is also many years away from widespread use.

In a future post, we'll look at Android response latency, which is much higher than iPhone and iPad latency.

Thanks to Leah Hanson, Kate Murphy, Daniel Thomas, Marek Majkowski, @zofrex, @Aissn, Chris Palmer, JonLuca De Caro, and an anonymous person for comments/corrections/related discussion.

Also, thanks to Victorien Villard for making the data these graphs were based on available!

UI backwards compatibility

2017-11-09 08:00:00

About once a month, an app that I regularly use will change its UI in a way that breaks muscle memory, basically tricking the user into doing things they don’t want.

Zulip

In recent memory, Zulip (a slack competitor) changed its newline behavior so that ctrl + enter sends a message instead of inserting a new line. After this change, I sent a number of half-baked messages and it seemed like some other people did too.

Around the time they made that change, they made another change such that a series of clicks that would cause you to send a private message to someone would instead cause you to send a private message to the alphabetically first person who was online. Most people didn’t notice that this was a change, but when I mentioned that this had happened to me a few times in the past couple weeks, multiple people immediately said that the exact same thing happened to them. Some people also mentioned that the behavior of navigation shortcut keys was changed in a way that could cause people to broadcast a message instead of sending a private message. In both cases, some people blamed themselves and didn’t know why they’d just started making mistakes that caused them to send messages to the wrong place.

Doors

A while back, I was at Black Seed Bagel, which has a door that looks 75% like a “push” door from both sides when it’s actually a push door from the outside and a pull door from the inside. An additional clue that makes it seem even more like a "push" door from the inside is that most businesses have outward opening doors (this is required for exit doors in the U.S. when the room occupancy is above 50 and many businesses in smaller spaces voluntarily follow the same convention). During the course of an hour-long conversation, I saw a lot of people go in and out and my guess is that ten people failed on their first attempt to use the door while exiting. When people were travelling in pairs or groups, the person in front would often say something like “I’m dumb. We just used this door a minute ago”. But the people were not, in fact, acting dumb. If anything is dumb, it’s designing doors such that users have to memorize which doors act like “normal” doors and which doors have their cues reversed.

If you’re interested in the physical world, The Design of Everyday Things gives many real-world examples where users are subtly nudged into doing the wrong thing. It also discusses general principles in a way that lets you apply the same ideas and avoid the same issues when designing software.

Facebook

Last week, FB changed its interface so that my normal sequence of clicks to hide a story saves the story instead of hiding it. Saving is pretty much the opposite of hiding! It’s the opposite both from the perspective of the user and also as a ranking signal to the feed ranker. The really “great” thing about a change like this is that it A/B tests incredibly well if you measure new feature “engagement” by number of clicks because many users will accidentally save a story when they meant to hide it. Earlier this year, twitter did something similar by swapping the location of “moments” and “notifications”.

Even if the people making the change didn’t create the tricky interface in order to juice their engagement numbers, this kind of change is still problematic because it poisons analytics data. While it’s technically possible to build a model to separate out accidental clicks vs. purposeful clicks, that’s quite rare (I don’t know of any A/B tests where people have done that) and even in cases where it’s clear that users are going to accidentally trigger an action, I still see devs and PMs justify a feature because of how great it looks on naive statistics like DAU/MAU.

API backwards compatibility

When it comes to software APIs, there’s a school of thought that says that you should never break backwards compatibility for some classes of widely used software. A well-known example is Linus Torvalds:

People should basically always feel like they can update their kernel and simply not have to worry about it.

I refuse to introduce "you can only update the kernel if you also update that other program" kind of limitations. If the kernel used to work for you, the rule is that it continues to work for you. … I have seen, and can point to, lots of projects that go "We need to break that use case in order to make progress" or "you relied on undocumented behavior, it sucks to be you" or "there's a better way to do what you want to do, and you have to change to that new better way", and I simply don't think that's acceptable outside of very early alpha releases that have experimental users that know what they signed up for. The kernel hasn't been in that situation for the last two decades. ... We do API breakage inside the kernel all the time. We will fix internal problems by saying "you now need to do XYZ", but then it's about internal kernel API's, and the people who do that then also obviously have to fix up all the in-kernel users of that API. Nobody can say "I now broke the API you used, and now you need to fix it up". Whoever broke something gets to fix it too. ... And we simply do not break user space.

Raymond Chen quoting Colen:

Look at the scenario from the customer’s standpoint. You bought programs X, Y and Z. You then upgraded to Windows XP. Your computer now crashes randomly, and program Z doesn’t work at all. You’re going to tell your friends, "Don’t upgrade to Windows XP. It crashes randomly, and it’s not compatible with program Z." Are you going to debug your system to determine that program X is causing the crashes, and that program Z doesn’t work because it is using undocumented window messages? Of course not. You’re going to return the Windows XP box for a refund. (You bought programs X, Y, and Z some months ago. The 30-day return policy no longer applies to them. The only thing you can return is Windows XP.)

While this school of thought is a minority, it’s a vocal minority with a lot of influence. It’s much rarer to hear this kind of case made for UI backwards compatibility. You might argue that this is fine -- people are forced to upgrade nowadays, so it doesn’t matter if stuff breaks. But even if users can’t escape, it’s still a bad user experience.

The counterargument to this school of thought is that maintaining compatibility creates technical debt. It’s true! Just for example, Linux is full of slightly to moderately wonky APIs due to the “do not break user space” dictum. One example is int recvmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen, unsigned int flags, struct timespec *timeout); . You might expect the timeout to fire if you don’t receive a packet, but the manpage reads:

The timeout argument points to a struct timespec (see clock_gettime(2)) defining a timeout (seconds plus nanoseconds) for the receive operation (but see BUGS!).

The BUGS section reads:

The timeout argument does not work as intended. The timeout is checked only after the receipt of each datagram, so that if up to vlen-1 datagrams are received before the timeout expires, but then no further datagrams are received, the call will block forever.

This is arguably not even the worst mis-feature of recvmmsg, which returns an ssize_t into a field of size int.

If you have a policy like “we simply do not break user space”, this sort of technical debt sticks around forever. But it seems to me that it’s not a coincidence that the most widely used desktop, laptop, and server operating systems in the world bend over backwards to maintain backwards compatibility.

The case for UI backwards compatibility is arguably stronger than the case for API backwards compatibility because breaking API changes can be mechanically fixed and, with the proper environment, all callers can be fixed at the same time as the API changes. There's no equivalent way to reach into people's brains and change user habits, so a breaking UI change inevitably results in pain for some users.

The case for UI backwards compatibility is arguably weaker than the case for API backwards compatibility because API backwards compatibility has a lower cost -- if some API is problematic, you can make a new API and then document the old API as something that shouldn’t be used (you’ll see lots of these if you look at Linux syscalls). This doesn’t really work with GUIs since UI elements compete with each other for a small amount of screen real-estate. An argument that I think is underrated is that changing UIs isn’t as great as most companies seem to think -- very dated looking UIs that haven’t been refreshed to keep up with trends can be successful (e.g., plentyoffish and craigslist). Companies can even become wildly successful without any significant UI updates, let alone UI redesigns -- a large fraction of linkedin’s rocketship growth happened in a period where the UI was basically frozen. I’m told that freezing the UI wasn’t a deliberate design decision; instead, it was a side effect of severe technical debt, and that the UI was unfrozen the moment a re-write allowed people to confidently change the UI. Linkedin has managed to add a lot of dark patterns since they unfroze their front-end, but the previous UI seemed to work just fine in terms of growth.

Despite the success of a number of UIs which aren’t always updated to track the latest trends, at most companies, it’s basically impossible to make the case that UIs shouldn’t be arbitrarily changed without adding functionality, let alone make the case that UIs shouldn’t push out old functionality with new functionality.

UI deprecation

A case that might be easier to make is that shortcuts and shortcut-like UI elements can be deprecated before removal, similar to the way evolving APIs will add deprecation warnings before making breaking changes. Instead of regularly changing UIs so that users’ muscle memory is used against them and causes users to do the opposite of what they want, UIs can be changed so that doing the previously trained set of actions causes nothing to happen. For example, FB could have moved “hide post” down and inserted a no-op item in the old location, and then after people had gotten used to not clicking in the old “hide post” location for “hide post”, they could have then put “save post” in the old location for “hide post”.

Zulip could’ve done something similar and caused the series of actions that used to let you send a private message to the person you want cause no message to be sent instead of sending a private message to the alphabetically first person on the online list.

These solutions aren’t ideal because the user still has to retrain their muscle memory on the new thing, but it’s still a lot better than the current situation, where many UIs regularly introduce arbitrary-seeming changes that sow confusion and chaos.

In some cases (e.g., the no-op menu item), this presents a pretty strange interface to new users. Users don’t expect to see a menu item that does nothing with an arrow that says to click elsewhere on the menu instead. This can be fixed by only rolling out deprecation “warnings” to users who regularly use the old shortcut or shortcut-like path. If there are multiple changes being deprecated, this results in a combinatorial explosion of possibilities, but if you're regularly deprecating multiple independent items, that's pretty extreme and users are probably going to be confused regardless of how it's handled. Given how little effort is typically made to avoid user hostile changes and the dominance of the “move fast and break things” mindset, the case for adding this kind of complexity just to avoid giving users a bad experience probably won’t hold at most companies, but this at least seems plausible in principle.
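
As a rough sketch of the gated rollout idea (not how FB or Zulip actually build their menus), the old slot can hold a no-op pointer only for users who have recently used the old path; the window length, function shape, and item names below are made up for illustration:

from datetime import datetime, timedelta

# Assumed deprecation window; everything here is illustrative.
DEPRECATION_WINDOW = timedelta(days=30)

def feed_story_menu(last_used_old_hide_path, now):
    # last_used_old_hide_path: when this user last clicked "hide post" in its
    # old position, or None if they never have.
    items = [("Save post", "save_post"), ("Hide post", "hide_post")]
    recently_used_old_path = (
        last_used_old_hide_path is not None
        and now - last_used_old_hide_path < DEPRECATION_WINDOW
    )
    if recently_used_old_path:
        # Keep a no-op placeholder in the old slot so muscle memory doesn't
        # silently trigger the new action; it just points at the new location.
        items.insert(0, ("Hide post has moved (see below)", "noop"))
    return items

# A habitual user still sees the placeholder; a new user never does.
print(feed_story_menu(datetime(2017, 11, 1), datetime(2017, 11, 9)))
print(feed_story_menu(None, datetime(2017, 11, 9)))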

Breaking existing user workflows arguably doesn’t matter for an app like FB, which is relatively sticky as a result of its dominance in its area, but most applications are more like Zulip than FB. Back when Zulip and Slack were both young, Zulip messages couldn’t be edited or deleted. This was on purpose -- messages were immutable and everyone I know who suggested allowing edits was shot down because mutable messages didn’t fit into the immutable model. Back then, if there was a UI change or bug that caused users to accidentally send a public message instead of a private message, that was basically permanent. I saw people accidentally send public messages often enough that I got into the habit of moving private message conversations to another medium. That didn’t bother me too much since I’m used to quirky software, but I know people who tried Zulip back then and, to this day, still refuse to use Zulip due to UI issues they hit back then. That’s a bit of an extreme case, but the general idea that users will tend to avoid apps that repeatedly cause them pain isn’t much of a stretch.

In studies on user retention, it appears to be the case that an additional 500ms of page-load latency negatively impacts retention. If that's the case, it seems like switching the UI around so that the user has to spend 5s undoing an action, or broadcasts a private message publicly in a way that can't be undone, should have a noticeable impact on retention, although I don't know of any public studies that look at this.

Conclusion

If I worked on UI, I might have some suggestions or a call to action. But as an outsider, I’m wary of making actual suggestions -- programmers seem especially prone to coming into an area they’re not familiar with and telling experts how they should solve their problems. While this occasionally works, the most likely outcome is that the outsider either re-invents something that’s been known for decades or completely misses the most important parts of the problem.

It sure would be nice if shortcuts didn’t break so often that I spend as much time consciously stopping myself from using shortcuts as I do actually using the app. But there are probably reasons this is difficult to test/enforce. The huge number of platforms that need to be tested for robust UI testing make testing hard even without adding this extra kind of test. And, even when we’re talking about functional correctness problems, “move fast and break things” is much trendier than “try to break relatively few things”. Since UI “correctness” often has even lower priority than functional correctness, it’s not clear how someone could successfully make a case for spending more effort on it.

On the other hand, despite all these disclaimers, Google sometimes does the exact things described in this post. Chrome recently removed backspace to go backwards; if you hit backspace, you get a note telling you to use alt+left instead and when maps moved some items around a while back, they put in no-op placeholders that pointed people to the new location. This doesn't mean that Google always does this well -- on April fools day of 2016, gmail replaced send and archive with send and attach a gif that's offensive in some contexts -- but these examples indicate that maintaining backwards compatibility through significant changes isn't just a hypothetical idea, it can and has been done.

Thanks to Leah Hanson, Allie Jones, Randall Koutnik, Kevin Lynagh, David Turner, Christian Ternus, Ted Unangst, Michael Bryc, Tony Finch, Stephen Tigner, Steven McCarthy, Julia Evans, @BaudDev, and an anonymous person who has a moral objection to public acknowledgements for comments/corrections/discussion.

If you're curious why "anon" is against acknowledgements, it's because they first saw these in Paul Graham's writing, whose acknowledgements are sort of a who's who of SV. anon's belief is that these sorts of list serve as a kind of signalling. I won't claim that's wrong, but I get a lot of help with my writing both from people reading drafts and also from the occasional helpful public internet comment and I think it's important to make it clear that this isn't a one-person effort to combat what Bunnie Huang calls "the idol effect".

In a future post, we'll look at empirical work on how line length affects readability. I've read every study I could find, but I might be missing some. If you know of a good study you think I should include, please let me know.

Filesystem error handling

2017-10-23 08:00:00

We’re going to reproduce some results from papers on filesystem robustness that were written up roughly a decade ago: Prabhakaran et al. SOSP 05 paper, which injected errors below the filesystem and Gunawi et al. FAST 08, which looked at how often filesystems failed to check return codes of functions that can return errors.

Prabhakaran et al. injected errors at the block device level (just underneath the filesystem) and found that ext3, reiserfs, ntfs, and jfs mostly handled read errors reasonably but ext3, ntfs, and jfs mostly ignored write errors. While the paper is interesting, someone installing Linux on a system today is much more likely to use ext4 than any of the now-dated filesystems tested by Prabhakaran et al. We’ll try to reproduce some of the basic results from the paper on more modern filesystems like ext4 and btrfs, some legacy filesystems like exfat, ext3, and jfs, as well as on overlayfs.

Gunawi et al. found that errors weren’t checked most of the time. After we look at error injection on modern filesystems, we’ll look at how much (or little) filesystems have improved their error handling code.

Error injection

A cartoon view of a file read might be: pread syscall -> OS generic filesystem code -> filesystem specific code -> block device code -> device driver -> device controller -> disk. Once the disk gets the request, it sends the data back up: disk -> device controller -> device driver -> block device code -> filesystem specific code -> OS generic filesystem code -> pread. We’re going to look at error injection at the block device level, right below the file system.
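
As an aside, one way to do this kind of block-level error injection on Linux (not necessarily the setup used for the tests below) is with device-mapper: map most of a backing device straight through with linear targets and return I/O errors for a small sector range with the error target. The device path and sector ranges below are placeholders, and this needs root:

import subprocess

# Placeholders: a backing device with at least ~207k sectors, and an
# 8-sector range in the middle that will return I/O errors.
backing = "/dev/sdb"
table = f"""0 2048 linear {backing} 0
2048 8 error
2056 204800 linear {backing} 2056
"""

# dmsetup reads a multi-line table from stdin; the result appears as
# /dev/mapper/errdev and can be formatted with the filesystem under test.
subprocess.run(["dmsetup", "create", "errdev"], input=table, text=True, check=True)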

Let’s look at what happened when we injected errors in 2017 vs. what Prabhakaran et al. found in 2005.

            2005                     2017 (file)              2017 (mmap)
            read   write   silent    read   write   silent    read   write   silent
btrfs       -      -       -         prop   prop    prop      prop   prop    prop
exfat       -      -       -         prop   prop    ignore    prop   prop    ignore
ext3        prop   ignore  ignore    prop   prop    ignore    prop   prop    ignore
ext4        -      -       -         prop   prop    ignore    prop   prop    ignore
fat         -      -       -         prop   prop    ignore    prop   prop    ignore
jfs         prop   ignore  ignore    prop   ignore  ignore    prop   prop    ignore
reiserfs    prop   prop    ignore    -      -       -         -      -       -
xfs         -      -       -         prop   prop    ignore    prop   prop    ignore

Each row shows results for one filesystem. read and write indicate reading and writing data, respectively, where the block device returns an error indicating that the operation failed. silent indicates a read failure (incorrect data) where the block device didn’t indicate an error. This could happen if there’s disk corruption, a transient read failure, or a transient write failure that silently caused bad data to be written. file indicates that the operation was done on a file opened with open and mmap indicates that the test was done on a file mapped with mmap. ignore indicates that the error was ignored, prop indicates that the error was propagated and that the pread or pwrite syscall returned an error code, and fix indicates that the error was corrected. No errors were corrected. Entries marked with a dash indicate configurations that weren’t tested.

From the table, we can see that, in 2005, ext3 and jfs ignored write errors even when the block device indicated that the write failed and that things have improved, and that any filesystem you’re likely to use will correctly tell you that a write failed. jfs hasn’t improved, but jfs is now rarely used outside of legacy installations.

No tested filesystem other than btrfs handled silent failures correctly. The other filesystems tested neither duplicate nor checksum data, making it impossible for them to detect silent failures. zfs would probably also handle silent failures correctly but wasn’t tested. apfs, despite post-dating btrfs and zfs, made the explicit decision to not checksum data and silently fail on silent block device errors. We’ll discuss this more later.

In all cases tested where errors were propagated, file reads and writes returned EIO from pread or pwrite, respectively; mmap reads and writes caused the process to receive a SIGBUS signal.
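
From userspace, "the error was propagated" looks something like the sketch below: a failed pread surfaces as an OSError with errno set to EIO, while a failed access to an mmap'd page arrives as SIGBUS, which an unprepared process won't survive. This is just an illustration of the failure modes described above, not part of the test harness:

import errno
import os

def read_block(path, offset, length=4096):
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    except OSError as e:
        if e.errno == errno.EIO:
            # The filesystem propagated the block-device error up to us.
            raise RuntimeError("I/O error reading %s at offset %d" % (path, offset)) from e
        raise
    finally:
        os.close(fd)

# For an mmap'd file there's no error return to check: touching a bad page
# delivers SIGBUS, which kills the process unless a signal handler is set up.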

The 2017 tests above used an 8k file where the first block that contained file data either returned an error at the block device level or was corrupted, depending on the test. The table below tests the same thing, but with a 445 byte file instead of an 8k file. The choice of 445 was arbitrary.

            2005                     2017 (file)              2017 (mmap)
            read   write   silent    read   write   silent    read   write   silent
btrfs       -      -       -         fix    fix     fix       fix    fix     fix
exfat       -      -       -         prop   prop    ignore    prop   prop    ignore
ext3        prop   ignore  ignore    prop   prop    ignore    prop   prop    ignore
ext4        -      -       -         prop   prop    ignore    prop   prop    ignore
fat         -      -       -         prop   prop    ignore    prop   prop    ignore
jfs         prop   ignore  ignore    prop   ignore  ignore    prop   prop    ignore
reiserfs    prop   prop    ignore    -      -       -         -      -       -
xfs         -      -       -         prop   prop    ignore    prop   prop    ignore

In the small file test table, all the results are the same, except for btrfs, which returns correct data in every case tested. What’s happening here is that the filesystem was created on a rotational disk and, by default, btrfs duplicates filesystem metadata on rotational disks (it can be configured to do so on SSDs, but that’s not the default). Since the file was tiny, btrfs packed the file into the metadata and the file was duplicated along with the metadata, allowing the filesystem to fix the error when one block either returned bad data or reported a failure.

Overlay

Overlayfs allows one file system to be “overlaid” on another. As explained in the initial commit, one use case might be to put an (upper) read-write directory tree on top of a (lower) read-only directory tree, where all modifications go to the upper, writable layer.

Although not listed in the tables, we also tested every filesystem other than fat as the lower filesystem with overlayfs (ext4 was the upper filesystem for all tests). Every filesystem tested showed the same results when used as the bottom layer in overlayfs as when used alone. fat wasn’t tested because mounting it as the lower layer resulted in a filesystem not supported error.

Error correction

btrfs doesn’t, by default, duplicate metadata on SSDs because the developers believe that redundancy wouldn’t provide protection against errors on SSD (which is the same reason apfs doesn’t have redundancy). SSDs do a kind of write coalescing, which is likely to cause writes which happen consecutively to fall into the same block. If that block has a total failure, the redundant copies would all be lost, so redundancy doesn’t provide as much protection against failure as it would on a rotational drive.

I’m not sure that this means that redundancy wouldn’t help -- Individual flash cells degrade with operation and lose charge as they age. SSDs have built-in wear-leveling and error-correction that’s designed to reduce the probability that a block returns bad data, but over time, some blocks will develop so many errors that the error-correction won’t be able to fix the error and the block will return bad data. In that case, a read should return some bad bits along with mostly good bits. AFAICT, the publicly available data on SSD error rates seems to line up with this view.

Error detection

Relatedly, it appears that apfs doesn’t checksum data because “[apfs] engineers contend that Apple devices basically don’t return bogus data”. Publicly available studies on SSD reliability have not found that there’s a model that doesn’t sometimes return bad data. It’s a common conception that SSDs are less likely to return bad data than rotational disks, but when Google studied this across their drives, they found:

The annual replacement rates of hard disk drives have previously been reported to be 2-9% [19,20], which is high compared to the 4-10% of flash drives we see being replaced in a 4 year period. However, flash drives are less attractive when it comes to their error rates. More than 20% of flash drives develop uncorrectable errors in a four year period, 30-80% develop bad blocks and 2-7% of them develop bad chips. In comparison, previous work [1] on HDDs reports that only 3.5% of disks in a large population developed bad sectors in a 32 months period – a low number when taking into account that the number of sectors on a hard disk is orders of magnitudes larger than the number of either blocks or chips on a solid state drive, and that sectors are smaller than blocks, so a failure is less severe.

While there is one sense in which SSDs are more reliable than rotational disks, there’s also a sense in which they appear to be less reliable. It’s not impossible that Apple uses some kind of custom firmware on its drive that devotes more bits to error correction than you can get in publicly available disks, but even if that’s the case, you might plug a non-apple drive into your apple computer and want some kind of protection against data corruption.

Internal error handling

Now that we’ve reproduced some tests from Prabhakaran et al., we’re going to move on to Gunawi et al. Since that paper is fairly involved, we’re just going to look at one small part of it, the part where they examined three function calls, filemap_fdatawait, filemap_fdatawrite, and sync_blockdev, to see how often errors weren’t checked for these functions.

Their justification for looking at these functions is given as:

As discussed in Section 3.1, a function could return more than one error code at the same time, and checking only one of them suffices. However, if we know that a certain function only returns a single error code and yet the caller does not save the return value properly, then we would know that such call is really a flaw. To find real flaws in the file system code, we examined three important functions that we know only return single error codes: sync_blockdev, filemap_fdatawrite, and filemap_fdatawait. A file system that does not check the returned error codes from these functions would obviously let failures go unnoticed in the upper layers.

Ignoring errors from these functions appears to have fairly serious consequences. The documentation for filemap_fdatawait says:

filemap_fdatawait — wait for all under-writeback pages to complete ... Walk the list of under-writeback pages of the given address space and wait for all of them. Check error status of the address space and return it. Since the error status of the address space is cleared by this function, callers are responsible for checking the return value and handling and/or reporting the error.

The comment next to the code for sync_blockdev reads:

Write out and wait upon all the dirty data associated with a block device via its mapping. Does not take the superblock lock.

In both of these cases, it appears that ignoring the error code could mean that data would fail to get written to disk without notifying the writer that the data wasn’t actually written.

Let’s look at how often calls to these functions didn’t completely ignore the error code:

fn                   2008       '08 %   2017       '17 %
filemap_fdatawait    7 / 29     24      12 / 17    71
filemap_fdatawrite   17 / 47    36      13 / 22    59
sync_blockdev        6 / 21     29      7 / 23     30

This table is for all code in linux under fs. Each row shows data for calls of one function. For each year, the leftmost cell shows the number of calls that do something with the return value over the total number of calls. The cell to the right shows the percentage of calls that do something with the return value. “Do something” is used very loosely here -- branching on the return value and then failing to handle the error in either branch, returning the return value and having the caller fail to handle the return value, as well as saving the return value and then ignoring it are all considered doing something for the purposes of this table.

For example, Gunawi et al. noted that cifs/transport.c had

int SendReceive() {
    int rc;
    rc = cifs_sign_smb();  // rc is assigned here but never checked...
    ...
    rc = smb_send();       // ...before being overwritten here
}

Although cifs_sign_smb returned an error code, it was never checked before being overwritten by smb_send, which counted as being used for our purposes even though the error wasn’t handled.

Overall, the table appears to show that many more errors are handled now than were handled in 2008 when Gunawi et al. did their analysis, but it’s hard to say what this means from looking at the raw numbers because it might be ok for some errors not to be handled and different lines of code are executed with different probabilities.

Conclusion

Filesystem error handling seems to have improved. Reporting an error on a pwrite if the block device reports an error is perhaps the most basic error propagation a robust filesystem should do; few filesystems reported that error correctly in 2005. Today, most filesystems will correctly report an error when the simplest possible error condition that doesn’t involve the entire drive being dead occurs if there are no complicating factors.

Most filesystems don’t have checksums for data and leave error detection and correction up to userspace software. When I talk to server-side devs at big companies, their answer is usually something like “who cares? All of our file accesses go through a library that checksums things anyway and redundancy across machines and datacenters takes care of failures, so we only need error detection and not correction”. While that’s true for developers at certain big companies, there’s a lot of software out there that isn’t written robustly and just assumes that filesystems and disks don’t have errors.

This was a joint project with Wesley Aptekar-Cassels; the vast majority of the work for the project was done while pair programming at RC. We also got a lot of help from Kate Murphy. Both Wesley ([email protected]) and Kate ([email protected]) are looking for work. They’re great and I highly recommend talking to them if you’re hiring!

Appendix: error handling in C

A fair amount of effort has been applied to get error handling right. But C makes it very easy to get things wrong, even when you apply a fair amount of effort and even apply extra tooling. One example of this in the code is the submit_one_bio function. If you look at the definition, you can see that it’s annotated with __must_check, which will cause a compiler warning when the result is ignored. But if you look at calls of submit_one_bio, you’ll see that its callers aren’t annotated and can ignore errors. If you dig around enough you’ll find one path of error propagation that looks like:

submit_one_bio
submit_extent_page
__extent_writepage
extent_write_full_page
write_cache_pages
generic_writepages
do_writepages
__filemap_fdatawrite_range
__filemap_fdatawrite
filemap_fdatawrite

Nine levels removed from submit_one_bio, we see our old friend, filemap_fdatawrite, which we know often doesn’t get checked for errors.
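
Here's a standalone sketch (ordinary userspace C, not kernel code) of why __must_check by itself doesn't catch this class of bug: the warning only fires when the result of an annotated function is discarded at the direct call site, so storing the result in a variable, or propagating it through an unannotated wrapper, silences it.

// __must_check is how the kernel spells this attribute; define it locally.
#define __must_check __attribute__((warn_unused_result))

static __must_check int submit_io(void) {
    return -5;  // stand-in for -EIO
}

static void bad_direct_caller(void) {
    submit_io();  // compiler warning: result ignored
}

static void bad_indirect_caller(void) {
    int ret = submit_io();  // no warning: the result was "used"
    (void)ret;              // ...and then silently dropped
}

// An unannotated wrapper propagates the error, but its own callers can
// ignore it with no warning at all -- which is what happens many levels
// up the submit_one_bio call chain.
static int unannotated_wrapper(void) {
    return submit_io();
}

static void caller_of_wrapper(void) {
    unannotated_wrapper();  // no warning: the wrapper isn't annotated
}

int main(void) {
    bad_direct_caller();
    bad_indirect_caller();
    caller_of_wrapper();
    return 0;
}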

There's a very old debate over how to prevent things like this from accidentally happening. One school of thought, which I'll call the Uncle Bob (UB) school, believes that we can't fix these kinds of issues with tools or processes and simply need to be better programmers in order to avoid bugs. You'll often hear people of the UB school say things like, "you can't get rid of all bugs with better tools (or processes)". In his famous and well-regarded talk, Simple Made Easy, Rich Hickey says

What's true of every bug found in the field?

[Audience reply: Someone wrote it?] [Audience reply: It got written.]

It got written. Yes. What's a more interesting fact about it? It passed the type checker.

[Audience laughter]

What else did it do?

[Audience reply: (Indiscernible)]

It passed all the tests. Okay. So now what do you do? Right? I think we're in this world I'd like to call guardrail programming. Right? It's really sad. We're like: I can make change because I have tests. Who does that? Who drives their car around banging against the guardrail saying, "Whoa! I'm glad I've got these guardrails because I'd never make it to the show on time."

[Audience laughter]

If you watch the talk, Rich uses "simplicity" the way Uncle Bob uses "discipline". The way these statements are used, they're roughly equivalent to Ken Thompson saying "Bugs are bugs. You write code with bugs because you do". The UB school throws tools and processes under the bus, saying that it's unsafe to rely solely on tools or processes.

Rich's rhetorical trick is brilliant -- I've heard that line quoted tens of times since the talk to argue against tests or tools or types. But, like guardrails, most tools and processes aren't about eliminating all bugs, they're about reducing the severity or probability of bugs. If we look at this particular function call, we can see that a static analysis tool failed to find this bug. Does that mean that we should give up on static analysis tools? A static analysis tool could look for all calls of submit_one_bio and show you the cases where the error is propagated up N levels only to be dropped. Gunawi et al. did exactly that and found a lot of bugs. A person basically can't do the same thing without tooling. They could try, but people are lucky if they get 95% accuracy when manually digging through things like this. The sheer volume of code guarantees that a human doing this by hand would make mistakes.

Even better than a static analysis tool would be a language that makes it harder to accidentally forget about checking for an error. One of the issues here is that it's sometimes valid to drop an error. There are a number of places where there's no interface that allows an error to get propagated out of the filesystem, making it correct to drop the error, modulo changing the interface. In the current situation, as an outsider reading the code, if you look at a bunch of calls that drop errors, it's very hard to say, for all of them, which of those is a bug and which of those is correct. If the default is that we have a kind of guardrail that says "this error must be checked", people can still incorrectly ignore errors, but you at least get an annotation that the omission was on purpose. For example, if you're forced to specifically write code that indicates that you're ignoring an error, and in code that's intended to be robust, like filesystem code, code that drops an error on purpose is relatively likely to be accompanied by a comment explaining why the error was dropped.
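
As a sketch of what that kind of annotation could look like in C -- this is a hypothetical convention, not an existing kernel or library API -- you can pair warn_unused_result with a macro that makes intentionally dropped errors explicit and grep-able:

#include <stdio.h>

// Hypothetical macro: dropping an error requires an explicit, grep-able
// annotation and a stated reason (the reason is only documentation).
#define IGNORE_ERROR(expr, why) do { int ignored_ = (expr); (void)ignored_; } while (0)

static __attribute__((warn_unused_result)) int flush_data(void) {
    return -5;  // stand-in for -EIO
}

static void writeback_path(void) {
    // There's no interface here to report the error upward, so dropping it
    // is arguably correct -- but the omission is now visibly on purpose.
    IGNORE_ERROR(flush_data(), "no way to propagate errors past this layer");
}

int main(void) {
    writeback_path();
    printf("done\n");
    return 0;
}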

Appendix: why wasn't this done earlier?

After all, it would be nice if we knew if modern filesystems could do basic tasks correctly. Filesystem developers probably know this stuff, but since I don't follow LKML, I had no idea whether or not things had improved since 2005 until we ran the experiment.

The papers we looked at here came out of Andrea and Remzi Arpaci-Dusseau's research lab. Remzi has a talk where he mentioned that grad students don't want to reproduce and update old work. That's entirely reasonable, given the incentives they face. And I don't mean to pick on academia here -- this work came out of academia, not industry. It's possible this kind of work simply wouldn't have happened if not for the academic incentive system.

In general, it seems to be quite difficult to fund work on correctness. There are a fair number of papers on new ways to find bugs, but there's relatively little work on applying existing techniques to existing code. In academia, that kind of work seems to be hard to get a good publication out of; in the open source world, it seems to be less interesting to people than writing new code. That's also entirely reasonable -- people should work on what they want, and even if they enjoy working on correctness, that's probably not a great career decision in general. I was at the RC career fair the other night and my badge said I was interested in testing. The first person who chatted me up opened with "do you work in QA?". Back when I worked in hardware, that wouldn't have been a red flag, but in software, "QA" is code for a low-skill, tedious, and poorly paid job. Much of industry considers testing and QA to be an afterthought. As a result, open source projects that companies rely on are often woefully underfunded. Google funds some great work (like afl-fuzz), but that's the exception and not the rule, even within Google, and most companies don't fund any open source work. The work in this post was done by a few people who are intentionally temporarily unemployed, which isn't really a scalable model.

Occasionally, you'll see someone spend a lot of effort on improving correctness, but that's usually done as a massive amount of free labor. Kyle Kingsbury might be the canonical example of this -- my understanding is that he worked on the Jepsen distributed systems testing tool on nights and weekends for years before turning that into a consulting business. It's great that he did that -- he showed that almost every open source distributed system had serious data loss or corruption bugs. I think that's great, but stories about heroic effort like that always worry me because heroism doesn't scale. If Kyle hadn't come along, would most of the bugs that he and his tool found still plague open source distributed systems today? That's a scary thought.

If I knew how to fund more work on correctness, I'd try to convince you that we should switch to this new model, but I don't know of a funding model that works. I've set up a patreon (donation account), but it would be quite extraordinary if that was sufficient to actually fund a significant amount of work. If you look at how much programmers make off of donations, if I made two orders of magnitude less than I could if I took a job in industry, that would already put me in the top 1% of programmers on patreon. If I made one order of magnitude less than I'd make in industry, that would be extraordinary. Off the top of my head, the only programmers who make more than that off of patreon either make something with much broader appeal (like games) or are Evan You, who makes one of the most widely used front-end libraries in existence. And if I actually made as much as I can make in industry, I suspect that would make me the highest grossing programmer on patreon, even though, by industry standards, my compensation hasn't been anything special.

If I had to guess, I'd say that part of the reason it's hard to fund this kind of work is that consumers don't incentivize companies to fund this sort of work. If you look at "big" tech companies, two of them are substantially more serious about correctness than their competitors. This results in many fewer horror stories about lost emails and documents as well as lost entire accounts. If you look at the impact on consumers, it might be something like the difference between 1% of people seeing lost/corrupt emails vs. 0.001%. I think that's pretty significant if you multiply that cost across all consumers, but the vast majority of consumers aren't going to make decisions based on that kind of difference. If you look at an area where correctness problems are much more apparent, like databases or backups, you'll find that even the worst solutions have defenders who will pop into any discussion and say "works for me". A backup solution that works 90% of the time is quite bad, but if you have one that works 90% of the time, it will still have staunch defenders who drop into discussions to say things like "I've restored from backup three times and it's never failed! You must be making stuff up!". I don't blame companies for rationally responding to consumers, but I do think that the result is unfortunate for consumers.

Just as an aside, one of the great wonders of doing open work for free is that the more free work you do, the more people complain that you didn't do enough free work. As David MacIver has said, doing open source work is like doing normal paid work, except that you get paid in complaints instead of cash. It's basically guaranteed that the most common comment on this post, for all time, will be that we didn't test someone's pet filesystem because we're btrfs shills or just plain lazy, even though we include a link to a repo that lets anyone add tests as they please. Pretty much every time I've done any kind of free experimental work, people who obviously haven't read the experimental setup or the source code complain that the experiment couldn't possibly be right because of [thing that isn't true that anyone could see by looking at the setup] and that it's absolutely inexcusable that I didn't run the experiment on the exact pet thing they wanted to see. Having played video games competitively in the distant past, I'm used to much more intense internet trash talk, but in general, this incentive system seems to be backwards.

Appendix: experimental setup

For the error injection setup, a high-level view of the experimental setup is that dmsetup was used to simulate bad blocks on the disk.

A list of the commands run looks something like:

cp images/btrfs.img.gz /tmp/tmpeas9efr6.gz
gunzip -f /tmp/tmpeas9efr6.gz
losetup -f
losetup /dev/loop19 /tmp/tmpeas9efr6
blockdev --getsize /dev/loop19
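# dmsetup table for the create command below: sector 74078 maps to the error target, the rest is passed through to the loop device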
#        0 74078 linear /dev/loop19 0
#        74078 1 error
#        74079 160296 linear /dev/loop19 74079
dmsetup create fserror_test_1508727591.4736078
mount /dev/mapper/fserror_test_1508727591.4736078 /mnt/fserror_test_1508727591.4736078/
mount -t overlay -o lowerdir=/mnt/fserror_test_1508727591.4736078/,upperdir=/tmp/tmp4qpgdn7f,workdir=/tmp/tmp0jn83rlr overlay /tmp/tmpeuot7zgu/
./mmap_read /tmp/tmpeuot7zgu/test.txt
umount /tmp/tmpeuot7zgu/
rm -rf /tmp/tmp4qpgdn7f
rm -rf /tmp/tmp0jn83rlr
umount /mnt/fserror_test_1508727591.4736078/
dmsetup remove fserror_test_1508727591.4736078
losetup -d /dev/loop19
rm /tmp/tmpeas9efr6

See this github repo for the exact set of commands run to execute tests.

Note that all of these tests were done on linux, so fat means the linux fat implementation, not the windows fat implementation. zfs and reiserfs weren’t tested because they couldn’t be trivially tested in the exact same way that we tested other filesystems (one of us spent an hour or two trying to get zfs to work, but its configuration interface is inconsistent with all of the filesystems tested; reiserfs appears to have a consistent interface but testing it requires doing extra work for a filesystem that appears to be dead). ext3 support is now provided by the ext4 code, so what ext3 means now is different from what it meant in 2005.

All tests were run on both ubuntu 17.04, 4.10.0-37, as well as on arch, 4.12.8-2. We got the same results on both machines. All filesystems were configured with default settings. For btrfs, this meant duplicated metadata without duplicated data and, as far as we know, the settings wouldn't have made a difference for other filesystems.

The second part of this doesn’t have much experimental setup to speak of. The setup was to grep the linux source code for the relevant functions.

Thanks to Leah Hanson, David Wragg, Ben Kuhn, Wesley Aptekar-Cassels, Joel Borggrén-Franck, Yuri Vishnevsky, and Dan Puttick for comments/corrections on this post.

Keyboard latency

2017-10-16 08:00:00

If you look at “gaming” keyboards, a lot of them sell for $100 or more on the promise that they’re fast. Ad copy that you’ll see includes:

  • a custom designed keycap that has been made shorter to reduce the time it takes for your actions to register
  • 8x FASTER - Polling Rate of 1000Hz: Response time 0.1 milliseconds
  • Wield the ultimate performance advantage over your opponents with light operation 45g key switches and an actuation 40% faster than standard Cherry MX Red switches
  • World's Fastest Ultra Polling 1000Hz
  • World's Fastest Gaming Keyboard, 1000Hz Polling Rate, 0.001 Second Response Time

Despite all of these claims, I can only find one person who’s publicly benchmarked keyboard latency and they only tested two keyboards. In general, my belief is that if someone makes performance claims without benchmarks, the claims probably aren’t true, just like how code that isn’t tested (or otherwise verified) should be assumed broken.

The situation with gaming keyboards reminds me a lot of talking to car salesmen:

Salesman: this car is super safe! It has 12 airbags!
Me: that’s nice, but how does it fare in crash tests?
Salesman: 12 airbags!

Sure, gaming keyboards have 1000Hz polling, but so what?

Two obvious questions are:

  1. Does keyboard latency matter?
  2. Are gaming keyboards actually quicker than other keyboards?

Does keyboard latency matter?

A year ago, if you’d asked me if I was going to build a custom setup to measure keyboard latency, I would have said that’s silly, and yet here I am, measuring keyboard latency with a logic analyzer.

It all started because I had this feeling that some old computers feel much more responsive than modern machines. For example, an iMac G4 running Mac OS 9 or an Apple 2 both feel quicker than my 4.2 GHz Kaby Lake system. I never trust feelings like this because there’s decades of research showing that users often have feelings that are the literal opposite of reality, so I got a high-speed camera and started measuring actual keypress-to-screen-update latency as well as mouse-move-to-screen-update latency. It turns out the machines that feel quick are actually quick, much quicker than my modern computer -- computers from the 70s and 80s commonly have keypress-to-screen-update latencies in the 30ms to 50ms range out of the box, whereas modern computers are often in the 100ms to 200ms range when you press a key in a terminal. It’s possible to get down to the 50ms range in well optimized games with a fancy gaming setup, and there’s one extraordinary consumer device that can easily get below 50ms, but the default experience is much slower. Modern computers have much better throughput, but their latency isn’t so great.

Anyway, at the time I did these measurements, my 4.2 GHz Kaby Lake had the fastest single-threaded performance of any machine you could buy but had worse latency than a quick machine from the 70s (roughly 6x worse than an Apple 2), which seems a bit curious. To figure out where the latency comes from, I started measuring keyboard latency because that’s the first part of the pipeline. My plan was to look at the end-to-end pipeline and start at the beginning, ruling out keyboard latency as a real source of latency. But it turns out keyboard latency is significant! I was surprised to find that the median keyboard I tested has more latency than the entire end-to-end pipeline of the Apple 2. If this doesn’t immediately strike you as absurd, consider that an Apple 2 has 3500 transistors running at 1MHz and an Atmel employee estimates that the core used in a number of high-end keyboards today has 80k transistors running at 16MHz. That's 20x the transistors running at 16x the clock speed -- keyboards are often more powerful than entire computers from the 70s and 80s! And yet, the median keyboard today adds as much latency as the entire end-to-end pipeline of a fast machine from the 70s.

Let’s look at the measured keypress-to-USB latency on some keyboards:

keyboard               latency (ms)   connection   gaming
apple magic (usb)      15             USB FS
hhkb lite 2            20             USB FS
MS natural 4000        20             USB
das 3                  25             USB
logitech k120          30             USB
unicomp model M        30             USB FS
pok3r vortex           30             USB FS
filco majestouch       30             USB
dell OEM               30             USB
powerspec OEM          30             USB
kinesis freestyle 2    30             USB FS
chinfai silicone       35             USB FS
razer ornata chroma    35             USB FS       Yes
olkb planck rev 4      40             USB FS
ergodox                40             USB FS
MS comfort 5000        40             wireless
easterntimes i500      50             USB FS       Yes
kinesis advantage      50             USB FS
genius luxemate i200   55             USB
topre type heaven      55             USB FS
logitech k360          60             "unifying"

The latency measurements are the time from when the key starts moving to the time when the USB packet associated with the key makes it out onto the USB bus. Numbers are rounded to the nearest 5 ms in order to avoid giving a false sense of precision. The easterntimes i500 is also sold as the tomoko MMC023.

The connection column indicates the connection used. USB FS stands for the usb full speed protocol, which allows up to 1000Hz polling, a feature commonly advertised by high-end keyboards. USB is the usb low speed protocol, which is the protocol most keyboards use. The ‘gaming’ column indicates whether or not the keyboard is branded as a gaming keyboard. wireless indicates some kind of keyboard-specific dongle and unifying is logitech's wireless device standard.

We can see that, even with the limited set of keyboards tested, there can be as much as a 45ms difference in latency between keyboards. Moreover, a modern computer with one of the slower keyboards attached can’t possibly be as responsive as a quick machine from the 70s or 80s because the keyboard alone is slower than the entire response pipeline of some older computers.

That establishes the fact that modern keyboards contribute to the latency bloat we’ve seen over the past forty years. The other half of the question is, does the latency added by a modern keyboard actually make a difference to users? From looking at the table, we can see that among the keyboards tested, we can get up to a 40ms difference in average latency. Is 40ms of latency noticeable? Let’s take a look at the empirical research on how much latency users notice.

There’s a fair amount of empirical evidence on this and we can see that, for very simple tasks, people can perceive latencies down to 2ms or less. Moreover, increasing latency is not only noticeable to users, it causes users to execute simple tasks less accurately. If you want a visual demonstration of what latency looks like and you don’t have a super-fast old computer lying around, check out this MSR demo on touchscreen latency.

Are gaming keyboards faster than other keyboards?

I’d really like to test more keyboards before making a strong claim, but from the preliminary tests here, it appears that gaming keyboards aren’t generally faster than non-gaming keyboards.

Gaming keyboards often claim to have features that reduce latency, like connecting over USB FS and using 1000Hz polling. The USB low speed spec states that the minimum time between packets is 10ms, or 100 Hz. However, it’s common to see USB devices round this down to the nearest power of two and run at 8ms, or 125Hz. With 8ms polling, the average latency added from having to wait until the next polling interval is 4ms. With 1ms polling, the average latency from USB polling is 0.5ms, giving us a 3.5ms delta. While that might be a significant contribution to latency for a quick keyboard like the Apple magic keyboard, it’s clear that other factors dominate keyboard latency for most keyboards and that the gaming keyboards tested here are so slow that shaving off 3.5ms won’t save them.

Another thing to note about gaming keyboards is that they often advertise "n-key rollover" (the ability to have n simultaneous keys pressed at once — for many key combinations, typical keyboards will often only let you press two keys at once, excluding modifier keys). Although not generally tested here, I tried a "Razer DeathStalker Expert Gaming Keyboard" that advertises "Anti-ghosting capability for up to 10 simultaneous key presses". The Razer gaming keyboard did not have this capability in a useful manner and many combinations of three keys didn't work. Their advertising claim could, I suppose, be technically true in that 3 in some cases could be "up to 10", but like gaming keyboards claiming to have lower latency due to 1000 Hz polling, the claim is highly misleading at best.

Conclusion

Most keyboards add enough latency to make the user experience noticeably worse, and keyboards that advertise speed aren’t necessarily faster. The two gaming keyboards we measured weren’t faster than non-gaming keyboards, and the fastest keyboard measured was a minimalist keyboard from Apple that’s marketed more on design than speed.

Previously, we've seen that terminals can add significant latency, up to 100ms in mildly pessimistic conditions if you choose the "right" terminal. In a future post, we'll look at the entire end-to-end pipeline to see other places latency has crept in and we'll also look at how some modern devices keep latency down.

Appendix: where is the latency coming from?

A major source of latency is key travel time. It’s not a coincidence that the quickest keyboard measured also has the shortest key travel distance by a large margin. The video setup I’m using to measure end-to-end latency is a 240 fps camera, which means that frames are 4ms apart. When videoing “normal” keypresses and typing, it takes 4-8 frames for a key to become fully depressed. Most switches will start firing before the key is fully depressed, but the key travel time is still significant and can easily add 10ms of delay (or more, depending on the switch mechanism). Contrast this to the Apple "magic" keyboard measured, where the key travel is so short that it can’t be captured with a 240 fps camera, indicating that the key travel time is < 4ms.

Note that, unlike the other measurement I was able to find online, this measurement was from the start of the keypress instead of the switch activation. This is because, as a human, you don't activate the switch, you press the key. A measurement that starts from switch activation time misses this large component of latency. If, for example, you're playing a game and you switch from moving forward to moving backwards when you see something happen, you have to pay the cost of the key movement, which is different for different keyboards. A common response to this is that "real" gamers will preload keys so that they don't have to pay the key travel cost, but if you go around with a high speed camera and look at how people actually use their keyboards, the fraction of keypresses that are significantly preloaded is basically zero even when you look at gamers. It's possible you'd see something different if you look at high-level competitive gamers, but even then, just for example, people who use a standard wasd or esdf layout will typically not preload a key when going from back to forward. Also, the idea that it's fine that keys have a bunch of useless travel because you can pre-depress the key before really pressing the key is just absurd. That's like saying latency on modern computers is fine because some people build gaming boxes that, when run with unusually well optimized software, get 50ms response time. Normal, non-hardcore-gaming users simply aren't going to do this. Since that's the vast majority of the market, even if all "serious" gamers did this, that would still be a rounding error.

The other large sources of latency are scanning the keyboard matrix and debouncing. Neither of these delays is inherent -- keyboards use a matrix that has to be scanned instead of having a wire per key because it saves a few bucks, and most keyboards scan the matrix at such a slow rate that it induces human-noticeable delays because that saves a few bucks, but a manufacturer willing to spend a bit more on manufacturing a keyboard could make the delay from scanning far below the threshold of human perception. See below for debouncing delay.

Although we didn't discuss throughput in this post, when I measure my typing speed, I find that I can type faster with the low-travel Apple keyboard than with any of the other keyboards. There's no way to do a blinded experiment for this, but Gary Bernhardt and others have also observed the same thing. Some people claim that key travel doesn't matter for typing speed because they use the minimum amount of travel necessary and that this therefore can't matter, but as with the above claims on keypresses, if you walk around with a high speed camera and observe what actually happens when people type, it's very hard to find someone who actually does this.

2022 update

When I ran these experiments, it didn't seem that anyone was testing latency across multiple keyboards. I found the results I got so unintuitive that I tried to find anyone else's keyboard latency measurements and all I could find was a forum post from someone who tried to measure their keyboard (just one) and got results in the same range, but using a setup that wasn't fast enough to really measure the latency properly. I also video'd my test as well as non-test keypresses with a high-speed camera to see how much time it took to depress keys, and those results weren't obviously inconsistent with the measurements here.

Starting a year or two after I wrote the post, I witnessed some discussions from some gaming mouse and keyboard makers on how to make lower latency devices and they started releasing devices that actually have lower latency, as opposed to the devices they had, which basically had gaming skins and would often light up.

If you want a low-latency keyboard that isn't the Apple keyboard (quite a few people I've talked to report finger pain after using the Apple keyboard for an extended period of time), the SteelSeries Apex Pro is fairly low latency; for a mouse, the Corsair Sabre is also pretty quick.

Another change since then is that more people understand that debouncing doesn't have to add noticeable latency. When I wrote the original post, I had multiple keyboard makers explain to me that the post is wrong and it's impossible to not add latency when debouncing. I found that very odd since I'd expect a freshman EE or, for that matter, a high school kid who plays with electronics, to understand why that's not the case but, for whatever reason, multiple people who made keyboards for a living didn't understand this. Now, how to debounce without adding latency has become common knowledge and, when I see discussions where someone says debouncing must add a lot of latency, they usually get corrected. This knowledge has spread to most keyboard makers and reduced keyboard latency for some new keyboards, although I know there's still at least one keyboard maker that doesn't believe that you can debounce with low latency, and their new keyboards still add quite a bit of latency as a result.

Appendix: counter-arguments to common arguments that latency doesn’t matter

Before writing this up, I read what I could find about latency and it was hard to find non-specialist articles or comment sections that didn’t have at least one of the arguments listed below:

Computers and devices are fast

The most common response to questions about latency is that input latency is basically zero, or so close to zero that it’s a rounding error. For example, two of the top comments on this slashdot post asking about keyboard latency are that keyboards are so fast that keyboard speed doesn’t matter. One person even says

There is not a single modern keyboard that has 50ms latency. You (humans) have that sort of latency.

As far as response times, all you need to do is increase the poll time on the USB stack

As we’ve seen, some devices do have latencies in the 50ms range. This quote as well as other comments in the thread illustrate another common fallacy -- that input devices are limited by the speed of the USB polling. While that’s technically possible, most devices are nowhere near being fast enough to be limited by USB polling latency.

Unfortunately, most online explanations of input latency assume that the USB bus is the limiting factor.

Humans can’t notice 100ms or 200ms latency

Here’s a “cognitive neuroscientist who studies visual perception and cognition” who refers to the fact that human reaction time is roughly 200ms, and then throws in a bunch more scientific mumbo jumbo to say that no one could really notice latencies below 100ms. This is a little unusual in that the commenter claims some kind of special authority and uses a lot of terminology, but it’s common to hear people claim that you can’t notice 50ms or 100ms of latency because human reaction time is 200ms. This doesn’t actually make sense because these are independent quantities. This line of argument is like saying that you wouldn’t notice a flight being delayed by an hour because the duration of the flight is six hours.

Another problem with this line of reasoning is that the full pipeline from keypress to screen update is quite long and if you say that it’s always fine to add 10ms here and 10ms there, you end up with a much larger amount of bloat through the entire pipeline, which is how we got where we are today, where you can buy a system with the CPU that gives you the fastest single-threaded performance money can buy and get 6x the latency of a machine from the 70s.

It doesn’t matter because the game loop runs at 60 Hz

This is fundamentally the same fallacy as above. If you have a delay that’s half the duration of a clock period, there’s a 50% chance the delay will push the event into the next processing step. That’s better than a 100% chance, but it’s not clear to me why people think that you’d need a delay as long as the clock period for the delay to matter. And for reference, the 45ms delta between the slowest and fastest keyboard measured here corresponds to 2.7 frames at 60fps.

Keyboards can’t possibly respond more quickly than 5ms/10ms/20ms due to debouncing

Even without going through contortions to optimize the switch mechanism, if you’re willing to put hysteresis into the system, there’s no reason that the keyboard can’t assume a keypress (or release) is happening the moment it sees an edge. This is commonly done for other types of systems and AFAICT there’s no reason keyboards couldn’t do the same thing (and perhaps some do). The debounce time might limit the repeat rate of the key, but there’s no inherent reason that it has to affect the latency. And if we're looking at the repeat rate, imagine we have a 5ms limit on the rate of change of the key state due to introducing hysteresis. That gives us one full keypress cycle (press and release) every 10ms, or 100 keypresses per second per key, which is well beyond the capacity of any human. You might argue that this introduces a kind of imprecision, which might matter in some applications (music, rhythm games), but that's limited by the switch mechanism. Using a debouncing mechanism with hysteresis doesn't make us any worse off than we were before.
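
To make this concrete, here's a firmware-style sketch of debouncing with hysteresis. It isn't taken from any real keyboard firmware and the 5ms hold-off is just an assumed value, but it shows the point: the reported state changes the instant an edge is seen, and the hold-off only limits how soon the state can change again, bounding the repeat rate rather than adding latency.

#include <stdbool.h>
#include <stdint.h>

#define DEBOUNCE_HOLDOFF_US 5000  // assumed 5ms hold-off

typedef struct {
    bool     stable_state;    // last reported key state
    uint32_t last_change_us;  // timestamp of the last reported change
} debounce_t;

void debounce_init(debounce_t *d, bool initial_state, uint32_t now_us) {
    d->stable_state = initial_state;
    d->last_change_us = now_us - DEBOUNCE_HOLDOFF_US;  // allow an immediate first edge
}

// Called from a fast scan loop; raw is the sampled switch level, now_us a
// monotonic timestamp. Returns the debounced state. An edge is accepted
// immediately; bounces during the hold-off window are ignored.
bool debounce_update(debounce_t *d, bool raw, uint32_t now_us) {
    if (raw != d->stable_state &&
        (uint32_t)(now_us - d->last_change_us) >= DEBOUNCE_HOLDOFF_US) {
        d->stable_state = raw;       // accept the edge with no added delay
        d->last_change_us = now_us;  // start the hold-off clock
    }
    return d->stable_state;
}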

An additional problem with debounce delay is that most keyboard manufacturers seem to have confounded scan rate and debounce delay. It's common to see keyboards with scan rates in the 100 Hz to 200 Hz range. This is justified by statements like "there's no point in scanning faster because the debounce delay is 5ms", which combines two fallacies mentioned above. If you pull out the schematics for the Apple 2e, you can see that the scan rate is roughly 50 kHz. Its debounce time is roughly 6ms, which corresponds to a frequency of 167 Hz. Why scan so quickly? The fast scan allows the keyboard controller to start the clock on the debounce time almost immediately (after at most 20 microseconds), as opposed to a modern keyboard that scans at 167 Hz, which might not start the clock on debouncing for 6ms, or after 300x as much time.

Apologies for not explaining terminology here, but I think that anyone making this objection should understand the explanation :-).

Appendix: experimental setup

The USB measurement setup was a USB cable that was cut open so that a logic analyzer could monitor the data lines. Cutting open the cable damages the signal integrity and I found that, with a very long cable, some keyboards that weakly drive the data lines didn't drive them strongly enough to get a good signal with the cheap logic analyzer I used.

The start-of-input was measured by pressing two keys at once -- one key on the keyboard and a button that was also connected to the logic analyzer. This introduces some jitter as the two buttons won’t be pressed at exactly the same time. To calibrate the setup, we used two identical buttons connected to the logic analyzer. The median jitter was < 1ms and the 90%-ile jitter was roughly 5ms. This is enough that tail latency measurements for quick keyboards aren’t really possible with this setup, but average latency measurements like the ones done here seem like they should be ok. The input jitter could probably be reduced to a negligible level by building a device to both trigger the logic analyzer and press a key on the keyboard under test at the same time. Average latency measurements would also get better with such a setup (because it would be easier to run a large number of measurements).

If you want to know the exact setup, an E-switch LL1105AF065Q switch was used. Power and ground were supplied by an arduino board. There’s no particular reason to use this setup. In fact, it’s a bit absurd to use an entire arduino to provide power, but this was done with spare parts that were lying around and this stuff just happened to be stuff that RC had in their lab, with the exception of the switches. There weren’t two identical copies of any switch, so we bought a few switches so we could do calibration measurements with two identical switches. The exact type of switch isn’t important here; any low-resistance switch would do.

Tests were done by pressing the z key and then looking for byte 29 on the USB bus and then marking the end of the first packet containing the appropriate information. But, as above, any key would do.

I don't actually trust this setup and I'd like to build a completely automated setup before testing more keyboards. While the measurements are in line with the one other keyboard measurement I could find online, this setup has an inherent imprecision that's probably in the 1ms to 10ms range. While averaging across multiple measurements reduces that imprecision, since the measurements are done by a human, it's not guaranteed and perhaps not even likely that the errors are independent and will average out.

This project was done with help from Wesley Aptekar-Cassels, Leah Hanson, and Kate Murphy.

Thanks to RC, Ahmad Jarara, Raph Levien, Peter Bhat Harkins, Brennan Chesley, Dan Bentley, Kate Murphy, Christian Ternus, Sophie Haskins, and Dan Puttick, for letting us use their keyboards for testing.

Thanks to Leah Hanson, Mark Feeney, Greg Kennedy, and Zach Allaun for comments/corrections/discussion on this post.

Branch prediction

2017-08-23 08:00:00

This is a pseudo-transcript for a talk on branch prediction given at Two Sigma on 8/22/2017 to kick off "localhost", a talk series organized by RC.

How many of you use branches in your code? Could you please raise your hand if you use if statements or pattern matching?

Most of the audience raises their hands

I won’t ask you to raise your hands for this next part, but my guess is that if I asked, how many of you feel like you have a good understanding of what your CPU does when it executes a branch and what the performance implications are, and how many of you feel like you could understand a modern paper on branch prediction, fewer people would raise their hands.

The purpose of this talk is to explain how and why CPUs do “branch prediction” and then explain enough about classic branch prediction algorithms that you could read a modern paper on branch prediction and basically know what’s going on.

Before we talk about branch prediction, let’s talk about why CPUs do branch prediction. To do that, we’ll need to know a bit about how CPUs work.

For the purposes of this talk, you can think of your computer as a CPU plus some memory. The instructions live in memory and the CPU executes a sequence of instructions from memory, where instructions are things like “add two numbers” or “move a chunk of data from memory to the processor”. Normally, after executing one instruction, the CPU will execute the instruction that’s at the next sequential address. However, there are instructions called “branches” that let you change the address the next instruction comes from.

Here’s an abstract diagram of a CPU executing some instructions. The x-axis is time and the y-axis distinguishes different instructions.

Instructions executing sequentially

Here, we execute instruction A, followed by instruction B, followed by instruction C, followed by instruction D.

One way you might design a CPU is to have the CPU do all of the work for one instruction, then move on to the next instruction, do all of the work for the next instruction, and so on. There’s nothing wrong with this; a lot of older CPUs did this, and some modern very low-cost CPUs still do this. But if you want to make a faster CPU, you might make a CPU that works like an assembly line. That is, you break the CPU up into two parts, so that half the CPU can do the “front half” of the work for an instruction while half the CPU works on the “back half” of the work for an instruction, like an assembly line. This is typically called a pipelined CPU.

Instructions with overlapping execution

If you do this, the execution might look something like the above. After the first half of instruction A is complete, the CPU can work on the second half of instruction A while the first half of instruction B runs. And when the second half of A finishes, the CPU can start on both the second half of B and the first half of C. In this diagram, you can see that the pipelined CPU can execute twice as many instructions per unit time as the unpipelined CPU above.

There’s no reason that a CPU can only be broken up into two parts. We could break the CPU into three parts, and get a 3x speedup, or four parts and get a 4x speedup. This isn’t strictly true, and we generally get less than a 3x speedup for a three-stage pipeline or 4x speedup for a 4-stage pipeline because there’s overhead in breaking the CPU up into more parts and having a deeper pipeline.

One source of overhead is how branches are handled. One of the first things the CPU has to do for an instruction is to get the instruction; to do that, it has to know where the instruction is. For example, consider the following code:

if (x == 0) {
  // Do stuff
} else {
  // Do other stuff (things)
}
  // Whatever happens later

This might turn into assembly that looks something like

branch_if_not_equal x, 0, else_label
// Do stuff
goto end_label
else_label:
// Do things
end_label:
// whatever happens later

In this example, we compare x to 0. If x is not equal to 0, we branch to else_label and execute the code in the else block. If x is 0, we fall through, execute the code in the if block, and then jump to end_label in order to avoid executing the code in the else block.

The particular sequence of instructions that’s problematic for pipelining is

branch_if_not_equal x, 0, else_label
???

The CPU doesn’t know if this is going to be

branch_if_not_equal x, 0, else_label
// Do stuff

or

branch_if_not_equal x, 0, else_label
// Do things

until the branch has finished (or nearly finished) executing. Since one of the first things the CPU needs to do for an instruction is to get the instruction from memory, and we don’t know which instruction ??? is going to be, we can’t even start on ??? until the previous instruction is nearly finished.

Earlier, when we said that we’d get a 3x speedup for a 3-stage pipeline or a 20x speedup for a 20-stage pipeline, that assumed that you could start a new instruction every cycle, but in this case the two instructions are nearly serialized.

non-overlapping execution due to branch stall

One way around this problem is to use branch prediction. When a branch shows up, the CPU will guess if the branch was taken or not taken.

speculating about a branch result

In this case, the CPU predicts that the branch won’t be taken and starts executing the first half of stuff while it’s executing the second half of the branch. If the prediction is correct, the CPU will execute the second half of stuff and can start another instruction while it’s executing the second half of stuff, like we saw in the first pipeline diagram.

overlapped execution after a correct prediction

If the prediction is wrong, when the branch finishes executing, the CPU will throw away the result from stuff.1 and start executing the correct instructions instead of the wrong instructions. Since we would’ve stalled the processor and not executed any instructions if we didn’t have branch prediction, we’re no worse off than we would’ve been had we not made a prediction (at least at the level of detail we’re looking at).

aborted prediction

What’s the performance impact of doing this? To make an estimate, we’ll need a performance model and a workload. For the purposes of this talk, our cartoon model of a CPU will be a pipelined CPU where non-branches take an average of one cycle, unpredicted or mispredicted branches take 20 cycles, and correctly predicted branches take one cycle.

If we look at the most commonly used benchmark of “workstation” integer workloads, SPECint, the composition is maybe 20% branches, and 80% other operations. Without branch prediction, we then expect the “average” instruction to take branch_pct * 20 + non_branch_pct * 1 = 0.2 * 20 + 0.8 * 1 = 4 + 0.8 = 4.8 cycles. With perfect, 100% accurate, branch prediction, we’d expect the average instruction to take 0.8 * 1 + 0.2 * 1 = 1 cycle, a 4.8x speedup! Another way to look at it is that if we have a pipeline with a 20-cycle branch misprediction penalty, we have nearly a 5x overhead from our ideal pipelining speedup just from branches alone.
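
Here’s that cost model written out as a small C program -- a sketch of the cartoon model above, not a simulation of any real CPU. Plugging in the prediction accuracies used in the rest of this talk reproduces the per-instruction costs quoted below.

#include <stdio.h>

// Cartoon cost model: 80% non-branches at 1 cycle each, 20% branches that
// cost 1 cycle when predicted correctly and 20 cycles when mispredicted
// (or when there's no prediction at all).
static double avg_cycles_per_instruction(double prediction_accuracy) {
    const double branch_frac = 0.2, non_branch_frac = 0.8;
    return non_branch_frac * 1.0
         + branch_frac * (prediction_accuracy * 1.0
                          + (1.0 - prediction_accuracy) * 20.0);
}

int main(void) {
    // 0% accuracy reproduces the 4.8 cycles/instruction above; the other
    // accuracies correspond to the schemes discussed below.
    const double accuracies[] = { 0.0, 0.7, 0.8, 0.85, 0.9, 1.0 };
    for (int i = 0; i < 6; i++)
        printf("accuracy %3.0f%% -> %.2f cycles/instruction\n",
               accuracies[i] * 100, avg_cycles_per_instruction(accuracies[i]));
    return 0;
}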

Let’s see what we can do about this. We’ll start with the most naive things someone might do and work our way up to something better.

Predict taken

Instead of predicting randomly, we could look at all branches in the execution of all programs. If we do this, we’ll see that taken and not-taken branches aren’t exactly balanced -- there are substantially more taken branches than not-taken branches. One reason for this is that loop branches are often taken.

If we predict that every branch is taken, we might get 70% accuracy, which means we’ll pay the misprediction cost for 30% of branches, making the cost of an average instruction (0.8 + 0.7 * 0.2) * 1 + 0.3 * 0.2 * 20 = 0.94 + 1.2 = 2.14 cycles. If we compare always predicting taken to no prediction and perfect prediction, always predicting taken gets a large fraction of the benefit of perfect prediction despite being a very simple algorithm.

2.14 cycles per instruction

Backwards taken forwards not taken (BTFNT)

Predicting branches as taken works well for loops, but not so great for all branches. If we look at whether or not branches are taken based on whether or not the branch is forward (skips over code) or backwards (goes back to previous code), we can see that backwards branches are taken more often than forward branches, so we could try a predictor which predicts that backward branches are taken and forward branches aren’t taken (BTFNT). If we implement this scheme in hardware, compiler writers will conspire with us to arrange code such that branches the compiler thinks will be taken will be backwards branches and branches the compiler thinks won’t be taken will be forward branches.

If we do this, we might get something like 80% prediction accuracy, making our cost function (0.8 + 0.8 * 0.2) * 1 + 0.2 * 0.2 * 20 = 0.96 + 0.8 = 1.76 cycles per instruction.

1.76 cycles per instruction

Used by

  • PPC 601(1993): also uses compiler generated branch hints
  • PPC 603

One-bit

So far, we’ve looked at schemes that don’t store any state, i.e., schemes where the prediction ignores the program’s execution history. These are called static branch prediction schemes in the literature. These schemes have the advantage of being simple but they have the disadvantage of being bad at predicting branches whose behavior changes over time. If you want an example of a branch whose behavior changes over time, you might imagine some code like

if (flag) {
  // things
  }

Over the course of the program, we might have one phase of the program where the flag is set and the branch is taken and another phase of the program where flag isn’t set and the branch isn’t taken. There’s no way for a static scheme to make good predictions for a branch like that, so let’s consider dynamic branch prediction schemes, where the prediction can change based on the program history.

One of the simplest things we might do is to make a prediction based on the last result of the branch, i.e., we predict taken if the branch was taken last time and we predict not taken if the branch wasn’t taken last time.

Since having one bit for every possible branch is too many bits to feasibly store, we’ll keep a table of some number of branches we’ve seen and their last results. For this talk, let’s store not taken as 0 and taken as 1.

prediction table with 1-bit entries indexed by low bits of branch address

In this case, just to make things fit on a diagram, we have a 64-entry table, which means that we can index into the table with 6 bits, so we index into the table with the low 6 bits of the branch address. After we execute a branch, we update the entry in the prediction table (highlighted below) and the next time the branch is executed again, we index into the same entry and use the updated value for the prediction.

indexed entry changes on update

It’s possible that we’ll observe aliasing and two branches in two different locations will map to the same location. This isn’t ideal, but there’s a tradeoff between table speed & cost vs. size that effectively limits the size of the table.
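
In code, the whole scheme is just a 64-entry array of one-bit entries plus the index function. This is a sketch that matches the toy table size in the diagram (a real predictor would have many more entries), with entries arbitrarily initialized to not taken.

#include <stdbool.h>
#include <stdint.h>

static bool one_bit_table[64];  // 0 = not taken, 1 = taken; starts as not taken

static unsigned predictor_index(uint64_t branch_address) {
    return branch_address & 0x3f;  // low 6 bits; aliasing branches share an entry
}

bool predict_one_bit(uint64_t branch_address) {
    return one_bit_table[predictor_index(branch_address)];
}

void update_one_bit(uint64_t branch_address, bool taken) {
    one_bit_table[predictor_index(branch_address)] = taken;  // remember last outcome
}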

If we use a one-bit scheme, we might get 85% accuracy, a cost of (0.8 + 0.85 * 0.2) * 1 + 0.15 * 0.2 * 20 = 0.97 + 0.6 = 1.57 cycles per instruction.

1.57 cycles per instruction

Used by

  • DEC EV4 (1992)
  • MIPS R8000 (1994)

Two-bit

A one-bit scheme works fine for patterns like TTTTTTTT… or NNNNNNN… but will mispredict twice for a stream of branches that’s mostly taken but has one branch that’s not taken, ...TTTNTTT... (once on the not-taken branch and again on the taken branch that follows it). This can be fixed by adding a second bit for each address and implementing a saturating counter. Let’s arbitrarily say that we count down when a branch is not taken and count up when it’s taken. If we look at the binary values, we’ll then end up with:

00: predict Not
01: predict Not
10: predict Taken
11: predict Taken

The “saturating” part of saturating counter means that if we count down from 00, instead of underflowing, we stay at 00, and similarly, if we count up from 11, we stay at 11. This scheme is identical to the one-bit scheme, except that each entry in the prediction table is two bits instead of one bit.

same as 1-bit, except that the table has 2 bits
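
Here’s the same toy sketch with two-bit saturating counters instead of single bits (again, just an illustration of the mechanism):

TABLE_BITS = 6
table = [0] * (1 << TABLE_BITS)   # 00/01 predict not taken, 10/11 predict taken

def predict(branch_address):
    return table[branch_address & ((1 << TABLE_BITS) - 1)] >= 2

def update(branch_address, taken):
    i = branch_address & ((1 << TABLE_BITS) - 1)
    if taken:
        table[i] = min(table[i] + 1, 3)   # saturate at 11 instead of wrapping
    else:
        table[i] = max(table[i] - 1, 0)   # saturate at 00 instead of wrapping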

Compared to a one-bit scheme, a two-bit scheme can have half as many entries at the same size/cost (if we only consider the cost of storage and ignore the cost of the logic for the saturating counter), but even so, for most reasonable table sizes a two-bit scheme provides better accuracy.

Despite being simple, this works quite well, and we might expect to see something like 90% accuracy for a two bit predictor, which gives us a cost of 1.38 cycles per instruction.

1.38 cycles per instruction

One natural thing to do would be to generalize the scheme to an n-bit saturating counter, but it turns out that adding more bits has a relatively small effect on accuracy. We haven’t really discussed the cost of the branch predictor, but going from 2 bits to 3 bits per branch increases the table size by 1.5x for little gain, which makes it not worth the cost in most cases. The simplest and most common things that we won’t predict well with a two-bit scheme are patterns like NTNTNTNTNT... or NNTNNTNNT…, but going to n-bits won’t let us predict those patterns well either!

Used by

  • LLNL S-1 (1977)
  • CDC Cyber? (early 80s)
  • Burroughs B4900 (1982): state stored in instruction stream; hardware would over-write instruction to update branch state
  • Intel Pentium (1993)
  • PPC 604 (1994)
  • DEC EV45 (1993)
  • DEC EV5 (1995)
  • PA 8000 (1996): actually a 3-bit shift register with majority vote

Two-level adaptive, global (1991)

If we think about code like

for (int i = 0; i < 3; ++i) {
  // code here.
}

That code will produce a pattern of branches like TTTNTTTNTTTN....

If we know the last three executions of the branch, we should be able to predict the next execution of the branch:

TTT:N
TTN:T
TNT:T
NTT:T

The previous schemes we’ve considered use the branch address to index into a table that tells us if the branch is, according to recent history, more likely to be taken or not taken. That tells us which direction the branch is biased towards, but it can’t tell us that we’re in the middle of a repetitive pattern. To fix that, we’ll store the history of the most recent branches as well as a table of predictions.

Use global branch history and branch address to index into prediction table

In this example, we concatenate 4 bits of branch history together with 2 bits of branch address to index into the prediction table. As before, the prediction comes from a 2-bit saturating counter. We don’t want to only use the branch history to index into our prediction table since, if we did that, any two branches with the same history would alias to the same table entry. In a real predictor, we’d probably have a larger table and use more bits of branch address, but in order to fit the table on a slide, we have an index that’s only 6 bits long.

Below, we’ll see what gets updated when we execute a branch.

Update changes index because index uses bits from branch history

The bolded parts are the parts that were updated. In this diagram, we shift new bits of branch history in from right to left, updating the branch history. Because the branch history is updated, the low bits of the index into the prediction table are updated, so the next time we take the same branch again, we’ll use a different entry in the table to make the prediction, unlike in previous schemes where the index is fixed by the branch address. The old entry’s value is updated so that the next time we take the same branch again with the same branch history, we’ll have the updated prediction.
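
As a toy sketch of what’s in the diagram -- 4 bits of global history concatenated with 2 bits of branch address and 2-bit counters in the prediction table (a real predictor would use a larger table and more address bits):

HISTORY_BITS = 4
ADDRESS_BITS = 2
history = 0                                          # global branch history register
table = [0] * (1 << (HISTORY_BITS + ADDRESS_BITS))   # 2-bit saturating counters

def table_index(branch_address):
    addr = branch_address & ((1 << ADDRESS_BITS) - 1)
    return (addr << HISTORY_BITS) | history          # concatenate address and history bits

def predict(branch_address):
    return table[table_index(branch_address)] >= 2

def update(branch_address, taken):
    global history
    i = table_index(branch_address)
    table[i] = min(table[i] + 1, 3) if taken else max(table[i] - 1, 0)
    # Shift the newest outcome into the global history.
    history = ((history << 1) | (1 if taken else 0)) & ((1 << HISTORY_BITS) - 1)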

Since the history in this scheme is global, this will correctly predict patterns like NTNTNTNT… in inner loops, but may not always make correct predictions for higher-level branches because the history is global and will be contaminated with information from other branches. However, the tradeoff here is that keeping a global history is cheaper than keeping a table of local histories. Additionally, using a global history lets us correctly predict correlated branches. For example, we might have something like:

if x > 0:
  x -= 1
if y > 0:
  y -= 1
if x * y > 0:
  foo()

If either the first branch or the next branch isn’t taken, then the third branch definitely will not be taken.

With this scheme, we might get 93% accuracy, giving us 1.27 cycles per instruction.

1.27 cycles per instruction

Used by

  • Pentium MMX (1996): 4-bit global branch history

Two-level adaptive, local (1992)

As mentioned above, an issue with the global history scheme is that the branch history for local branches that could be predicted cleanly gets contaminated by other branches.

One way to get good local predictions is to keep separate branch histories for separate branches.

keep a table of per-branch histories instead of a global history

Instead of keeping a single global history, we keep a table of local histories, indexed by the branch address. This scheme is identical to the global scheme we just looked at, except that we keep multiple branch histories. One way to think about this is that having global history is a special case of local history, where the number of histories we keep track of is 1.
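
In a sketch, the only change from the global scheme above is where the history comes from -- a small table of per-branch histories instead of a single register (sizes are, as before, arbitrary illustration values):

HISTORY_BITS = 4
ADDRESS_BITS = 2
histories = [0] * (1 << ADDRESS_BITS)                # one history per (low bits of) branch address
table = [0] * (1 << (HISTORY_BITS + ADDRESS_BITS))   # 2-bit saturating counters, as before

def table_index(branch_address):
    addr = branch_address & ((1 << ADDRESS_BITS) - 1)
    return (addr << HISTORY_BITS) | histories[addr]

def update_history(branch_address, taken):
    addr = branch_address & ((1 << ADDRESS_BITS) - 1)
    histories[addr] = ((histories[addr] << 1) | (1 if taken else 0)) & ((1 << HISTORY_BITS) - 1)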

With this scheme, we might get something like 94% accuracy, which gives us a cost of 1.23 cycles per instruction.

1.23 cycles per instruction

Used by

gshare

One tradeoff a global two-level scheme has to make is that, for a prediction table of a fixed size, bits must be dedicated to either the branch history or the branch address. We’d like to give more bits to the branch history because that allows correlations across greater “distance” as well as tracking more complicated patterns and we’d like to give more bits to the branch address to avoid interference between unrelated branches.

We can try to get the best of both worlds by hashing both the branch history and the branch address instead of concatenating them. One of the simplest reasonable things one might do, and the first proposed mechanism, was to xor them together. This two-level adaptive scheme, where we xor the bits together, is called gshare.

hash branch address and branch history instead of appending

With this scheme, we might see something like 94% accuracy. That’s the accuracy we got from the local scheme we just looked at, but gshare avoids having to keep a large table of local histories; getting the same accuracy while having to track less state is a significant improvement.
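
In a sketch, the only difference from the two-level schemes above is the index computation (sizes are again arbitrary illustration values):

INDEX_BITS = 6
table = [0] * (1 << INDEX_BITS)   # 2-bit saturating counters, as before
history = 0

def table_index(branch_address):
    # xor instead of concatenating, so every index bit sees both the
    # branch address and the global history.
    return (branch_address ^ history) & ((1 << INDEX_BITS) - 1)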

Used by

agree (1997)

One reason for branch mispredictions is interference between different branches that alias to the same location. There are many ways to reduce interference between branches that alias to the same predictor table entry. In fact, the reason this talk doesn’t get past schemes invented in the 90s is that a wide variety of interference-reducing schemes were proposed and there are too many to cover in half an hour.

We’ll look at one scheme which might give you an idea of what an interference-reducing scheme could look like, the “agree” predictor. When two branch-history pairs collide, the predictions either match or they don’t. If they match, we’ll call that neutral interference and if they don’t, we’ll call that negative interference. The idea is that most branches tend to be strongly biased (that is, if we use two-bit entries in the predictor table, we expect that, without interference, most entries will be 00 or 11 most of the time, not 01 or 10). For each branch in the program, we’ll store one bit, which we call the “bias”. The table of predictions will, instead of storing the absolute branch predictions, store whether or not the prediction matches or does not match the bias.

predict whether or not a branch agrees with its bias as opposed to whether or not it's taken

If we look at how this works, the predictor is identical to a gshare predictor, except that we make the changes mentioned above -- the prediction is agree/disagree instead of taken/not-taken and we have a bias bit that’s indexed by the branch address, which gives us something to agree or disagree with. In the original paper, they propose using the first thing you see as the bias and other people have proposed using profile-guided optimization (basically running the program and feeding the data back to the compiler) to determine the bias.

Note that, when we execute a branch and then later come back around to the same branch, we’ll use the same bias bit because the bias is indexed by the branch address, but we’ll use a different predictor table entry because that’s indexed by both the branch address and the branch history.

updating uses the same bias but a different meta-prediction table entry

If it seems weird that this would do anything, let’s look at a concrete example. Say we have two branches, branch A which is taken with 90% probability and branch B which is taken with 10% probability. If those two branches alias and we assume the probabilities that each branch is taken are independent, the probability that they disagree and negatively interfere is P(A taken) * P(B not taken) + P(A not taken) * P(B taken) = (0.9 * 0.9) + (0.1 * 0.1) = 0.82.

If we use the agree scheme, we can re-do the calculation above, but the probability that the two branches disagree and negatively interfere is P(A agree) * P(B disagree) + P(A disagree) * P(B agree) = P(A taken) * P(B taken) + P(A not taken) * P(B not taken) = (0.9 * 0.1) + (0.1 * 0.9) = 0.18. Another way to look at it is, to have destructive interference, one of the branches must disagree with its bias. By definition, if we’ve correctly determined the bias, this is unlikely to happen.
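
Here’s a toy sketch of the mechanism on top of gshare-style indexing. The “first thing you see” bias initialization follows the description above, but the table sizes and the choice to initialize the counters to “strongly agree” are arbitrary choices for illustration:

INDEX_BITS = 6
bias = [None] * (1 << INDEX_BITS)      # per-branch bias bit, indexed by low address bits
agree_table = [3] * (1 << INDEX_BITS)  # 2-bit counters; >= 2 means "agrees with the bias"
history = 0

def table_index(branch_address):
    return (branch_address ^ history) & ((1 << INDEX_BITS) - 1)

def predict(branch_address):
    b = bias[branch_address & ((1 << INDEX_BITS) - 1)]
    if b is None:
        return True                    # no bias recorded yet; arbitrarily guess taken
    agrees = agree_table[table_index(branch_address)] >= 2
    return b if agrees else not b

def update(branch_address, taken):
    global history
    a = branch_address & ((1 << INDEX_BITS) - 1)
    if bias[a] is None:
        bias[a] = taken                # "first thing you see" bias
    i = table_index(branch_address)
    agrees = (taken == bias[a])
    agree_table[i] = min(agree_table[i] + 1, 3) if agrees else max(agree_table[i] - 1, 0)
    history = ((history << 1) | (1 if taken else 0)) & ((1 << INDEX_BITS) - 1)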

With this scheme, we might get something like 95% accuracy, giving us 1.19 cycles per instruction.

1.19 cycles per instruction

Used by

  • PA-RISC 8700 (2001)

Hybrid (1993)

As we’ve seen, local predictors can predict some kinds of branches well (e.g., inner loops) and global predictors can predict some kinds of branches well (e.g., some correlated branches). One way to try to get the best of both worlds is to have both predictors, then have a meta-predictor that predicts if the local or the global predictor should be used. A simple way to do this is to have the meta-predictor use the same scheme as the two-bit predictor above, except that instead of predicting taken or not taken it predicts whether to use the local predictor or the global predictor.

predict which of two predictors is correct instead of predicting if the branch is taken

Just as there are many possible interference-reducing schemes, of which the agree predictor above is one, there are many possible hybrid schemes. We could use any two predictors, not just a local and global predictor, and we could even use more than two predictors.
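
Here’s a toy sketch of the meta-predictor; the detail of only training the chooser when the two predictors disagree is not from the talk, it’s just one common way to do it:

INDEX_BITS = 6
chooser = [2] * (1 << INDEX_BITS)   # 2-bit counters; >= 2 means "use the global predictor"

def choose(branch_address, local_prediction, global_prediction):
    i = branch_address & ((1 << INDEX_BITS) - 1)
    return global_prediction if chooser[i] >= 2 else local_prediction

def update_chooser(branch_address, taken, local_prediction, global_prediction):
    i = branch_address & ((1 << INDEX_BITS) - 1)
    if local_prediction != global_prediction:
        # Count toward whichever predictor got this branch right
        # (one common choice; the talk doesn't specify this detail).
        if global_prediction == taken:
            chooser[i] = min(chooser[i] + 1, 3)
        else:
            chooser[i] = max(chooser[i] - 1, 0)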

If we use a local and global predictor, we might get something like 96% accuracy, giving us 1.15 cycles per instruction.

1.15 cycles per instruction

Used by

  • DEC EV6 (1998): combination of local (1k entries, 10 history bits, 3 bit counter) & global (4k entries, 12 history bits, 2 bit counter) predictors
  • IBM POWER4 (2001): local (16k entries) & gshare (16k entries, 11 history bits, xor with branch address, 16k selector table)
  • IBM POWER5 (2004): combination of bimodal (not covered) and two-level adaptive
  • IBM POWER7 (2010)

Not covered

There are a lot of things we didn’t cover in this talk! As you might expect, the set of material that we didn’t cover is much larger than what we did cover. I’ll briefly describe a few things we didn’t cover, with references, so you can look them up if you’re interested in learning more.

One major thing we didn’t talk about is how to predict the branch target. Note that this needs to be done even for some unconditional branches (that is, branches that don’t need directional prediction because they’re always taken), since (some) unconditional branches have unknown branch targets.

Branch target prediction is expensive enough that some early CPUs had a branch prediction policy of “always predict not taken” because a branch target isn’t necessary when you predict the branch won’t be taken! Always predicting not taken has poor accuracy, but it’s still better than making no prediction at all.

Among the interference-reducing predictors we didn’t discuss are bi-mode, gskew, and YAGS. Very briefly, bi-mode is somewhat like agree in that it tries to separate out branches based on direction, but the mechanism used in bi-mode is that we keep multiple predictor tables and a third predictor based on the branch address is used to predict which predictor table gets used for the particular combination of branch and branch history. Bi-mode appears to be more successful than agree in that it's seen wider use. With gskew, we keep at least three predictor tables and use a different hash to index into each table. The idea is that, even if two branches alias, those two branches will only alias in one of the tables, so we can use a vote and the result from the other two tables will override the potentially bad result from the aliasing table. I don't know how to describe YAGS very briefly :-).

Because we didn't talk about speed (as in latency), a prediction strategy we didn't talk about is to have a small/fast predictor that can be overridden by a slower and more accurate predictor when the slower predictor computes its result.

Some modern CPUs have completely different branch predictors; AMD Zen (2017) and AMD Bulldozer (2011) chips appear to use perceptron based branch predictors. Perceptrons are single-layer neural nets.

It’s been argued that Intel Haswell (2013) uses a variant of a TAGE predictor. TAGE stands for TAgged GEometric history length predictor. If we look at the predictors we’ve covered and look at actual executions of programs to see which branches we’re not predicting correctly, one major class of branches are branches that need a lot of history -- a significant number of branches need tens or hundreds of bits of history and some even need more than a thousand bits of branch history. If we have a single predictor or even a hybrid predictor that combines a few different predictors, it’s counterproductive to keep a thousand bits of history because that will make predictions worse for the branches which need a relatively small amount of history (especially relative to the cost), which is most branches. One of the ideas in the TAGE predictor is that, by keeping a geometric series of history lengths, each branch can use the appropriate history. That explains the GE. The TA part is that branches are tagged, which is a mechanism we don’t discuss that the predictor uses to track which branches should use which set of history.

Modern CPUs often have specialized predictors, e.g., a loop predictor can accurately predict loop branches in cases where a generalized branch predictor couldn’t reasonably store enough history to make perfect predictions for every iteration of the loop.

We didn’t talk at all about the tradeoff between using up more space and getting better predictions. Not only does changing the size of the table change the performance of a predictor, it also changes which predictors are better relative to each other.

We also didn’t talk at all about how different workloads affect different branch predictors. Predictor performance varies not only based on table size but also based on which particular program is run.

We’ve also talked about branch misprediction cost as if it’s a fixed thing, but it is not, and for that matter, the cost of non-branch instructions also varies widely between different workloads.

I tried to avoid introducing non-self-explanatory terminology when possible, so if you read the literature, terminology will be somewhat different.

Conclusion

We’ve looked at a variety of classic branch predictors and very briefly discussed a couple of newer predictors. Some of the classic predictors we discussed are still used in CPUs today, and if this were an hour long talk instead of a half-hour long talk, we could have discussed state-of-the-art predictors. I think that a lot of people have an idea that CPUs are mysterious and hard to understand, but I think that CPUs are actually easier to understand than software. I might be biased because I used to work on CPUs, but I think that this is not a result of my bias but something fundamental.

If you think about the complexity of software, the main limiting factor on complexity is your imagination. If you can imagine something in enough detail that you can write it down, you can make it. Of course there are cases where that’s not the limiting factor and there’s something more practical (e.g., the performance of large scale applications), but I think that most of us spend most of our time writing software where the limiting factor is the ability to create and manage complexity.

Hardware is quite different from this in that there are forces that push back against complexity. Every chunk of hardware you implement costs money, so you want to implement as little hardware as possible. Additionally, performance matters for most hardware (whether that’s absolute performance or performance per dollar or per watt or per other cost), and adding complexity makes hardware slower, which limits performance. Today, you can buy an off-the-shelf CPU for $300 which can be overclocked to 5 GHz. At 5 GHz, one unit of work is one-fifth of one nanosecond. For reference, light travels roughly one foot in one nanosecond. Another limiting factor is that people get pretty upset when CPUs don’t work perfectly all of the time. Although CPUs do have bugs, the rate of bugs is much lower than in almost all software, i.e., the standard to which they’re verified/tested is much higher. Adding complexity makes things harder to test and verify. Because CPUs are held to a higher correctness standard than most software, adding complexity creates a much higher test/verification burden on CPUs, which makes adding a similar amount of complexity much more expensive in hardware than in software, even ignoring the other factors we discussed.

A side effect of these factors that push back against chip complexity is that, for any particular “high-level” general purpose CPU feature, it is generally conceptually simple enough that it can be described in a half-hour or hour-long talk. CPUs are simpler than many programmers think! BTW, I say “high-level” to rule out things like how transistors work and circuit design, which can require a fair amount of low-level (physics or solid-state) background to understand.

CPU internals series

Thanks to Leah Hanson, Hari Angepat, and Nick Bergson-Shilcock for reviewing practice versions of the talk and to Fred Clausen Jr for finding a typo in this post. Apologies for the somewhat slapdash state of this post -- I wrote it quickly so that people who attended the talk could refer to the “transcript ” soon afterwards and look up references, but this means that there are probably more than the usual number of errors and that the organization isn’t as nice as it would be for a normal blog post. In particular, things that were explained using a series of animations in the talk are not explained in the same level of detail and on skimming this, I notice that there’s less explanation of what sorts of branches each predictor doesn’t handle well, and hence less motivation for each predictor. I may try to go back and add more motivation, but I’m unlikely to restructure the post completely and generate a new set of graphics that better convey concepts when there are a couple of still graphics next to text. Thanks to Julien Vivenot, Ralph Corderoy, Vaibhav Sagar, Mindy Preston, Stefan Kanthak, and Uri Shaked for catching typos in this hastily written post.

Sattolo's algorithm

2017-08-09 08:00:00

I recently had a problem where part of the solution was to do a series of pointer accesses that would walk around a chunk of memory in pseudo-random order. Sattolo's algorithm provides a solution to this because it produces a permutation of a list with exactly one cycle, which guarantees that we will reach every element of the list even though we're traversing it in random order.

However, the explanations of why the algorithm worked that I could find online either used some kind of mathematical machinery (Stirling numbers, assuming familiarity with cycle notation, etc.), or used logic that was hard for me to follow. I find that this is common for explanations of concepts that could, but don't have to, use a lot of mathematical machinery. I don't think there's anything wrong with using existing mathematical methods per se -- it's a nice mental shortcut if you're familiar with the concepts. If you're taking a combinatorics class, it makes sense to cover Stirling numbers and then rattle off a series of results whose proofs are trivial if you're familiar with Stirling numbers, but for people who are only interested in a single result, I think it's unfortunate that it's hard to find a relatively simple explanation that doesn't require any background. When I was looking for a simple explanation, I also found a lot of people who were using Sattolo's algorithm in places where it wasn't appropriate and also people who didn't know that Sattolo's algorithm is what they were looking for, so here's an attempt at an explanation of why the algorithm works that doesn't assume an undergraduate combinatorics background.

Before we look at Sattolo's algorithm, let's look at Fisher-Yates, which is an in-place algorithm that produces a random permutation of an array/vector, where every possible permutation occurs with uniform probability.

We'll look at the code for Fisher-Yates and then how to prove that the algorithm produces the intended result.

import random

def shuffle(a):
    n = len(a)
    for i in range(n - 1):  # i from 0 to n-2, inclusive.
        j = random.randrange(i, n)  # j from i to n-1, inclusive.
        a[i], a[j] = a[j], a[i]  # swap a[i] and a[j].

shuffle takes an array and produces a permutation of the array, i.e., it shuffles the array. We can think of this loop as placing each element of the array, a, in turn, from a[0] to a[n-2]. On some iteration, i, we choose one of n-i elements to swap with and swap element i with some random element. The last element in the array, a[n-1], is skipped because it would always be swapped with itself. One way to see that this produces every possible permutation with uniform probability is to write down the probability that each element will end up in any particular location1. Another way to do it is to observe two facts about this algorithm:

  1. Every output that Fisher-Yates produces is produced with uniform probability
  2. Fisher-Yates produces as many outputs as there are permutations (and each output is a permutation)

(1) For each random choice we make in the algorithm, if we make a different choice, we get a different output. For example, if we look at the resultant a[0], the only way to place the element that was originally in a[k] (for some k) in the resultant a[0] is to swap a[0] with a[k] in iteration 0. If we choose a different element to swap with, we'll end up with a different resultant a[0]. Once we place a[0] and look at the resultant a[1], the same thing is true of a[1] and so on for each a[i]. Additionally, each choice reduces the range by the same amount -- there's a kind of symmetry, in that although we place a[0] first, we could have placed any other element first; every choice has the same effect. This is vaguely analogous to the reason that you can pick an integer uniformly at random by picking digits uniformly at random, one at a time.

(2) How many different outputs does Fisher-Yates produce? On the first iteration, we fix one of n possible choices for a[0], then given that choice, we fix one of n-1 choices for a[1], then one of n-2 for a[2], and so on, so there are n * (n-1) * (n-2) * ... 2 * 1 = n! possible different outputs.

This is exactly the same number of possible permutations of n elements, by pretty much the same reasoning. If we want to count the number of possible permutations of n elements, we first pick one of n possible elements for the first position, n-1 for the second position, and so on resulting in n! possible permutations.

Since Fisher-Yates only produces unique permutations and there are exactly as many outputs as there are permutations, Fisher-Yates produces every possible permutation. Since Fisher-Yates produces each output with uniform probability, it produces all possible permutations with uniform probability.

Now, let's look at Sattolo's algorithm, which is almost identical to Fisher-Yates and also produces a shuffled version of the input, but produces something quite different:

def sattolo(a):
    n = len(a)
    for i in range(n - 1):
        j = random.randrange(i+1, n)  # i+1 instead of i
        a[i], a[j] = a[j], a[i]

Instead of picking an element at random to swap with, like we did in Fisher-Yates, we pick an element at random that is not the element being placed, i.e., we do not allow an element to be swapped with itself. One side effect of this is that no element ends up where it originally started.

Before we talk about why this produces the intended result, let's make sure we're on the same page regarding terminology. One way to look at an array is to view it as a description of a graph where the index indicates the node and the value indicates where the edge points to. For example, if we have the list 0 2 3 1, this can be thought of as a directed graph from its indices to its value, which is a graph with the following edges:

0 -> 0
1 -> 2
2 -> 3
3 -> 1

Node 0 points to itself (because the value at index 0 is 0), node 1 points to node 2 (because the value at index 1 is 2), and so on. If we traverse this graph, we see that there are two cycles. 0 -> 0 -> 0 ... and 1 -> 2 -> 3 -> 1....

Let's say we swap the element in position 0 with some other element. It could be any element, but let's say that we swap it with the element in position 2. Then we'll have the list 3 2 0 1, which can be thought of as the following graph:

0 -> 3
1 -> 2
2 -> 0
3 -> 1

If we traverse this graph, we see the cycle 0 -> 3 -> 1 -> 2 -> 0.... This is an example of a permutation with exactly one cycle.

If we swap two elements that belong to different cycles, we'll merge the two cycles into a single cycle. One way to see this is that when we swap two elements in the list, we're essentially picking up the arrow-heads pointing to each element and swapping where they point (rather than the arrow-tails, which stay put). Tracing the result of this is like tracing a figure-8. Just for example, say we swap 0 with an arbitrary element of the other cycle, let's say element 2; we'll end up with 3 2 0 1, whose only cycle is 0 -> 3 -> 1 -> 2 -> 0.... Note that this operation is reversible -- if we do the same swap again, we end up with two cycles again. In general, if we swap two elements from the same cycle, we break the cycle into two separate cycles.

If we feed a list consisting of 0 1 2 ... n-1 to Sattolo's algorithm we'll get a permutation with exactly one cycle. Furthermore, we have the same probability of generating any permutation that has exactly one cycle. Let's look at why Sattolo's generates exactly one cycle. Afterwards, we'll figure out why it produces all possible cycles with uniform probability.

For Sattolo's algorithm, let's say we start with the list 0 1 2 3 ... n-1, i.e., a list with n cycles of length 1. On each iteration, we do one swap. If we swap elements from two separate cycles, we'll merge the two cycles, reducing the number of cycles by 1. We'll then do n-1 iterations, reducing the number of cycles from n to n - (n-1) = 1.

Now let's see why it's safe to assume we always swap elements from different cycles. In each iteration of the algorithm, we swap some element with index > i with the element at index i and then increment i. Since i gets incremented, the element that gets placed into index i can never be swapped again, i.e., each swap puts one of the two elements that was swapped into its final position, i.e., for each swap, we take two elements that were potentially swappable and render one of them unswappable.

When we start, we have n cycles of length 1, each with 1 element that's swappable. When we swap the initial element with some random element, we'll take one of the swappable elements and render it unswappable, creating a cycle of length 2 with 1 swappable element and leaving us with n-2 other cycles, each with 1 swappable element.

The key invariant that's maintained is that each cycle has exactly 1 swappable element. The invariant holds in the beginning when we have n cycles of length 1. And as long as this is true, every time we merge two cycles of any length, we'll take the swappable element from one cycle and swap it with the swappable element from the other cycle, rendering one of the two elements unswappable and creating a longer cycle that still only has one swappable element, maintaining the invariant.

Since we cannot swap two elements from the same cycle, we merge two cycles with every swap, reducing the number of cycles by 1 with each iteration until we've run n-1 iterations and have exactly one cycle remaining.
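
If you'd like an empirical sanity check in addition to the argument, here's a small helper (not part of the algorithm) that counts the cycles of a permutation, using the sattolo function defined above:

def num_cycles(a):
    # Follow index -> a[index] links, counting how many distinct cycles we visit.
    seen, cycles = set(), 0
    for start in range(len(a)):
        if start in seen:
            continue
        cycles += 1
        i = start
        while i not in seen:
            seen.add(i)
            i = a[i]
    return cycles

a = list(range(10))
sattolo(a)
assert num_cycles(a) == 1  # Sattolo's output always has exactly one cycle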

To see that we generate each cycle with equal probability, note that there's only one way to produce each output, i.e., changing any particular random choice results in a different output. In the first iteration, we randomly choose one of n-1 placements, then n-2, then n-3, and so on, so there are (n-1) * (n-2) * (n-3) * ... * 2 * 1 = (n-1)! possible sequences of choices, and we produce any particular cycle with probability 1 / (n-1)!. If we can show that there are (n-1)! permutations with exactly one cycle, then we'll know that we generate every permutation with exactly one cycle with uniform probability.

Let's say we have an arbitrary list of length n that has exactly one cycle and we add a single element. There are n ways to extend that to become a cycle of length n+1 because there are n places we could add in the new element and keep the cycle, which means that the number of cycles of length n+1, cycles(n+1), is n * cycles(n).

For example, say we have a cycle that produces the path 0 -> 1 -> 2 -> 0 ... and we want to add a new element, 3. We can substitute -> 3 -> for any -> and get a cycle of length 4 instead of length 3.

In the base case, there's one cycle of length 2, the permutation 1 0 (the other permutation of length two, 0 1, has two cycles of length one instead of having a cycle of length 2), so we know that cycles(2) = 1. If we apply the recurrence above, we get that cycles(n) = (n-1)!, which is exactly the number of different permutations that Sattolo's algorithm generates, which means that we generate all possible permutations with one cycle. Since we know that we generate each cycle with uniform probability, we now know that we generate all possible one-cycle permutations with uniform probability.

An alternate way to see that there are (n-1)! permutations with exactly one cycle, is that we rotate each cycle around so that 0 is at the start and write it down as 0 -> i -> j -> k -> .... The number of these is the same as the number of permutations of elements to the right of the 0 ->, which is (n-1)!.
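
The (n-1)! count is also easy to check by brute force for small n, reusing num_cycles from the sketch above:

from itertools import permutations
from math import factorial

def one_cycle_count(n):
    return sum(1 for p in permutations(range(n)) if num_cycles(list(p)) == 1)

for n in range(2, 8):
    assert one_cycle_count(n) == factorial(n - 1)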

Conclusion

We've looked at two algorithms that are identical, except for a two character change. These algorithms produce quite different results -- one algorithm produces a random permutation and the other produces a random permutation with exactly one cycle. I think these algorithms are neat because they're so simple, just a for loop with a swap.

In practice, you probably don't "need" to know how these algorithms work because the standard library for most modern languages will have some way of producing a random shuffle. And if you have a function that will give you a shuffle, you can produce a permutation with exactly one cycle if you don't mind a non-in-place algorithm that takes an extra pass. I'll leave that as an exercise for the reader, but if you want a hint, one way to do it parallels the "alternate" way to see that there are (n-1)! permutations with exactly one cycle.

Although I said that you probably don't need to know this stuff, you do actually need to know it if you're going to implement a custom shuffling algorithm! That may sound obvious, but there's a long history of people implementing incorrect shuffling algorithms. This was common in games and on online gambling sites in the 90s and even the early 2000s and you still see the occasional mis-implemented shuffle, e.g., when Microsoft implemented a bogus shuffle and failed to properly randomize a browser choice poll. At the time, the top Google hit for javascript random array sort was the incorrect algorithm that Microsoft ended up using. That site has been fixed, but you can still find incorrect tutorials floating around online.

Appendix: generating a random derangement

A permutation where no element ends up in its original position is called a derangement. When I searched for uses of Sattolo's algorithm, I found many people using Sattolo's algorithm to generate random derangements. While Sattolo's algorithm generates derangements, it only generates derangements with exactly one cycle, and there are derangements with more than one cycle (e.g., 3 2 1 0), so it can't possibly generate random derangements with uniform probability.

One way to generate random derangements is to generate random shuffles using Fisher-Yates and then retry until we get a derangement:

def is_derangement(a):
    # No element is left in its original position.
    return all(x != i for i, x in enumerate(a))

def derangement(n):
    assert n != 1, "can't have a derangement of length 1"
    a = list(range(n))
    while not is_derangement(a):
        shuffle(a)
    return a

This algorithm is simple, and is overwhelmingly likely to eventually return a derangement (for n != 1), but it's not immediately obvious how long we should expect this to run before it returns a result. Maybe we'll get a derangement on the first try and run shuffle once, or maybe it will take 100 tries and we'll have to do 100 shuffles before getting a derangement.

To figure this out, we'll want to know the probability that a random permutation (shuffle) is a derangement. To get that, we'll want to know, given a list of length n, how many permutations there are and how many derangements there are.

Since we're deep in the appendix, I'll assume that you know the number of permutations of n elements is n!, what binomial coefficients are, and are comfortable with Taylor series.

To count the number of derangements, we can start with the number of permutations, n!, and subtract off permutations where an element remains in its starting position, (n choose 1) * (n - 1)!. That isn't quite right because this double subtracts permutations where two elements remain in their starting positions, so we'll have to add back (n choose 2) * (n - 2)!. That isn't quite right either because we've now overcorrected for permutations where three elements remain in their starting positions, so we'll have to correct for those, and so on and so forth, resulting in ∑ (−1)ᵏ (n choose k)(n−k)!. If we expand this out and divide by n! and cancel things out, we get ∑ (−1)ᵏ (1 / k!). If we look at the limit as the number of elements goes to infinity, this looks just like the Taylor series for e^x where x = -1, i.e., 1/e, i.e., in the limit, we expect that the fraction of permutations that are derangements is 1/e, i.e., we expect to have to do e times as many swaps to generate a derangement as we do to generate a random permutation. Like many alternating series, this series converges quickly; it gets within 7 significant figures of 1/e when k = 10.
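
If you'd rather check the 1/e figure numerically than trust the Taylor series, a brute-force count over all permutations of a small n gets close quickly:

from itertools import permutations
from math import factorial, e

def derangement_fraction(n):
    # Fraction of permutations of n elements with no fixed points.
    count = sum(1 for p in permutations(range(n))
                if all(p[i] != i for i in range(n)))
    return count / factorial(n)

derangement_fraction(8), 1 / e  # both ~0.3679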

One silly thing about our algorithm is that, if we place the first element in the first location, we already know that we don't have a derangement, but we continue placing elements until we've created an entire permutation. If we reject illegal placements, we can do even better than a factor of e overhead. It's also possible to come up with a non-rejection based algorithm, but I really enjoy the naive rejection based algorithm because I find it delightful when basic randomized algorithms that consist of "keep trying again" work well.

Appendix: wikipedia's explanation of Sattolo's algorithm

I wrote this explanation because I found the explanation in Wikipedia relatively hard to follow, but if you find the explanation above difficult to understand, maybe you'll prefer Wikipedia's version:

The fact that Sattolo's algorithm always produces a cycle of length n can be shown by induction. Assume by induction that after the initial iteration of the loop, the remaining iterations permute the first n - 1 elements according to a cycle of length n - 1 (those remaining iterations are just Sattolo's algorithm applied to those first n - 1 elements). This means that tracing the initial element to its new position p, then the element originally at position p to its new position, and so forth, one only gets back to the initial position after having visited all other positions. Suppose the initial iteration swapped the final element with the one at (non-final) position k, and that the subsequent permutation of first n - 1 elements then moved it to position l; we compare the permutation π of all n elements with that remaining permutation σ of the first n - 1 elements. Tracing successive positions as just mentioned, there is no difference between σ and π until arriving at position k. But then, under π the element originally at position k is moved to the final position rather than to position l, and the element originally at the final position is moved to position l. From there on, the sequence of positions for π again follows the sequence for σ, and all positions will have been visited before getting back to the initial position, as required.

As for the equal probability of the permutations, it suffices to observe that the modified algorithm involves (n-1)! distinct possible sequences of random numbers produced, each of which clearly produces a different permutation, and each of which occurs--assuming the random number source is unbiased--with equal probability. The (n-1)! different permutations so produced precisely exhaust the set of cycles of length n: each such cycle has a unique cycle notation with the value n in the final position, which allows for (n-1)! permutations of the remaining values to fill the other positions of the cycle notation

Thanks to Mathieu Guay-Paquet, Leah Hanson, Rudi Chen, Kamal Marhubi, Michael Robert Arntzenius, Heath Borders, Shreevatsa R, @[email protected], and David Turner for comments/corrections/discussion.


  1. a[0] is placed on the first iteration of the loop. Assuming randrange generates integers with uniform probability in the appropriate range, the original a[0] has 1/n probability of being swapped with any element (including itself), so the resultant a[0] has a 1/n chance of being any element from the original a, which is what we want.

    a[1] is placed on the second iteration of the loop. At this point, a[0] is some element from the array before it was mutated. Let's call the unmutated array original. a[0] is original[k], for some k. For any particular value of k, it contains original[k] with probability 1/n. We then swap a[1] with some element from the range [1, n-1].

    If we want to figure out the probability that a[1] is some particular element from original, we might think of this as follows: a[0] is original[k_0] for some k_0. a[1] then becomes original[k_1] for some k_1 where k_1 != k_0. Since k_0 was chosen uniformly at random, if we integrate over all k_0, k_1 is also uniformly random.

    Another way to look at this is that it's arbitrary that we place a[0] and choose k_0 before we place a[1] and choose k_1. We could just have easily placed a[1] and chosen k_1 first so, over all possible choices, the choice of k_0 cannot bias the choice of k_1.


Terminal latency

2017-07-18 08:00:00

There’s a great MSR demo from 2012 that shows the effect of latency on the experience of using a tablet. If you don’t want to watch the three minute video, they basically created a device which could simulate arbitrary latencies down to a fraction of a millisecond. At 100ms (1/10th of a second), which is typical of consumer tablets, the experience is terrible. At 10ms (1/100th of a second), the latency is noticeable, but the experience is ok, and at < 1ms the experience is great, as good as pen and paper. If you want to see a mini version of this for yourself, you can try a random Android tablet with a stylus vs. the current generation iPad Pro with the Apple stylus. The Apple device has well above 10ms end-to-end latency, but the difference is still quite dramatic -- it’s enough that I’ll actually use the new iPad Pro to take notes or draw diagrams, whereas I find Android tablets unbearable as a pen-and-paper replacement.

You can also see something similar if you try VR headsets with different latencies. 20ms feels fine, 50ms feels laggy, and 150ms feels unbearable.

Curiously, I rarely hear complaints about keyboard and mouse input being slow. One reason might be that keyboard and mouse input are quick and that inputs are reflected nearly instantaneously, but I don’t think that’s true. People often tell me that’s true, but I think it’s just the opposite. The idea that computers respond quickly to input, so quickly that humans can’t notice the latency, is the most common performance-related fallacy I hear from professional programmers.

When people measure actual end-to-end latency for games on normal computer setups, they usually find latencies in the 100ms range.

If we look at Robert Menzel’s breakdown of the end-to-end pipeline for a game, it’s not hard to see why we expect to see 100+ ms of latency:

  • ~2 msec (mouse)
  • 8 msec (average time we wait for the input to be processed by the game)
  • 16.6 (game simulation)
  • 16.6 (rendering code)
  • 16.6 (GPU is rendering the previous frame, current frame is cached)
  • 16.6 (GPU rendering)
  • 8 (average for missing the vsync)
  • 16.6 (frame caching inside of the display)
  • 16.6 (redrawing the frame)
  • 5 (pixel switching)

Note that this assumes a gaming mouse and a pretty decent LCD; it’s common to see substantially slower latency for the mouse and for pixel switching.

It’s possible to tune things to get into the 40ms range, but the vast majority of users don’t do that kind of tuning, and even if they do, that’s still quite far from the 10ms to 20ms range, where tablets and VR start to feel really “right”.

Keypress-to-display measurements are mostly done in games because gamers care more about latency than most people, but I don’t think that most applications are all that different from games in terms of latency. While games often do much more work per frame than “typical” applications, they’re also much better optimized than “typical” applications. Menzel budgets 33ms to the game, half for game logic and half for rendering. How much time do non-game applications take? Pavel Fatin measured this for text editors and found latencies ranging from a few milliseconds to hundreds of milliseconds; he did this with an app he wrote that uses java.awt.Robot to generate keypresses and do screen captures, which we can also use to measure the latency of other applications.

Personally, I’d like to see the latency of different terminals and shells for a couple of reasons. First, I spend most of my time in a terminal and usually do editing in a terminal, so the latency I see is at least the latency of the terminal. Second, the most common terminal benchmark I see cited (by at least two orders of magnitude) is the rate at which a terminal can display output, often measured by running cat on a large file. This is pretty much as useless a benchmark as I can think of. I can’t recall the last task I did which was limited by the speed at which I can cat a file to stdout on my terminal (well, unless I’m using eshell in emacs), nor can I think of any task for which that sub-measurement is useful. The closest thing that I care about is the speed at which I can ^C a command when I’ve accidentally output too much to stdout, but as we’ll see when we look at actual measurements, a terminal’s ability to absorb a lot of input to stdout is only weakly related to its responsiveness to ^C. The speed at which I can scroll up or down an entire page sounds related, but in actual measurements the two are not highly correlated (e.g., emacs-eshell is quick at scrolling but extremely slow at sinking stdout). Another thing I care about is latency, but knowing that a particular terminal has high stdout throughput tells me little to nothing about its latency.

Let’s look at some different terminals to see if any terminals add enough latency that we’d expect the difference to be noticeable. If we measure the latency from keypress to internal screen capture on my laptop, we see the following latencies for different terminals

Plots of terminal latency distributions (idle on the left, loaded on the right)

These graphs show the distribution of latencies for each terminal. The y-axis has the latency in milliseconds. The x-axis is the percentile (e.g., 50 represents the 50%-ile keypress, i.e., the median keypress). Measurements are with macOS unless otherwise stated. The graph on the left is when the machine is idle, and the graph on the right is under load. If we just look at median latencies, some setups don’t look too bad -- terminal.app and emacs-eshell are at roughly 5ms unloaded, small enough that many people wouldn’t notice. But most terminals (st, alacritty, hyper, and iterm2) are in the range where you might expect people to notice the additional latency even when the machine is idle. If we look at the tail when the machine is idle, say the 99.9%-ile latency, every terminal gets into the range where the additional latency ought to be perceptible, according to studies on user interaction. For reference, the internally generated keypress to GPU memory trip for some terminals is slower than the time it takes to send a packet from Boston to Seattle and back, about 70ms.

All measurements were done with input only happening on one terminal at a time, with full battery and running off of A/C power. The loaded measurements were done while compiling Rust (as before, with full battery and running off of A/C power, and in order to make the measurements reproducible, each measurement started 15s after a clean build of Rust after downloading all dependencies, with enough time between runs to avoid thermal throttling interference across runs).

If we look at median loaded latencies, other than emacs-term, most terminals don’t do much worse than at idle. But as we look at tail measurements, like 90%-ile or 99.9%-ile measurements, every terminal gets much slower. Switching between macOS and Linux makes some difference, but the difference is different for different terminals.

These measurements aren't anywhere near the worst case (if we run off of battery when the battery is low, and wait 10 minutes into the compile in order to exacerbate thermal throttling, it’s easy to see latencies that are multiple hundreds of ms) but even so, every terminal has tail latency that should be observable. Also, recall that this is only a fraction of the total end-to-end latency.

Why don’t people complain about keyboard-to-display latency the way they complain about stylus-to-display latency or VR latency? My theory is that, for both VR and tablets, people have a lot of experience with a much lower latency application. For tablets, the “application” is pen-and-paper, and for VR, the “application” is turning your head without a VR headset on. But input-to-display latency is so bad for every application that most people just expect terrible latency.

An alternate theory might be that keyboard and mouse input are fundamentally different from tablet input in a way that makes latency less noticeable. Even without any data, I’d find that implausible because, when I access a remote terminal in a way that adds tens of milliseconds of extra latency, I find typing to be noticeably laggy. And it turns out that when extra latency is A/B tested, people can and do notice latency in the range we’re discussing here.

Just so we can compare the most commonly used benchmark (throughput of stdout) to latency, let’s measure how quickly different terminals can sink input on stdout:

terminal           stdout (MB/s)   idle50 (ms)   load50 (ms)   idle99.9 (ms)   load99.9 (ms)   mem (MB)   ^C
alacritty          39              31            28            36              56              18         ok
terminal.app       20              6             13            25              30              45         ok
st                 14              25            27            63              111             2          ok
alacritty tmux     14
terminal.app tmux  13
iterm2             11              44            45            60              81              24         ok
hyper              11              32            31            49              53              178        fail
emacs-eshell       0.05            5             13            17              32              30         fail
emacs-term         0.03            13            30            28              49              30         ok

The relationship between the rate that a terminal can sink stdout and its latency is non-obvious. For that matter, the relationship between the rate at which a terminal can sink stdout and how fast it looks is non-obvious. During this test, terminal.app looked very slow. The text that scrolls by jumps a lot, as if the screen is rarely updating. Also, hyper and emacs-term both had problems with this test. Emacs-term can’t really keep up with the output and it takes a few seconds for the display to finish updating after the test is complete (the status bar that shows how many lines have been output appears to be up to date, so it finishes incrementing before the test finishes). Hyper falls further behind and pretty much doesn’t update the screen after flickering a couple of times. The Hyper Helper process gets pegged at 100% CPU for about two minutes and the terminal is totally unresponsive for that entire time.

Alacritty was tested with tmux because alacritty doesn’t support scrolling back up, and the docs indicate that you should use tmux if you want to be able to scroll up. Just to have another reference, terminal.app was also tested with tmux. For most terminals, tmux doesn’t appear to reduce stdout speed, but alacritty and terminal.app are fast enough that they’re actually limited by the speed of tmux.

Emacs-eshell is technically not a terminal, but I also tested eshell because it can be used as a terminal alternative for some use cases. Emacs, with both eshell and term, is actually slow enough that I care about the speed at which it can sink stdout. When I’ve used eshell or term in the past, I find that I sometimes have to wait for a few thousand lines of text to scroll by if I run a command with verbose logging to stdout or stderr. Since that happens very rarely, it’s not really a big deal to me unless it’s so slow that I end up waiting half a second or a second when it happens, and no other terminal is slow enough for that to matter.

Conversely, I type individual characters often enough that I’ll notice tail latency. Say I type at 120wpm and that results in 600 characters per minute, or 10 characters per second of input. Then I’d expect to see the 99.9% tail (1 in 1000) every 100 seconds!

Anyway, the cat “benchmark” that I care about more is whether or not I can ^C a process when I’ve accidentally run a command that outputs millions of lines to the screen instead of thousands of lines. For that benchmark, every terminal is fine except for hyper and emacs-eshell, both of which hung for at least ten minutes (I killed each process after ten minutes, rather than waiting for the terminal to catch up).

Memory usage at startup is also included in the table for reference because that's the other measurement I see people benchmark terminals with. While I think that it's a bit absurd that terminals can use 40MB at startup, even the three-year-old hand-me-down laptop I'm using has 16GB of RAM, so squeezing that 40MB down to 2MB doesn't have any appreciable effect on user experience. Heck, even the $300 chromebook we recently got has 16GB of RAM.

Conclusion

Most terminals have enough latency that the user experience could be improved if the terminals concentrated more on latency and less on other features or other aspects of performance. However, when I search for terminal benchmarks, I find that terminal authors, if they benchmark anything, benchmark the speed of sinking stdout or memory usage at startup. This is unfortunate because most “low performance” terminals can already sink stdout many orders of magnitude faster than humans can keep up with, so further optimizing stdout throughput has a relatively small impact on actual user experience for most users. Likewise for reducing memory usage when an idle terminal uses 0.01% of the memory on my old and now quite low-end laptop.

If you work on a terminal, perhaps consider spending relatively more effort on latency and interactivity (e.g., responsiveness to ^C) and relatively less on throughput and idle memory usage.

Update: In response to this post, the author of alacritty explains where alacritty's latency comes from and describes how alacritty could reduce its latency

Appendix: negative results

Tmux and latency: I tried tmux and various terminals and found that the differences were within the range of measurement noise.

Shells and latency: I tried a number of shells and found that, even in the quickest terminal, the difference between shells was within the range of measurement noise. Powershell was somewhat problematic to test with the setup I was using because it doesn’t handle colors correctly (the first character typed shows up with the color specified by the terminal, but other characters are yellow regardless of setting, which appears to be an open issue), which confused the image recognition setup I used. Powershell also doesn’t consistently put the cursor where it should be -- it jumps around randomly within a line, which also confused the image recognition setup I used. However, despite its other problems, powershell had comparable performance to other shells.

Shells and stdout throughput: As above, the speed difference between different shells was within the range of measurement noise.

Single-line vs. multiline text and throughput: Although some text editors bog down with extremely long lines, throughput was similar when I shoved a large file into a terminal whether the file was all one line or was line broken every 80 characters.

Head of line blocking / coordinated omission: I ran these tests with input at a rate of 10.3 characters per second. But it turns out this doesn't matter much at input rates that humans are capable of, and the latencies are quite similar to doing input once every 10.3 seconds. It's possible to overwhelm a terminal, and hyper is the first to start falling over at high input rates, but the speed necessary to make the tail latency worse is beyond the rate at which any human I know of can type.

Appendix: experimental setup

All tests were done on a dual core 2.6GHz 13” Mid-2014 Macbook pro. The machine has 16GB of RAM and a 2560x1600 screen. The OS X version was 10.12.5. Some tests were done in Linux (Lubuntu 16.04) to get a comparison between macOS and Linux. 10k keypresses were used for each latency measurement.

Latency measurements were done with the . key and throughput was done with default base32 output, which is all plain ASCII text. George King notes that different kinds of text can change output speed:

I’ve noticed that Terminal.app slows dramatically when outputting non-latin unicode ranges. I’m aware of three things that might cause this: having to load different font pages, and having to parse code points outside of the BMP, and wide characters.

The first probably boils down to a very complicated mix of lazy loading of font glyphs, font fallback calculations, and caching of the glyph pages or however that works.

The second is a bit speculative, but I would bet that Terminal.app uses Cocoa’s UTF16-based NSString, which almost certainly hits a slow path when code points are above the BMP due to surrogate pairs.

Terminals were fullscreened before running tests. This affects test results, and resizing the terminal windows can and does significantly change performance (e.g., it’s possible to get hyper to be slower than iterm2 by changing the window size while holding everything else constant). st on macOS was running as an X client under XQuartz. To see if XQuartz is inherently slow, I tried runes, another "native" Linux terminal that uses XQuartz; runes had much better tail latency than st and iterm2.

The “idle” latency tests were done on a freshly rebooted machine. All terminals were running, but input was only fed to one terminal at a time.

The “loaded” latency tests were done with rust compiling in the background, 15s after the compilation started.

Terminal bandwidth tests were done by creating a large, pseudo-random, text file with

timeout 64 sh -c 'cat /dev/urandom | base32 > junk.txt'

and then running

timeout 8 sh -c 'cat junk.txt | tee junk.term_name'
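Since tee can only write to the file as fast as the terminal drains its copy of the output (modulo pipe and tty buffering), the terminal's throughput can be roughly estimated by dividing the size of junk.term_name by the 8 second window. A minimal sketch of that arithmetic in Python, assuming the file name from the command above:

import os

# junk.term_name holds whatever the terminal consumed during the 8 second
# window, so size / 8 approximates sustained throughput in MB/s.
size_bytes = os.path.getsize("junk.term_name")
print(f"{size_bytes / 8 / 1e6:.2f} MB/s")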

Terminator and urxvt weren’t tested because they weren’t completely trivial to install on mac and I didn’t want to futz around to make them work. Terminator was easy to build from source, but it hung on startup and didn’t get to a shell prompt. Urxvt installed through brew, but one of its dependencies (also installed through brew) was the wrong version, which prevented it from starting.

Thanks to Kamal Marhubi, Leah Hanson, Wesley Aptekar-Cassels, David Albert, Vaibhav Sagar, Indradhanush Gupta, Rudi Chen, Laura Lindzey, Ahmad Jarara, George King, Tim Dierks, Nikith Naide, Veit Heller, and Nick Bergson-Shilcock for comments/corrections/discussion.

The widely cited studies on mouse vs. keyboard efficiency are completely bogus

2017-06-13 08:00:00

Which is faster, keyboard or mouse? A large number of programmers believe that the keyboard is faster for all (programming-related) tasks. However, there are a few widely cited webpages on AskTog which claim that Apple studies show that using the mouse is faster than using the keyboard for everything and that people who think that using the keyboard is faster are just deluding themselves. This might sound extreme, but, just for example, one page says that the author has “never seen [the keyboard] outperform the mouse”.

But it can’t be the case that the mouse is faster for everything — almost no one is faster at clicking on an on-screen keyboard with a mouse than typing at a physical keyboard. Conversely, there are tasks for which mice are much better suited than keyboards (e.g., aiming in FPS games). For someone without an agenda, the question shouldn’t be, which is faster at all tasks, but which tasks are faster with a keyboard, which are faster with a mouse, and which are faster when both are used?

You might ask if any of this matters. It depends! One of the best programmers I know is a hunt-and-peck typist, so it's clearly possible to be a great programmer without having particularly quick input speed. But I'm in the middle of an easy data munging task where I'm limited by the speed at which I can type in a large amount of boring code. If I were quicker, this task would be quicker, and there are tasks I don't do now that I might do if I were faster. I can type at > 100 wpm, which isn't bad, but I can talk at > 400 wpm and I can think much faster than I can talk. I'm often rate limited even when talking; typing is much worse, and the half-a-second here and one-second there I spend on navigation certainly doesn't help. When I first got started in tech, I had a mundane test/verification/QA role where my primary job was to triage test failures. Even before I started automating tasks, I could triage nearly twice as many bugs per day as other folks in the same role because I took being efficient at basic navigation tasks seriously. Nowadays, my jobs aren't 90% rote anymore, but my guess is that about a third of the time I spend in front of a computer is spent on mindless tasks that are rate-limited by my input and navigation speed. If I could get faster at those mundane tasks and have to spend less time on them and more time doing things that are fun, that would be great.

Anyway, to start, let’s look at the cited studies to see where the mouse is really faster. Most references on the web, when followed all the way back, point to AskTog, a site by Bruce Tognazzini, who describes himself as a "recognized leader in human/computer interaction design".

The most cited AskTog page on the topic claims that they've spent $50M on R&D and done all kinds of studies; the page claims that, among other things, the $50M in R&D showed “Test subjects consistently report that keyboarding is faster than mousing” and “The stopwatch consistently proves mousing is faster than keyboarding.” The claim is that this both proves that the mouse is faster than the keyboard, and explains why programmers think the keyboard is faster than the mouse even though it’s slower. However, the result is unreproducible: “Tog” not only doesn’t cite the details of the experiments, he doesn’t even describe the experiments, and just makes a blanket claim.

The second widely cited AskTog page is in response to a response to the previous page, and it simply repeats that the first page showed that keyboard shortcuts are slower. While there’s a lot of sarcasm, like “Perhaps we have all been misled these years. Perhaps the independent studies that show over and over again that Macintosh users are more productive, can learn quicker, buy more software packages, etc., etc., etc., are somehow all flawed. Perhaps....” no actual results are cited, as before. There is, however, a pseudo-scientific explanation of why the mouse is faster than the keyboard:

Command Keys Aren’t Faster. As you know from my August column, it takes just as long to decide upon a command key as it does to access the mouse. The difference is that the command-key decision is a high-level cognitive function of which there is no long-term memory generated. Therefore, subjectively, keys seem faster when in fact they usually take just as long to use.

Since mouse acquisition is a low-level cognitive function, the user need not abandon cognitive process on the primary task during the acquisition period. Therefore, the mouse acquirer achieves greater productivity.

One question this raises is, why should typing on the keyboard be any different from using command keys? There certainly are people who aren’t fluent at touch typing who have to think about which key they’re going to press when they type. Those people are very slow typists, perhaps even slower than someone who’s quick at using the mouse to type via an on screen keyboard. But there are also people who are fluent with the keyboard and can type without consciously thinking about which keys they’re going to press. The implicit claim here is that it’s not possible to be fluent with command keys in the same way it’s possible to be fluent with the keyboard for typing. It’s possible that’s true, but I find the claim to be highly implausible, both in principle, and from having observed people who certainly seem to be fluent with command keys, and the claim has no supporting evidence.

The third widely cited AskTog page cites a single experiment, where the author typed a paragraph and then had to replace every “e” with a “|”, either using cursor keys or the mouse. The author found that the average time for using cursor keys was 99.43 seconds and the average time for the mouse was 50.22 seconds. No information about the length of the paragraph or the number of “e”s was given. The third page was in response to a user who cited specific editing examples where they found that they were faster with a keyboard than with a mouse.

My experience with benchmarking is that the vast majority of microbenchmarks have wrong or misleading results because they’re difficult to set up properly, and even when set up properly, understanding how the microbenchmark results relate to real-world results requires a deep understanding of the domain. As a result, I’m deeply skeptical of broad claims that come from microbenchmarks unless the author has a demonstrated, deep, understanding of benchmarking their particular domain, and even then I’ll ask why they believe their result generalizes. The opinion that microbenchmarks are very difficult to interpret properly is widely shared among people who understand benchmarking.

The e -> | replacement task described is not only a microbenchmark, it's a bizarrely artificial microbenchmark.

Based on the times given in the result, the task was either for very naive users, or disallowed any kind of search and replace functionality. This particular AskTog column is in response to a programmer who mentioned editing tasks, so the microbenchmark is meaningless unless that programmer is trapped in an experiment where they’re not allowed to use their editor’s basic functionality. Moreover, the replacement task itself is unrealistic — how often do people replace e with |?

I timed this task with the bizarre no-search-and-replace restriction removed and got the following results:

  • Keyboard shortcut: 1.26s
  • M-x, “replace-string” (instead of using mapped keyboard shortcut): 2.8s
  • Navigate to search and replace with mouse: 5.39s

The first result was from using a keyboard shortcut. The second result is something I might do if I were in someone else’s emacs setup, which has different keyboard shortcuts mapped; emacs lets you run a command by hitting “M-x” and typing the entire name of the command. That’s much slower than using a keyboard shortcut directly, but still faster than using the mouse (at least for me, here). Does this mean that keyboards are great and mice are terrible? No, the result is nearly totally meaningless because I spend almost none of my time doing single-character search-and-replace, making the speed of single-character search-and-replace irrelevant.

Also, since I’m used to using the keyboard, the mouse speed here is probably unusually slow. That’s doubly true here because my normal editor setup (emacs -nw) doesn’t allow for mouse usage, so I ended up using an unfamiliar editor, TextEdit, for the mouse test. I did each task once in order to avoid “practicing” the exact task, which could unrealistically make the keyboard-shortcut version nearly instantaneous because it’s easy to hit a practiced sequence of keys very quickly. However, this meant that I was using an unfamiliar mouse in an unfamiliar set of menus for the mouse test. Furthermore, like many people who’ve played video games in the distant past, I’m used to having “mouse acceleration” turned off, but the Mac has this on by default and I didn’t go through the rigmarole necessary to disable mouse acceleration. Additionally, the recording program I used (quicktime) made the entire machine laggy, which probably affects mousing speed more than keyboard speed, and the menu setup for the program I happened to use forced me to navigate through two levels of menus.

That being said, despite not being used to the mouse, if I want to find a microbenchmark where I’m faster with the mouse than with the keyboard, that’s easy: let me try selecting a block of text that’s on the screen but not near my cursor:

  • Keyboard: 1.8s
  • Mouse: 0.7s

I tend to do selection of blocks in emacs by searching for something at the start of the block, setting a mark, and then searching for something at the end of the block. I typically type three characters to make sure that I get a unique chunk of text (and I’ll type more if it’s text where I don’t think three characters will cut it). This makes the selection task somewhat slower than the replacement task because the replacement task used single characters and this task used multiple characters.

The mouse is so much better suited for selecting a block of text that even with an unfamiliar mouse setup where I end up having to make a correction instead of being able to do the selection in one motion, the mouse is over twice as fast. But, if I wanted select something that was off screen and the selection was so large that it wouldn’t fit on one screen, the keyboard time wouldn’t change and the mouse time would get much slower, making the keyboard faster.

In addition to doing the measurements, I also (informally) polled people to ask if they thought the keyboard or the mouse would be faster for specific tasks. Both search-and-replace and select-text are tasks where the result was obvious to most people. But not all tasks are obvious; scrolling was one where people didn’t have strong opinions one way or another. Let’s look at scrolling, which is a task both the keyboard and the mouse are well suited for. To have something concrete, let’s look at scrolling down 4 pages:

  • Keyboard: 0.49s
  • Mouse: 0.57s

There’s some difference, and I suspect that if I repeated the experiment enough times I could get a statistically significant result, but the difference is small enough that it isn’t of practical significance.

Contra Tog’s result, which was that everyone believes the keyboard is faster even though the mouse is faster, I find that people are pretty good at estimating which device is faster for which tasks and also at estimating when both devices will give a similar result. One possible reason is that I’m polling programmers, and in particular, programmers at RC, who are probably a different population than whoever Tog might’ve studied. He was in a group that was looking at how to design the UI for a general purpose computer in the 80s, where it would actually have been unreasonable to focus on studying people like the ones I polled, many of whom grew up using computers and then chose a career where they use computers all day. The equivalent population would’ve had to start using computers in the 60s or even earlier, but even if they had, input devices were quite different (the ball mouse wasn’t invented until 1972, and it certainly wasn’t in wide use the moment it was invented). There’s nothing wrong with studying populations who aren’t relatively expert at using computer input devices, but there is something wrong with generalizing those results to people who are relatively expert.

Unlike claims by either keyboard or mouse advocates, when I do experiments myself, the results are mixed. Some tasks are substantially faster if I use the keyboard and some are substantially faster if I use the mouse. Moreover, most of the results are easily predictable (when the results are similar, the prediction is that it would be hard to predict). If we look at the most widely cited, authoritative, results on the web, we find that they make very strong claims that the mouse is much faster than the keyboard but back up the claim with nothing but a single, bogus, experiment. It’s possible that some of the vaunted $50M in R&D went into valid experiments, but those experiments, if they exist, aren’t cited.

I spent some time reviewing the literature on the subject, but couldn’t find anything conclusive. Rather than do a point-by-point summary of each study (like I did here for another controversial topic), I’ll mention the high-level issues that make the studies irrelevant to me. All studies I could find had at least one of the issues listed below; if you have a link to a study that isn’t irrelevant for one of the following reasons, I’d love to hear about it!

  1. Age of study: it’s unclear how a study on interacting with computers from the mid-80s transfers to how people interact with computers today. Even ignoring differences in editing programs, there are large differences in the interface. Mice are more precise and a decent modern optical mouse can be moved as fast as a human can move it without the tracking becoming erratic, something that isn’t true of any mouse I’ve tried from the 80s and was only true of high quality mice from the 90s when the balls were recently cleaned and the mouse was on a decent quality mousepad. Keyboards haven’t improved as much, but even so, I can type substantially faster on a modern, low-travel, keyboard than on any keyboard I’ve tried from the 80s.
  2. Narrow microbenchmarking: not all of these are as irrelevant as the e -> | without search and replace task, but even in the case of tasks that aren’t obviously irrelevant, it’s not clear what the impact of the result is on actual work I might do.
  3. Not keyboard vs. mouse: a tiny fraction of published studies are on keyboard vs. mouse interaction. When a study is on device interaction, it’s often about some new kind of device or a new interaction model.
  4. Vague description: a lot of studies will say something like they found a 7.8% improvement, with results being significant with p < 0.005, without providing enough information to tell if the results are actually significant or merely statistically significant (recall that the practically insignificant scrolling result was a 0.08s difference, which could also be reported as a 16.3% improvement).
  5. Unskilled users: in one, typical, paper, they note that it can take users as long as two seconds to move the mouse from one side of the screen to a scrollbar on the other side of the screen. While there’s something to be said for doing studies on unskilled users in order to figure out what sorts of interfaces are easiest for users who have the hardest time, a study on users who take 2 seconds to get their mouse onto the scrollbar doesn’t appear to be relevant to my user experience. When I timed this for myself, it took 0.21s to get to the scrollbar from the other side of the screen and scroll a short distance, despite using an unfamiliar mouse with different sensitivity than I’m used to and running a recording program which made mousing more difficult than usual.
  6. Seemingly unreasonable results: some studies claim to show large improvements in overall productivity when switching from one type of device to another (e.g., a 20% total productivity gain from switching types of mice).

Conclusion

It’s entirely possible that the mysterious studies Tog’s org spent $50M on prove that the mouse is faster than the keyboard for all tasks other than raw text input, but there doesn’t appear to be enough information to tell what the actual studies were. There are many public studies on user input, but I couldn’t find any that are relevant to whether or not I should use the mouse more or less at the margin.

When I look at various tasks myself, the results are mixed, and they’re mixed in the way that most programmers I polled predicted. This result is so boring that it would barely be worth mentioning if not for the large groups of people who believe that either the keyboard is always faster than the mouse or vice versa.

Please let me know if there are relevant studies on this topic that I should read! I’m not familiar with the relevant fields, so it’s possible that I’m searching with the wrong keywords and reading the wrong papers.

Appendix: note to self

I didn't realize that scrolling was so fast relative to searching (not explicitly mentioned in the blog post, but 1/2 of the text selection task). I tend to use search to scroll to things that are offscreen, but it appears that I should consider scrolling instead when I don't want to drop my cursor in a specific position.

Thanks to Leah Hanson, Quentin Pradet, Alex Wilson, and Gaxun for comments/corrections on this post and to Annie Cherkaev, Chris Ball, Stefan Lesser, and David Isaac Lee for related discussion.

Startup options v. cash

2017-06-07 08:00:00

I often talk to startups that claim that their compensation package has a higher expected value than the equivalent package at a place like Facebook, Google, Twitter, or Snapchat. One thing I don’t understand about this claim is, if the claim is true, why shouldn’t the startup go to an investor, sell their options for what they claim their options to be worth, and then pay me in cash? The non-obvious value of options combined with their volatility is a barrier for recruiting.

Additionally, given my risk function and the risk function of VCs, this appears to be a better deal for everyone. Like most people, extra income gives me diminishing utility, but VCs have an arguably nearly linear utility in income. Moreover, even if VCs shared my risk function, because VCs hold a diversified portfolio of investments, the same options would be worth more to them than they are to me because they can diversify away downside risk much more effectively than I can. If these startups are making a true claim about the value of their options, there should be a trade here that makes all parties better off.

In a classic series of essays written a decade ago, seemingly aimed at convincing people to either found or join startups, Paul Graham stated "If you wanted to get rich, how would you do it? I think your best bet would be to start or join a startup. That's been a reliable way to get rich for hundreds of years" and "Risk and reward are always proportionate." This risk-reward assertion is used to back the claim that people can make more money, in expectation, by joining startups and taking risky equity packages than they can by taking jobs that pay cash or cash plus public equity. However, the premise — that risk and reward are always proportionate — isn’t true in the general case. It's basic finance 101 that only assets whose risk cannot be diversified away carry a risk premium (on average). Since VCs can and do diversify risk away, there’s no reason to believe that an individual employee who “invests” in startup options by working at a startup is getting a deal because of the risk involved. And by the way, when you look at historical returns, VC funds don’t appear to outperform other investment classes even though they get to buy a kind of startup equity that has less downside risk than the options you get as a normal employee.

So how come startups can’t or won’t take on more investment and pay their employees in cash? Let’s start by looking at some cynical reasons, followed by some less cynical reasons.

Cynical reasons

One possible answer, perhaps the simplest possible answer, is that options aren’t worth what startups claim they’re worth and startups prefer options because their lack of value is less obvious than it would be with cash. A simplistic argument that this might be the case is, if you look at the amount investors pay for a fraction of an early-stage or mid-stage startup and look at the extra cash the company would have been able to raise if they gave their employee option pool to investors, it usually isn’t enough to pay employees competitive compensation packages. Given that VCs don’t, on average, have outsized returns, this seems to imply that employee options aren’t worth as much as startups often claim. Compensation is much cheaper if you can convince people to take an arbitrary number of lottery tickets in a lottery of unknown value instead of cash.

Some common ways that employee options are misrepresented are:

Strike price as value

A company that gives you 1M options with a strike price of $10 might claim that those are “worth” $10M. However, if the share price stays at $10 for the lifetime of the option, the options will end up being worth $0 because an option with a $10 strike price is an option to buy the stock at $10, which is not the same as a grant of actual shares worth $10 a piece.
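To make the difference concrete, an option's value at exercise is the spread between the share price and the strike price, not the strike price itself. A minimal sketch using the numbers above:

def option_value_at_exercise(share_price, strike, num_options):
    # An option is only worth the amount by which the share price exceeds
    # the strike; at or below the strike, it's worth nothing.
    return max(share_price - strike, 0) * num_options

print(option_value_at_exercise(10, 10, 1_000_000))  # $0 if the stock stays at $10
print(option_value_at_exercise(15, 10, 1_000_000))  # $5M only if the stock climbs to $15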

Public valuation as value

Let’s say a company raised $300M by selling 30% of the company, giving the company an implied valuation of $1B. The most common misrepresentation I see is that the company will claim that because they’re giving an option for, say, 0.1% of the company, your option is worth $1B * 0.001 = $1M. A related, common, misrepresentation is that the company raised money last year and has increased in value since then, e.g., the company has since doubled in value, so your option is worth $2M. Even if you assume the strike price was $0 and go with the last valuation at which the company raised money, the implied value of your option isn’t $1M because investors buy a different class of stock than you get as an employee.

There are a lot of differences between the preferred stock that VCs get and the common stock that employees get; let’s look at a couple of concrete scenarios.

Let’s say those investors that paid $300M for 30% of the company have a straight (1x) liquidation preference, and the company sells for $500M. The 1x liquidation preference means that the investors will get 1x of their investment back before lowly common stock holders get anything, so the investors will get $300M for their 30% of the company. The other 70% of equity will split $200M: your 0.1% common stock option with a $0 strike price is worth $285k (instead of the $500k you might expect it to be worth if you multiply $500M by 0.001).

The preferred stock VCs get usually has at least a 1x liquidation preference. Let’s say the investors had a 2x liquidation preference in the above scenario. They would get 2x their investment back before the common stockholders split the rest of the company. Since 2 * $300M is greater than $500M, the investors would get everything and the remaining equity holders would get $0.
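Here's a sketch of the arithmetic behind those two scenarios for a 0.1% common holder with a $0 strike price. It ignores details like participation and the investors' option to convert to common, but it reproduces the numbers above:

def common_payout(sale_price, invested, preference_multiple, investor_pct, my_pct):
    # Investors take their liquidation preference (capped at the sale price)
    # off the top; common holders split the remainder in proportion to their
    # share of the non-investor equity.
    preference = min(preference_multiple * invested, sale_price)
    remaining = sale_price - preference
    return remaining * my_pct / (1 - investor_pct)

print(common_payout(500e6, 300e6, 1, 0.30, 0.001))  # ~$285k with a 1x preference
print(common_payout(500e6, 300e6, 2, 0.30, 0.001))  # $0 with a 2x preference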

Another difference between your common stock and preferred stock is that preferred stock sometimes comes with an anti-dilution clause, which you have no chance of getting as a normal engineering hire. Let’s look at an actual example of dilution at a real company. Mayhar got 0.4% of a company when it was valued at $5M. By the time the company was worth $1B, Mayhar’s share of the company was diluted by 8x, which made his share of the company worth less than $500k (minus the cost of exercising his options) instead of $4M (minus the cost of exercising his options).
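A quick sanity check on those numbers, as a sketch using only the figures above (and ignoring the exercise cost and taxes):

stake_at_grant = 0.004   # 0.4% of the company at the $5M valuation
dilution = 8             # 8x dilution by the time the company was worth $1B
company_value = 1e9

print(stake_at_grant / dilution * company_value)  # $500k after dilution
print(stake_at_grant * company_value)             # $4M with no dilution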

This story has a few additional complications which illustrate other reasons options are often worth less than they seem. Mayhar couldn’t afford to exercise his options (by paying the strike price times the number of shares he had an option for) when he joined, which is common for people who take startup jobs out of college who don’t come from wealthy families. When he left four years later, he could afford to pay the cost of exercising the options, but due to a quirk of U.S. tax law, he either couldn’t afford the tax bill or didn’t want to pay that cost for what was still a lottery ticket — when you exercise your options, you’re effectively taxed on the difference between the current valuation and the strike price. Even if the company has a successful IPO for 10x as much in a few years, you’re still liable for the tax bill the year you exercise (and if the company stays private indefinitely or fails, you get nothing but a future tax deduction). Because, like most options, Mayhar’s option has a 90-day exercise window, he didn’t get anything from his options.

While that’s more than the average amount of dilution, there are much worse cases, for example, cases where investors and senior management basically get to keep their equity and everyone else gets diluted to the point where their equity is worthless.

Those are just a few of the many ways in which the differences between preferred and common stock can cause the value of options to be wildly different from a value naively calculated from a public valuation. I often see both companies and employees use public preferred stock valuations as a benchmark in order to precisely value common stock options, but this isn’t possible, even in principle, without access to a company’s cap table (which shows how much of the company different investors own) as well as access to the specific details of each investment. Even if you can get that (which you usually can’t), determining the appropriate numbers to plug into a model that will give you the expected value is non-trivial because it requires answering questions like “what’s the probability that, in an acquisition, upper management will collude with investors to keep everything and leave the employees with nothing?”

Black-Scholes valuation as value

Because of the issues listed above, people will sometimes try to use a model to estimate the value of options. Black-Scholes is the model most commonly used for this because it’s well known and has an easy-to-use closed-form solution. Unfortunately, most of the major assumptions behind Black-Scholes are false for startup options, making the relationship between the Black-Scholes output and the actual value of your options non-obvious.
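For reference, here's that closed-form solution for a European call, as a minimal sketch. The inputs it wants (a known volatility, a liquid and tradeable underlying, the freedom to exercise and sell) are exactly the things you don't have with startup options, and the volatility number in particular is a guess:

from math import exp, log, sqrt
from statistics import NormalDist

def black_scholes_call(spot, strike, years, rate, vol):
    # Standard Black-Scholes price for a European call option.
    d1 = (log(spot / strike) + (rate + vol ** 2 / 2) * years) / (vol * sqrt(years))
    d2 = d1 - vol * sqrt(years)
    N = NormalDist().cdf
    return spot * N(d1) - strike * exp(-rate * years) * N(d2)

# Hypothetical grant: $10 fair market value, $10 strike, 4 years to expiry,
# 2% risk-free rate, 60% assumed volatility.
print(black_scholes_call(10, 10, 4, 0.02, 0.60))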

Options are often free to the company

A large fraction of options get returned to the employee option pool when employees leave, either voluntarily or involuntarily. I haven’t been able to find comprehensive numbers on this, but anecdotally, I hear that more than 50% of options end up getting taken back from employees and returned to the general pool. Dan McKinley points out an (unvetted) analysis that shows that only 5% of employee grants are exercised. Even with a conservative estimate, a 50% discount on options granted sounds pretty good. A 20x discount sounds amazing, and would explain why companies like options so much.

Present value of a future sum of money

When someone says that a startup’s compensation package is worth as much as Facebook’s, they often mean that the total value paid out over N years is similar. But a fixed nominal amount of money is worth more the sooner you get it because you can (at a minimum) invest it in a low-risk asset, like Treasury bonds, and get some return on the money.
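The discounting arithmetic is simple; here's a sketch, with the 3% rate standing in for whatever low-risk return you could otherwise get:

def present_value(amount, years, annual_rate=0.03):
    # A dollar N years from now is worth less than a dollar today because
    # you could invest today's dollar in a low-risk asset in the meantime.
    return amount / (1 + annual_rate) ** years

print(present_value(100_000, 8))  # $100k locked up for 8 years is worth ~$79k today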

That’s an abstract argument you’ll hear in an econ 101 class, but in practice, if you live somewhere with a relatively high cost of living, like SF or NYC, there’s an even greater value to getting paid sooner rather than later because it lets you live in a relatively nice place (however you define nice) without having to cram into a space with more roommates than would be considered reasonable elsewhere in the U.S. Many startups from the last two generations seem to be putting off their IPOs; for folks in those companies with contracts that prevent them from selling options on a secondary market, that could easily mean that the majority of their potential wealth is locked up for the first decade of their working life. Even if the startup’s compensation package is worth more when adjusting for inflation and interest, it’s not clear if that’s a great choice for most people who aren’t already moderately well off.

Non-cynical reasons

We’ve looked at some cynical reasons companies might want to offer options instead of cash, namely that they can claim that their options are worth more than they’re actually worth. Now, let’s look at some non-cynical reasons companies might want to give out stock options.

From an employee standpoint, one non-cynical reason might have been stock option backdating, at least until that loophole was mostly closed. Up until the early 2000s, many companies backdated the date of options grants. Let’s look at this example, explained by Jessie M. Fried:

Options covering 1.2 million shares were given to Reyes. The reported grant date was October 1, 2001, when the firm's stock was trading at around $13 per share, the lowest closing price for the year. A week later, the stock was trading at $20 per share, and a month later the stock closed at almost $26 per share.

Brocade disclosed this grant to investors in its 2002 proxy statement in a table titled "Option Grants in the Last Fiscal Year," prepared in the format specified by SEC rules. Among other things, the table describes the details of this and other grants to executives, including the number of shares covered by the option grants, the exercise price, and the options' expiration date. The information in this table is used by analysts, including those assembling Standard & Poor's well-known ExecuComp database, to calculate the Black Scholes value for each option grant on the date of grant. In calculating the value, the analysts assumed, based on the firm's representations about its procedure for setting exercise prices, that the options were granted at-the-money. The calculated value was then widely used by shareholders, researchers, and the media to estimate the CEO's total pay. The Black Scholes value calculated for Reyes' 1.2 million stock option grant, which analysts assumed was at-the-money, was $13.2 million.

However, the SEC has concluded that the option grant to Reyes was backdated, and the market price on the actual date of grant may have been around $26 per share. Let us assume that the stock was in fact trading at $26 per share when the options were actually granted. Thus, if Brocade had adhered to its policy of giving only at-the-money options, it should have given Reyes options with a strike price of $26 per share. Instead, it gave Reyes options with a strike price of $13 per share, so that the options were $13 in the money. And it reported the grant as if it had given Reyes at-the-money options when the stock price was $13 per share.

Had Brocade given Reyes at-the-money options at a strike price of $26 per share, the Black Scholes value of the option grant would have been approximately $26 million. But because the options were $13 million in the money, they were even more valuable. According to one estimate, they were worth $28 million. Thus, if analysts had been told that Reyes received options with a strike price of $13 when the stock was trading for $26, they would have reported their value as $28 million rather than $13.2 million. In short, backdating this particular option grant, in the scenario just described, would have enabled Brocade to give Reyes $2 million more in options (Black Scholes value) while reporting an amount that was $15 million less.

While stock options backdating isn’t (easily) possible anymore, there might be other loopholes or consequences of tax law that make options a better deal than cash. I could only think of one reason off the top of my head, so I spent a couple weeks asking folks (including multiple founders) for their non-cynical reasons why startups might prefer options to an equivalent amount of cash.

Tax benefit of ISOs

In the U.S., incentive stock options (ISOs) have the property that, if held for one year after the exercise date and two years after the grant date, the owner of the option pays long-term capital gains tax instead of ordinary income tax on the difference between the exercise price and the strike price. In general, capital gains are taxed at a lower rate than ordinary income.

This isn’t quite as good as it sounds because the difference between the exercise price and the strike price is subject to the Alternative Minimum Tax (AMT). I don’t find this personally relevant since I prefer to sell employer stock as quickly as possible in order to be as diversified as possible, but if you’re interested in figuring out how the AMT affects your tax bill when you exercise ISOs, see this explanation for more details. For people in California, California also has a relatively poor treatment of capital gains at the state level, which also makes this difference smaller than you might expect from looking at capital gains vs. ordinary income tax rates.
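As a very rough illustration of why the capital gains treatment matters, here's the arithmetic with hypothetical rates (the real numbers depend on your income, your state, and the AMT calculation, which this ignores):

spread = 100_000      # hypothetical gain being taxed
ordinary_rate = 0.37  # hypothetical top federal ordinary income rate
ltcg_rate = 0.20      # hypothetical top federal long-term capital gains rate

print(spread * ordinary_rate)  # $37k of tax if taxed as ordinary income
print(spread * ltcg_rate)      # $20k of tax with long-term capital gains treatment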

Tax benefit of QSBS

There’s a certain class of stock that is exempt from federal capital gains tax and state tax in many states (though not in CA). This is interesting, but it seems like people rarely take advantage of this when eligible, and many startups aren’t eligible.

Tax benefit of other options

The IRS says:

Most nonstatutory options don't have a readily determinable fair market value. For nonstatutory options without a readily determinable fair market value, there's no taxable event when the option is granted but you must include in income the fair market value of the stock received on exercise, less the amount paid, when you exercise the option. You have taxable income or deductible loss when you sell the stock you received by exercising the option. You generally treat this amount as a capital gain or loss.

Valuations are bogus

One quirk of stock options is that, to qualify as ISOs, the strike price must be at least the fair market value. That’s easy to determine for public companies, but the fair market value of a share in a private company is somewhat arbitrary. For ISOs, my reading of the requirement is that companies must make “an attempt, made in good faith” to determine the fair market value. For other types of options, other regulation determines the definition of fair market value. Either way, startups usually go to an outside firm between 1 and N times a year to get an estimate of the fair market value for their common stock. This results in at least two possible gaps between a hypothetical “real” valuation and the fair market value for options purposes.

First, the valuation is updated relatively infrequently. A common pitch I’ve heard is that the company hasn’t had its valuation updated for ages, and the company is worth twice as much now, so you’re basically getting a 2x discount.

Second, the firms doing the valuations are poorly incentivized to produce “correct” valuations. The firms are paid by startups, which gain something when the legal valuation is as low as possible.

I don’t really believe that these things make options amazing, because I hear these exact things from startups and founders, which means that their offers take these into account and are priced accordingly. However, if there’s a large gap between the legal valuation and the “true” valuation and this allows companies to effectively give out higher compensation, the way stock option backdating did, I could see how this would tilt companies towards favoring options.

Control

Even if employees got the same class of stock that VCs get, founders would retain less control if they transferred the equity from employees to VCs because employee-owned equity is spread between a relatively large number of people.

Retention

This answer was commonly given to me as a non-cynical reason. The idea is that, if you offer employees options and have a clause that prevents them from selling options on a secondary market, many employees won’t be able to leave without walking away from the majority of their compensation. Personally, this strikes me as a cynical reason, but that’s not how everyone sees it. For example, Andreessen Horowitz managing partner Scott Kupor recently proposed a scheme under which employees would lose their options under all circumstances if they leave before a liquidity event, supposedly in order to help employees.

Whether or not you view employers being able to lock in employees for indeterminate lengths of time as good or bad, options lock-in appears to be a poor retention mechanism — companies that pay cash seem to have better retention. Just for example, Netflix pays salaries that are comparable to the total compensation in the senior band at places like Google and, anecdotally, they seem to have less attrition than trendy Bay Area startups. In fact, even though Netflix makes a lot of noise about showing people the door if they’re not a good fit, they don’t appear to have a higher involuntary attrition rate than trendy Bay Area startups — they just seem more honest about it, something which they can do because their recruiting pitch doesn’t involve you walking away with below-market compensation if you leave. If you think this comparison is unfair because Netflix hasn’t been a startup in recent memory, you can compare to finance startups, e.g. Headlands, which was founded in the same era as Uber, Airbnb, and Stripe. They (and some other finance startups) pay out hefty sums of cash and this does not appear to result in higher attrition than similarly aged startups which give out illiquid option grants.

In the cases where this results in the employee staying longer than they otherwise would, options lock-in is often a bad deal for all parties involved. The situation is obviously bad for employees and, on average, companies don’t want unhappy people who are just waiting for a vesting cliff or liquidity event.

Incentive alignment

Another commonly stated reason is that, if you give people options, they’ll work harder because they’ll do well when the company does well. This was the reason that was given most vehemently (“you shouldn’t trust someone who’s only interested in a paycheck”, etc.)

However, as far as I can tell, paying people in options almost totally decouples job performance and compensation. If you look at companies that have made a lot of people rich, like Microsoft, Google, Apple, and Facebook, almost none of the employees who became rich had an instrumental role in the company’s success. Google and Microsoft each made thousands of people rich, but the vast majority of those folks just happened to be in the right place at the right time and could have just as easily taken a different job where they didn't get rich. Conversely, the vast majority of startup option packages end up being worth little to nothing, but nearly none of the employees whose options end up being worthless were instrumental in causing their options to become worthless.

If options are a large fraction of compensation, choosing a company that’s going to be successful is much more important than working hard. For reference, Microsoft is estimated to have created roughly 10^3 millionaires by 1992 (adjusted for inflation, a 1992 million is roughly $1.75M today). The stock then went up by more than 20x. Microsoft was legendary for making people who didn't particularly do much rich; all told, it's been estimated that they made 10^4 people rich by the late 90s. The vast majority of those people were no different from people in similar roles at Microsoft's competitors. They just happened to pick a winning lottery ticket. This is the opposite of what founders claim they get out of giving options. As above, companies that pay cash, like Netflix, don’t seem to have a problem with employee productivity.

By the way, a large fraction of the people who were made rich by working at Microsoft joined after their IPO, which was in 1986. The same is true of Google, and while Facebook is too young for us to have a good idea what the long-term post-IPO story is, the folks who joined a year or two after the IPO (5 years ago, in 2012) have done quite well for themselves. People who joined pre-IPO have done better, but as mentioned above, most people have diminishing returns to individual wealth. The same power-law-like distribution that makes VC work also means that it's entirely plausible that Microsoft alone made more post-IPO people rich from 1986-1999 than all pre-IPO tech companies combined during that period. Something similar is plausibly true for Google from 2004 until FB's IPO in 2012, even including the people who got rich from FB's IPO as people who were made rich by a pre-IPO company, and you can do a similar calculation for Apple.

VC firms vs. the market

There are several potential counter-arguments to the statement that VC returns (and therefore startup equity) don’t beat the market.

One argument is, when people say that, they typically mean that after VCs take their fees, returns to VC funds don’t beat the market. As an employee who gets startup options, you don’t (directly) pay VC fees, which means you can beat the market by keeping the VC fees for yourself.

Another argument is that some investors (like YC) seem to consistently do pretty well. If you join a startup that’s funded by savvy investors, you too can do pretty well. For this to make sense, you have to realize that the company is worth more than “expected” while the company itself doesn’t have the same realization, because you need the company to give you an option package without properly accounting for its value. For you to have that expectation and get a good deal, the founders must not only not be overconfident in the company’s probability of success, they actually have to be underconfident. While this isn’t impossible, the majority of startup offers I hear about have the opposite problem.

Investing

This section is an update written in 2020. This post was originally written when I didn't realize that it was possible for people who aren't extremely wealthy to invest in startups. But once I moved to SF, I found that it's actually very easy to invest in startups and that you don't have to be particularly wealthy (for a programmer) to do so — people will often take small checks (as small as $5k or sometimes even less) in seed rounds. If you can invest directly in a seed round, this is a strictly better deal than joining as an early employee.

As of this writing, it's quite common for companies to raise a seed round at a $10M valuation. This means you'd have to invest $100k to get 1%, or about as much equity as you'd expect to get as a very early employee. However, if you were to join the company, your equity would vest over four years, you'd get a worse class of equity, and you'd (typically) get much less information about the share structure of the company. As an investor, you only need to invest $25k to get 1 year's worth of early employee equity. Moreover, you can invest in multiple companies, which gives you better risk-adjusted returns. At rates big companies are paying today (mid-band of perhaps $380k/yr for senior engineer, $600k/yr for staff engineer), working at a big company and spending $25k/yr investing in startups is strictly superior to working at a startup from the standpoint of financial return.
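The arithmetic behind that comparison, as a sketch using the valuation and vesting schedule mentioned above:

post_money = 10e6            # seed round at a $10M valuation
early_employee_grant = 0.01  # ~1% of the company, vesting over 4 years

check_for_one_percent = post_money * early_employee_grant
print(check_for_one_percent)      # $100k buys the full 1% outright
print(check_for_one_percent / 4)  # $25k ≈ one year's worth of employee vesting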

Conclusion

There are a number of factors that can make options more or less valuable than they seem. From an employee standpoint, the factors that make options more valuable than they seem can cause equity to be worth tens of percent more than a naive calculation. The factors that make options less valuable than they seem do so in ways that mostly aren’t easy to quantify.

Whether the factors that make options relatively more valuable dominate or the factors that make options relatively less valuable dominate is an empirical question. My intuition is that the factors that make options relatively less valuable are stronger, but that’s just a guess. A way to get an idea about this from public data would be to go through successful startup S-1 filings. Since this post is already ~5k words, I’ll leave that for another post, but I’ll note that in my preliminary skim of a handful of 99%-ile exits (> $1B), the median employee seems to do worse than someone who’s on the standard Facebook/Google/Amazon career trajectory.

From a company standpoint, there are a couple factors that allow companies to retain more leverage/control by giving relatively more options to employees and relatively less equity to investors.

All of this sounds fine for founders and investors, but I don’t see what’s in it for employees. If you have additional reasons that I’m missing, I’d love to hear them.

If you liked this post, you may also like this other post on the tradeoff between working at a big company and working at a startup.

Appendix: caveats

Many startups don’t claim that their offers are financially competitive. As time goes on, I hear less “If you wanted to get rich, how would you do it? I think your best bet would be to start or join a startup. That's been a reliable way to get rich for hundreds of years.” and more “we’re not financially competitive with Facebook, but ... ”. I’ve heard from multiple founders that joining as an early employee is an incredibly bad deal when you compare early-employee equity and workload vs. founder equity and workload.

Some startups are giving out offers that are actually competitive with large company offers. Something I’ve seen from startups that are trying to give out compelling offers is that, for “senior” folks, they’re willing to pay substantially higher salaries than public companies because it’s understood that options aren’t great for employees because of their timeline, risk profile, and expected value.

There’s a huge amount of variation in offers, much of which is effectively random. I know of cases where an individual got a more lucrative offer from a startup (that doesn’t tend to give particularly strong offers) than from Google, and if you ask around you’ll hear about a lot of cases like that. It’s not always true that startup offers are lower than Google/Facebook/Amazon offers, even at startups that don’t pay competitively (on average).

Anything in this post that’s related to taxes is U.S. specific. For example, I’m told that in Canada, “you can defer the payment of taxes when exercising options whose strike price is way below fair market valuation until disposition, as long as the company is Canadian-controlled and operated in Canada”.

You might object that the same line of reasoning we looked at for options can be applied to RSUs, even RSUs for public companies. That’s true, and although the largest downsides of startup options are mitigated or non-existent with RSUs, cash still has significant advantages to employees over RSUs. Unfortunately, the only non-finance company I know of that uses this to their advantage in recruiting is Netflix; please let me know if you can think of other tech companies that use the same compensation model.

Some startups have a sliding scale that lets you choose different amounts of option/salary compensation. I haven't seen an offer that will let you put the slider to 100% cash and 0% options (or 100% options and 0% cash), but someone out there will probably be willing to give you an all-cash offer.

In the current environment, looking at public exits may bias the data towards less successful companies. The most successful startups from the last couple generations of startups that haven't exited by acquisition have so far chosen not to IPO. It's possible that, once all the data are in, the average returns to joining a startup will look quite different (although I doubt the median return will change much).

BTW, I don't have anything against taking a startup offer, even if it's low. When I graduated from college, I took the lowest offer I had, and my partner recently took the lowest offer she got (nearly a 2x difference over the highest offer). There are plenty of reasons you might want to take an offer that isn't the best possible financial offer. However, I think you should know what you're getting into and not take an offer that you think is financially great when it's merely mediocre or even bad.

Appendix: non-counterarguments

The most common objection I’ve heard to this is that most startups don’t have enough money to pay equivalent cash and couldn’t raise that much money by selling off what would “normally” be their employee option pool. Maybe so, but that’s not a counter-argument — it’s an argument that most startups don’t have options that are valuable enough to be exchanged for the equivalent sum of money, i.e., that the options simply aren’t as valuable as claimed. This argument can be phrased in a variety of ways (e.g., paying salary instead of options increases burn rate, reduces runway, makes the startup default dead, etc.), but arguments of this form are fundamentally equivalent to admitting that startup options aren’t worth much, because these arguments wouldn't hold up if the options were worth enough that a typical compensation package was worth as much as a typical "senior" offer at Google or Facebook.

If you don't buy this, imagine a startup with a typical valuation that's at a stage where they're giving out 0.1% equity in options to new hires. Now imagine that some irrational bystander is willing to make a deal where they take 0.1% of the company for $1B. Is it worth it to take the money and pay people out of the $1B cash pool instead of paying people with 0.1% slices of the option pool? Your answer should be yes, unless you believe that the ratio between the value of cash on hand and equity is nearly infinite. Absolute statements like "options are preferred to cash because paying cash increases burn rate, making the startup default dead" at any valuation are equivalent to stating that the correct ratio is infinity. That's clearly nonsensical; there's some correct ratio, and we might disagree over what the correct ratio is, but for typical startups it should not be the case that the correct ratio is infinite. Since this was such a common objection, if you have this objection, my question to you is, why don't you argue that startups should pay even less cash and even more options? Is the argument that the current ratio is exactly optimal, and if so, why? Also, why does the ratio vary so much between different companies at the same stage which have raised roughly the same amount of money? Are all of those companies giving out optimal deals?

The second most common objection is that startup options are actually worth a lot, if you pick the right startup and use a proper model to value the options. Perhaps, but if that’s true, why couldn’t they have raised a bit more money by giving away more equity to VCs at its true value, and then pay cash?

Another common objection is something like "I know lots of people who've made $1m from startups". Me too, but I also know lots of people who've made much more than that working at public companies. This post is about the relative value of compensation packages, not the absolute value.

Acknowledgements

Thanks to Leah Hanson, Ben Kuhn, Tim Abbott, David Turner, Nick Bergson-Shilcock, Peter Fraenkel, Joe Ardent, Chris Ball, Anton Dubrau, Sean Talts, Danielle Sucher, Dan McKinley, Bert Muthalaly, Dan Puttick, Indradhanush Gupta, and Gaxun for comments and corrections.

How web bloat impacts users with slow connections

2017-02-08 08:00:00

A couple years ago, I took a road trip from Wisconsin to Washington and mostly stayed in rural hotels on the way. I expected the internet in rural areas too sparse to have cable internet to be slow, but I was still surprised that a large fraction of the web was inaccessible. Some blogs with lightweight styling were readable, as were pages by academics who hadn’t updated the styling on their website since 1995. But very few commercial websites were usable (other than Google). When I measured my connection, I found that the bandwidth was roughly comparable to what I got with a 56k modem in the 90s. The latency and packetloss were significantly worse than the average day on dialup: latency varied between 500ms and 1000ms and packetloss varied between 1% and 10%. Those numbers are comparable to what I’d see on dialup on a bad day.

Despite my connection being only a bit worse than it was in the 90s, the vast majority of the web wouldn’t load. Why shouldn’t the web work with dialup or a dialup-like connection? It would be one thing if I tried to watch youtube and read pinterest. It’s hard to serve videos and images without bandwidth. But my online interests are quite boring from a media standpoint. Pretty much everything I consume online is plain text, even if it happens to be styled with images and fancy javascript. In fact, I recently tried using w3m (a terminal-based web browser that, by default, doesn’t support css, javascript, or even images) for a week and it turns out there are only two websites I regularly visit that don’t really work in w3m (twitter and zulip, both fundamentally text based sites, at least as I use them)1.

More recently, I was reminded of how poorly the web works for people on slow connections when I tried to read a joelonsoftware post while using a flaky mobile connection. The HTML loaded but either one of the five CSS requests or one of the thirteen javascript requests timed out, leaving me with a broken page. Instead of seeing the article, I saw three entire pages of sidebar, menu, and ads before getting to the title because the page required some kind of layout modification to display reasonably. Pages are often designed so that they're hard or impossible to read if some dependency fails to load. On a slow connection, it's quite common for at least one dependency to fail. After refreshing the page twice, the page loaded as it was supposed to and I was able to read the blog post, a fairly compelling post on eliminating dependencies.

Complaining that people don’t care about performance like they used to and that we’re letting bloat slow things down for no good reason is “old man yells at cloud” territory; I probably sound like that dude who complains that his word processor, which used to take 1MB of RAM, takes 1GB of RAM. Sure, that could be trimmed down, but there’s a real cost to spending time doing optimization and even a $300 laptop comes with 2GB of RAM, so why bother? But it’s not quite the same situation -- it’s not just nerds like me who care about web performance. When Microsoft looked at actual measured connection speeds, they found that half of Americans don't have broadband speed. Heck, AOL had 2 million dial-up subscribers in 2015, just AOL alone. Outside of the U.S., there are even more people with slow connections. I recently chatted with Ben Kuhn, who spends a fair amount of time in Africa, about his internet connection:

I've seen ping latencies as bad as ~45 sec and packet loss as bad as 50% on a mobile hotspot in the evenings from Jijiga, Ethiopia. (I'm here now and currently I have 150ms ping with no packet loss but it's 10am). There are some periods of the day where it ~never gets better than 10 sec and ~10% loss. The internet has gotten a lot better in the past ~year; it used to be that bad all the time except in the early mornings.

Speedtest.net reports 2.6 mbps download, 0.6 mbps upload. I realized I probably shouldn't run a speed test on my mobile data because bandwidth is really expensive.

Our server in Ethiopia has a fiber uplink, but it frequently goes down and we fall back to a 16kbps satellite connection, though I think normal people would just stop using the Internet in that case.

If you think browsing on a 56k connection is bad, try a 16k connection from Ethiopia!

Everything we’ve seen so far is anecdotal. Let’s load some websites that programmers might frequent with a variety of simulated connections to get data on page load times. webpagetest lets us see how long it takes a web site to load (and why it takes that long) from locations all over the world. It even lets us simulate different kinds of connections as well as load sites on a variety of mobile devices. The times listed in the table below are the time until the page is “visually complete”; as measured by webpagetest, that’s the time until the above-the-fold content stops changing.

 #  URL                       Size   C   Load time in seconds
                              (MB)       FIOS  Cable  LTE   3G    2G    Dial  Bad   😱
 0  http://bellard.org        0.01   5   0.40  0.59   0.60  1.2   2.9   1.8   9.5   7.6
 1  http://danluu.com         0.02   2   0.20  0.20   0.40  0.80  2.7   1.6   6.4   7.6
 2  news.ycombinator.com      0.03   1   0.30  0.49   0.69  1.6   5.5   5.0   14    27
 3  danluu.com                0.03   2   0.20  0.40   0.49  1.1   3.6   3.5   9.3   15
 4  http://jvns.ca            0.14   7   0.49  0.69   1.2   2.9   10    19    29    108
 5  jvns.ca                   0.15   4   0.50  0.80   1.2   3.3   11    21    31    97
 6  fgiesen.wordpress.com     0.37  12   1.0   1.1    1.4   5.0   16    66    68    FAIL
 7  google.com                0.59   6   0.80  1.8    1.4   6.8   19    94    96    236
 8  joelonsoftware.com        0.72  19   1.3   1.7    1.9   9.7   28    140   FAIL  FAIL
 9  bing.com                   1.3  12   1.4   2.9    3.3   11    43    134   FAIL  FAIL
10  reddit.com                 1.3  26   7.5   6.9    7.0   20    58    179   210   FAIL
11  signalvnoise.com           2.1   7   2.0   3.5    3.7   16    47    173   218   FAIL
12  amazon.com                 4.4  47   6.6   13     8.4   36    65    265   300   FAIL
13  steve-yegge.blogspot.com   9.7  19   2.2   3.6    3.3   12    36    206   188   FAIL
14  blog.codinghorror.com       23  24   6.5   15     9.5   83    235   FAIL  FAIL  FAIL

Each row is a website. For sites that support both plain HTTP as well as HTTPS, both were tested; URLs are HTTPS except where explicitly specified as HTTP. After the URL, the Size and C columns show the amount of data transferred over the wire in MB (which includes headers, handshaking, compression, etc.) and the number of TCP connections made. The rest of the columns show the time in seconds to load the page on a variety of connections from fiber (FIOS) to less good connections. “Bad” has the bandwidth of dialup, but with 1000ms ping and 10% packetloss, which is roughly what I saw when using the internet in small rural hotels. “😱” simulates a 16kbps satellite connection from Jijiga, Ethiopia. Rows are sorted by the measured amount of data transferred.
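
To get a feel for why the “Bad” connection is so hopeless even though its nominal bandwidth matches dialup, here's a back-of-envelope sketch (not a measurement from these tests): the classic Mathis et al. bound says a single TCP connection's steady-state throughput is roughly capped at MSS / (RTT × √p), times a constant near 1.2, and that's before counting slow start and retransmission timeouts, which make things even worse.

  from math import sqrt

  # Rough ceiling on a single TCP connection's throughput under steady packet loss
  # (Mathis et al.): throughput <= C * MSS / (RTT * sqrt(p)), with C ~= 1.22.
  MSS = 1460        # bytes per segment, typical for Ethernet-sized MTUs (an assumption)
  MATHIS_C = 1.22

  def tcp_ceiling_kbps(rtt_s, loss):
      """Upper bound on throughput in kilobits per second."""
      return MATHIS_C * MSS * 8 / (rtt_s * sqrt(loss)) / 1000

  print(f"1000ms RTT, 10% loss ('Bad'): ~{tcp_ceiling_kbps(1.0, 0.10):.0f} kbps")
  print(f"1000ms RTT,  1% loss:         ~{tcp_ceiling_kbps(1.0, 0.01):.0f} kbps")

Even with unlimited bandwidth, 1000ms of latency and 10% packetloss cap each connection at roughly dialup speed, which is a big part of why pages that need dozens of requests to render never finish on the “Bad” connection.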

The timeout for tests was 6 minutes; anything slower than that is listed as FAIL. Pages that failed to load are also listed as FAIL. A few things that jump out from the table are:

  1. A large fraction of the web is unusable on a bad connection. Even on a good (0% packetloss, no ping spike) dialup connection, some sites won’t load.
  2. Some sites will use a lot of data!

The web on bad connections

As commercial websites go, Google is basically as good as it gets for people on a slow connection. On dialup, the 50%-ile page load time is a minute and a half. But at least it loads -- when I was on a slow, shared, satellite connection in rural Montana, virtually no commercial websites would load at all. I could view websites that only had static content via Google cache, but the live site had no hope of loading.

Some sites will use a lot of data

Although only two really big sites were tested here, there are plenty of sites that will use 10MB or 20MB of data. If you’re reading this from the U.S., maybe you don’t care, but if you’re browsing from Mauritania, Madagascar, or Vanuatu, loading codinghorror once will cost you more than 10% of the daily per capita GNI.

Page weight matters

Despite the best efforts of Maciej, the meme that page weight doesn’t matter keeps getting spread around. AFAICT, the top HN link of all time on web page optimization is to an article titled “Ludicrously Fast Page Loads - A Guide for Full-Stack Devs”. At the bottom of the page, the author links to another one of his posts, titled “Page Weight Doesn’t Matter”.

Usually, the boogeyman that gets pointed at is bandwidth: users in low-bandwidth areas (3G, developing world) are getting shafted. But the math doesn’t quite work out. Akamai puts the global connection speed average at 3.9 megabits per second.

The “ludicrously fast” guide fails to display properly on dialup or slow mobile connections because the images time out. On reddit, it also fails under load: "Ironically, that page took so long to load that I closed the window.", "a lot of … gifs that do nothing but make your viewing experience worse", "I didn't even make it to the gifs; the header loaded then it just hung.", etc.

The flaw in the “page weight doesn’t matter because average speed is fast” argument is that if you average the connection of someone in my apartment building (which is wired for 1Gbps internet) and someone on 56k dialup, you get an average speed of 500 Mbps. That doesn’t mean the person on dialup is actually going to be able to load a 5MB website. The average speed of 3.9 Mbps comes from a 2014 Akamai report, but it’s just an average. If you look at Akamai’s 2016 report, you can find entire countries where more than 90% of IP addresses are slower than that!

Yes, there are a lot of factors besides page weight that matter, and yes it's possible to create a contrived page that's very small but loads slowly, as well as a huge page that loads ok because all of the weight isn't blocking, but total page weight is still pretty decently correlated with load time.

Since its publication, the "ludicrously fast" guide was updated with some javascript that only loads images if you scroll down far enough. That makes it look a lot better on webpagetest if you're looking at the page size number (if webpagetest isn't being scripted to scroll), but it's a worse user experience for people on slow connections who want to read the page. If you're going to read the entire page anyway, the weight increases, and you can no longer preload images by loading the site. Instead, if you're reading, you have to stop for a few minutes at every section to wait for the images from that section to load. And that's if you're lucky and the javascript for loading images didn't fail to load.

The average user fallacy

Just like many people develop with an average connection speed in mind, many people have a fixed view of who a user is. Maybe they think there are customers with a lot of money with fast connections and customers who won't spend money on slow connections. That is, very roughly speaking, perhaps true on average, but sites don't operate on average, they operate in particular domains. Jamie Brandon writes the following about his experience with Airbnb:

I spent three hours last night trying to book a room on airbnb through an overloaded wifi and presumably a satellite connection. OAuth seems to be particularly bad over poor connections. Facebook's OAuth wouldn't load at all and Google's sent me round a 'pick an account' -> 'please reenter you password' -> 'pick an account' loop several times. It took so many attempts to log in that I triggered some 2fa nonsense on airbnb that also didn't work (the confirmation link from the email led to a page that said 'please log in to view this page') and eventually I was just told to send an email to [email protected], who haven't replied.

It's particularly galling that airbnb doesn't test this stuff, because traveling is pretty much the whole point of the site so they can't even claim that there's no money in servicing people with poor connections.

What about tail latency?

My original plan for this post was to show 50%-ile, 90%-ile, 99%-ile, etc., tail load times. But the 50%-ile results are so bad that I don’t know if there’s any point to showing the other results. If you were to look at the 90%-ile results, you’d see that most pages fail to load on dialup and the “Bad” and “😱” connections are hopeless for almost all sites.

HTTP vs. HTTPS

 #  URL                   Size   C   Load time in seconds
                          (kB)       FIOS  Cable  LTE   3G    2G    Dial  Bad   😱
 1  http://danluu.com     21.1   2   0.20  0.20   0.40  0.80  2.7   1.6   6.4   7.6
 3  https://danluu.com    29.3   2   0.20  0.40   0.49  1.1   3.6   3.5   9.3   15

You can see that for a very small site that doesn’t load many blocking resources, HTTPS is noticeably slower than HTTP, especially on slow connections. Practically speaking, this doesn’t matter today because virtually no sites are that small, but if you design a web site as if people with slow connections actually matter, this is noticeable.
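
A rough sketch of where that gap comes from (my own back-of-envelope, not something measured by these tests): a cold HTTPS connection needs extra round trips for the TLS handshake before the first byte of HTML arrives, and at 1000ms RTT every round trip hurts; packet loss then multiplies the damage because a lost handshake packet costs a retransmission timeout.

  # Extra time-to-first-byte from a cold TLS handshake, assuming TLS 1.2's ~2 extra
  # round trips (TLS 1.3 needs fewer). The RTTs other than "Bad" are illustrative guesses.
  def extra_tls_delay_s(rtt_s, tls_round_trips=2):
      return tls_round_trips * rtt_s

  print(f"fast connection (20ms RTT):    ~{extra_tls_delay_s(0.02):.2f}s")
  print(f"'Bad' connection (1000ms RTT): ~{extra_tls_delay_s(1.0):.1f}s")

That's in the right ballpark for the roughly three second HTTP-to-HTTPS gap in the “Bad” column; retransmissions under 10% packetloss plausibly account for much of the rest.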

How to make pages usable on slow connections

The long version is, to really understand what’s going on, consider reading High Performance Browser Networking, a great book on web performance that’s available for free.

The short version is that most sites are so poorly optimized that someone who has no idea what they’re doing can get a 10x improvement in page load times for a site whose job is to serve up text with the occasional image. When I started this blog in 2013, I used Octopress because Jekyll/Octopress was the most widely recommended static site generator back then. A plain blog post with one or two images took 11s to load on a cable connection because the Octopress defaults included multiple useless javascript files in the header (for never-used-by-me things like embedding flash videos and delicious integration), which blocked page rendering. Just moving those javascript includes to the footer halved page load time, and making a few other tweaks decreased page load time by another order of magnitude. At the time I made those changes, I knew nothing about web page optimization, other than what I heard during a 2-minute blurb on optimization from a 40-minute talk on how the internet works and I was able to get a 20x speedup on my blog in a few hours. You might argue that I’ve now gone too far and removed too much CSS, but I got a 20x speedup for people on fast connections before making changes that affected the site’s appearance (and the speedup on slow connections was much larger).
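
A quick way to spot this class of problem on your own site is to look for external scripts in the <head> that aren't marked async or defer, since those block rendering until they're downloaded and executed. Here's a crude sketch (regexes rather than a real HTML parser, and it assumes a lowercase </head> tag, so treat it as a rough check rather than a reliable tool):

  import re
  import urllib.request

  # Crude check: external scripts in <head> without async/defer block rendering.
  url = "https://example.com"  # placeholder; point this at your own site
  html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
  head = html.split("</head>", 1)[0]
  scripts = re.findall(r"<script\b[^>]*>", head, re.IGNORECASE)
  blocking = [s for s in scripts if "src" in s and "async" not in s and "defer" not in s]
  print(f"{len(blocking)} render-blocking external script(s) in <head>")
  for s in blocking:
      print(" ", s)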

That’s normal. Popular themes for many different kinds of blogging software and CMSs contain anti-optimizations so blatant that any programmer, even someone with no front-end experience, can find large gains by just pointing webpagetest at their site and looking at the output.
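
webpagetest also has an HTTP API, so pointing it at your site doesn't have to be a one-off manual exercise. Here's a minimal sketch in Python; the endpoint, parameters, and response fields reflect my understanding of the public API and may have changed, so treat this as a starting point rather than a complete client:

  import requests  # third-party; pip install requests

  WPT = "https://www.webpagetest.org"
  API_KEY = "YOUR_API_KEY"  # placeholder, not a real key

  # Kick off a first-view test and print where the results will appear.
  resp = requests.get(
      f"{WPT}/runtest.php",
      params={"url": "https://example.com", "k": API_KEY, "f": "json"},
      timeout=30,
  )
  resp.raise_for_status()
  data = resp.json()["data"]
  print("test id:", data["testId"])
  print("poll for results at:", data["jsonUrl"])

Running something like this after every deploy and flagging regressions in bytes over the wire or time to visually complete would catch most of the low-hanging fruit described above.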

What about browsers?

While it's easy to blame page authors because there's a lot of low-hanging fruit on the page side, there's just as much low-hanging fruit on the browser side. Why does my browser open up 6 TCP connections to try to download six images at once when I'm on a slow satellite connection? That just guarantees that all six images will time out! Even if I tweak the timeout on the client side, servers that are configured to protect against DoS attacks won't allow long-lived connections that aren't doing anything. I can sometimes get some images to load by refreshing the page a few times (and waiting ten minutes each time), but why shouldn't the browser handle retries for me? If you think about it for a few minutes, there are a lot of optimizations that browsers could do for people on slow connections, but because they don't, the best current solution for users appears to be: use w3m when you can, and then switch to a browser with ad-blocking when that doesn't work. But why should users have to use two entirely different programs, one of which has a text-based interface only computer nerds will find palatable?

Conclusion

When I was at Google, someone told me a story about a time that “they” completed a big optimization push only to find that measured page load times increased. When they dug into the data, they found that the reason load times had increased was that they got a lot more traffic from Africa after doing the optimizations. The team’s product went from being unusable for people with slow connections to usable, which caused so many users with slow connections to start using the product that load times actually increased.

Last night, at a presentation on the websockets protocol, Gary Bernhardt made the observation that the people who designed the websockets protocol did things like using a variable length field for frame length to save a few bytes. By contrast, if you look at the Alexa top 100 sites, almost all of them have a huge amount of slop in them; it’s plausible that the total bandwidth used for those 100 sites is greater than the total bandwidth for all websockets connections combined. Despite that, if we just look at the three Alexa top-35 sites tested in this post, two send uncompressed javascript over the wire, two redirect the bare domain to the www subdomain, and two send a lot of extraneous information by not compressing images as much as they could be compressed without sacrificing quality. If you look at twitter, which isn’t in our table but was mentioned above, they actually do an anti-optimization where, if you upload a PNG which isn’t even particularly well optimized, they’ll re-encode it as a jpeg which is larger and has visible artifacts!

“Use bcrypt” has become the mantra for a reasonable default if you’re not sure what to do when storing passwords. The web would be a nicer place if “use webpagetest” caught on in the same way. It’s not always the best tool for the job, but it sure beats the current defaults.

Appendix: experimental caveats

The above tests were done by repeatedly loading pages via a private webpagetest image in AWS west 2, on a c4.xlarge VM, with simulated connections on a first page load in Chrome with no other tabs open and nothing running on the VM other than the webpagetest software and the browser. This is unrealistic in many ways.

In relative terms, this disadvantages sites that have a large edge presence. When I was in rural Montana, I ran some tests and found that I had noticeably better latency to Google than to basically any other site. This is not reflected in the test results. Furthermore, this setup means that pages are nearly certain to be served from a CDN cache. That shouldn't make any difference for sites like Google and Amazon, but it reduces the page load time of less-trafficked sites that aren't "always" served out of cache. For example, when I don't have a post trending on social media, between 55% and 75% of traffic is served out of a CDN cache, and when I do have something trending on social media, it's more like 90% to 99%. But the test setup means that the CDN cache hit rate during the test is likely to be > 99% for my site and other blogs which aren't so widely read that they'd normally always have a cached copy available.

All tests were run assuming a first page load, but it’s entirely reasonable for sites like Google and Amazon to assume that many or most of their assets are cached. Testing first page load times is perhaps reasonable for sites with a traffic profile like mine, where much of the traffic comes from social media referrals of people who’ve never visited the site before.

A c4.xlarge is a fairly powerful machine. Today, most page loads come from mobile and even the fastest mobile devices aren’t as fast as a c4.xlarge; most mobile devices are much slower than the fastest mobile devices. Most desktop page loads will also be from a machine that’s slower than a c4.xlarge. Although the results aren’t shown, I also ran a set of tests using a t2.micro instance: for simple sites, like mine, the difference was negligible, but for complex sites, like Amazon, page load times were as much as 2x worse. As you might expect, for any particular site, the difference got smaller as the connection got slower.

As Joey Hess pointed out, many dialup providers attempt to do compression or other tricks to reduce the effective weight of pages and none of these tests take that into account.

Firefox, IE, and Edge often have substantially different performance characteristics from Chrome. For that matter, different versions of Chrome can have different performance characteristics. I just used Chrome because it’s the most widely used desktop browser, and running this set of tests with a single browser already took over a full day of VM time.

The simulated bad connections add a constant latency and fixed (10%) packetloss. In reality, poor connections have highly variable latency with peaks that are much higher than the simulated latency and periods of much higher packetloss that can last for minutes, hours, or days. Putting 😱 at the rightmost side of the table may make it seem like the worst possible connection, but packetloss can get much worse.

Similarly, while codinghorror happens to be at the bottom of the table, it's nowhere near being the slowest loading page. Just for example, I originally considered including slashdot in the table but it was so slow that it caused a significant increase in total test run time because it timed out at six minutes so many times. Even on FIOS it takes 15s to load by making a whopping 223 requests over 100 TCP connections despite weighing in at "only" 1.9MB. Amazingly, slashdot also pegs the CPU at 100% for 17 entire seconds while loading on FIOS. In retrospect, this might have been a good site to include because it's pathologically mis-optimized sites like slashdot that allow the "page weight doesn't matter" meme to sound reasonable.

The websites compared don't do the same thing. Just looking at the blogs, some blogs put entire blog entries on the front page, which is more convenient in some ways, but also slower. Commercial sites are even more different -- they often can't reasonably be static sites and have to have relatively large javascript payloads in order to work well.

Appendix: irony

The main table in this post is almost 50kB of HTML (without compression or minification); that’s larger than everything else in this post combined. That table is curiously large because I used a library (pandas) to generate the table instead of just writing a script to do it by hand, and as we know, the default settings for most libraries generate a massive amount of bloat. It didn’t even save time because every single built-in time-saving feature that I wanted to use was buggy, which forced me to write all of the heatmap/gradient/styling code myself anyway! Due to laziness, I left the pandas table generating scaffolding code, resulting in a table that looks like it’s roughly an order of magnitude larger than it needs to be.

This isn't a criticism of pandas. Pandas is probably quite good at what it's designed for; it's just not designed to produce slim websites. The CSS class names are huge, which is reasonable if you want to avoid accidental name collisions for generated CSS. Almost every td, th, and tr element is tagged with a redundant rowspan=1 or colspan=1, which is reasonable for generated code if you don't care about size. Each cell has its own CSS class, even though many cells share styling with other cells; again, this probably simplified things on the code generation side. Every piece of bloat is totally reasonable. And unfortunately, there's no tool that I know of that will take a bloated table and turn it into a slim table. A pure HTML minifier can't change the class names because it doesn't know that some external CSS or JS doesn't depend on the class name. An HTML minifier could theoretically determine that different cells have the same styling and merge them, except for the aforementioned problem with potential but non-existent external dependencies, but that's beyond the capability of the tools I know of.
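
As a rough illustration of the general shape of the problem (a sketch, not the actual code used to generate the table above, which went through pandas' styling machinery): even for a tiny table, the default pandas HTML export comes out a few times larger than a hand-rolled equivalent, and the gap grows once per-cell classes and styling hooks are added.

  import pandas as pd

  df = pd.DataFrame({"wire (MB)": [0.01, 0.02], "FIOS": [0.40, 0.20]},
                    index=["bellard.org", "danluu.com"])

  # Default export: thead/tbody, a class on the table, per-cell tags with indentation --
  # fine for a notebook, heavy for a page you're trying to keep small.
  generated = df.to_html()

  # A hand-rolled equivalent with no classes or styling hooks.
  header = "<tr><th></th>" + "".join(f"<th>{c}</th>" for c in df.columns) + "</tr>"
  body = "".join(
      "<tr><td>" + name + "</td>" + "".join(f"<td>{v}</td>" for v in row) + "</tr>"
      for name, row in df.iterrows()
  )
  handwritten = "<table>" + header + body + "</table>"

  print(len(generated), "characters generated vs.", len(handwritten), "written by hand")

The table above is worse than this sketch because the styling path adds a per-cell class and redundant rowspan/colspan attributes on top.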

For another level of irony, consider that while I think of a 50kB table as bloat, this page is 12kB when gzipped, even with all of the bloat. Google's AMP currently has > 100kB of blocking javascript that has to load before the page loads! There's no reason for me to use AMP pages because AMP is slower than my current setup of pure HTML with a few lines of embedded CSS and the occasional image, but, as a result, I'm penalized by Google (relative to AMP pages) for not "accelerating" (decelerating) my page with AMP.

Thanks to Leah Hanson, Jason Owen, Ethan Willis, and Lindsey Kuper for comments/corrections


  1. excluding internal Microsoft stuff that’s required for work. Many of the sites are IE only and don’t even work in Edge. I didn’t try those sites in w3m but I doubt they’d work! In fact, I doubt that even half of the non-IE specific internal sites would work in w3m. [return]

HN: the good parts

2016-10-23 08:00:00

HN comments are terrible. On any topic I’m informed about, the vast majority of comments are pretty clearly wrong. Most of the time, there are zero comments from people who know anything about the topic and the top comment is reasonable sounding but totally incorrect. Additionally, many comments are gratuitously mean. You'll often hear mean comments backed up with something like "this is better than the other possibility, where everyone just pats each other on the back with comments like 'this is great'", as if being an asshole is some sort of talisman against empty platitudes. I've seen people push back against that; when pressed, people often say that it’s either impossible or inefficient to teach someone without being mean, as if telling someone that they're stupid somehow helps them learn. It's as if people learned how to explain things by watching Simon Cowell and can't comprehend the concept of an explanation that isn't littered with personal insults. Paul Graham has said, "Oh, you should never read Hacker News comments about anything you write”. Most of the negative things you hear about HN comments are true.

And yet, I haven’t found a public internet forum with better technical commentary. On topics I'm familiar with, while it's rare that a thread will have even a single comment that's well-informed, when those comments appear, they usually float to the top. On other forums, well-informed comments are either non-existent or get buried by reasonable sounding but totally wrong comments when they appear, and they appear even more rarely than on HN.

By volume, there are probably more interesting technical “posts” in comments than in links. Well, that depends on what you find interesting, but that’s true for my interests. If I see a low-level optimization comment from nkurz, a comment on business from patio11, a comment on how companies operate by nostrademons, I almost certainly know that I’m going to read an interesting comment. There are maybe 20 to 30 people I can think of who don’t blog much, but write great comments on HN and I doubt I even know of half the people who are writing great comments on HN1.

I compiled a very abbreviated list of comments I like because comments seem to get lost. If you write a blog post, people will refer to it years later, but comments mostly disappear. I think that’s sad -- there’s a lot of great material on HN (and yes, even more not-so-great material).

What’s the deal with MS Word’s file format?

Basically, the Word file format is a binary dump of memory. I kid you not. They just took whatever was in memory and wrote it out to disk. We can try to reason why (maybe it was faster, maybe it made the code smaller), but I think the overriding reason is that the original developers didn't know any better.

Later as they tried to add features they had to try to make it backward compatible. This is where a lot of the complexity lies. There are lots of crazy workarounds for things that would be simple if you allowed yourself to redesign the file format. It's pretty clear that this was mandated by management, because no software developer would put themselves through that hell for no reason.

Later they added a fast-save feature (I forget what it is actually called). This appends changes to the file without changing the original file. The way they implemented this was really ingenious, but complicates the file structure a lot.

One thing I feel I must point out (I remember posting a huge thing on slashdot when this article was originally posted) is that 2 way file conversion is next to impossible for word processors. That's because the file formats do not contain enough information to format the document. The most obvious place to see this is pagination. The file format does not say where to paginate a text flow (unless it is explicitly entered by the user). It relies on the formatter to do it. Each word processor formats text completely differently. Word, for example, famously paginates footnotes incorrectly. They can't change it, though, because it will break backwards compatibility. This is one of the only reasons that Word Perfect survives today -- it is the only word processor that paginates legal documents the way the US Department of Justice requires.

Just considering the pagination issue, you can see what the problem is. When reading a Word document, you have to paginate it like Word -- only the file format doesn't tell you what that is. Then if someone modifies the document and you need to resave it, you need to somehow mark that it should be paginated like Word (even though it might now have features that are not in Word). If it was only pagination, you might be able to do it, but practically everything is like that.

I recommend reading (a bit of) the XML Word file format for those who are interested. You will see large numbers of flags for things like "Format like Word 95". The format doesn't say what that is -- because it's pretty obvious that the authors of the file format don't know. It's lost in a hopeless mess of legacy code and nobody can figure out what it does now.

Fun with NULL

Here's another example of this fine feature:

  #include <stdio.h>
  #include <string.h>
  #include <stdlib.h>
  #define LENGTH 128

  int main(int argc, char **argv) {
      char *string = NULL;
      int length = 0;
      if (argc > 1) {
          string = argv[1];
          length = strlen(string);
          if (length >= LENGTH) exit(1);
      }

      char buffer[LENGTH];
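      // Passing a null pointer to memcpy is undefined behavior even when length is 0,
      // so the optimizer is allowed to assume from here on that "string" is non-null.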
      memcpy(buffer, string, length);
      buffer[length] = 0;

      if (string == NULL) {
          printf("String is null, so cancel the launch.\n");
      } else {
          printf("String is not null, so launch the missiles!\n");
      }

      printf("string: %s\n", string);  // undefined for null but works in practice

      #if SEGFAULT_ON_NULL
      printf("%s\n", string);          // segfaults on null when bare "%s\n"
      #endif

      return 0;
  }

  nate@skylake:~/src$ clang-3.8 -Wall -O3 null_check.c -o null_check
  nate@skylake:~/src$ null_check
  String is null, so cancel the launch.
  string: (null)

  nate@skylake:~/src$ icc-17 -Wall -O3 null_check.c -o null_check
  nate@skylake:~/src$ null_check
  String is null, so cancel the launch.
  string: (null)

  nate@skylake:~/src$ gcc-5 -Wall -O3 null_check.c -o null_check
  nate@skylake:~/src$ null_check
  String is not null, so launch the missiles!
  string: (null)

It appears that Intel's ICC and Clang still haven't caught up with GCC's optimizations. Ouch if you were depending on that optimization to get the performance you need! But before picking on GCC too much, consider that all three of those compilers segfault on printf("string: "); printf("%s\n", string) when string is NULL, despite having no problem with printf("string: %s\n", string) as a single statement. Can you see why using two separate statements would cause a segfault? If not, see here for a hint: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25609

How do you make sure the autopilot backup is paying attention?

Good engineering eliminates users being able to do the wrong thing as much as possible. . . . You don't design a feature that invites misuse and then use instructions to try to prevent that misuse.

There was a derailment in Australia called the Waterfall derailment [1]. It occurred because the driver had a heart attack and was responsible for 7 deaths (a miracle it was so low, honestly). The root cause was the failure of the dead-man's switch.

In the case of Waterfall, the driver had 2 dead-man switches he could use - 1) the throttle handle had to be held against a spring at a small rotation, or 2) a bar on the floor could be depressed. You had to do 1 of these things, the idea being that you prevent wrist or foot cramping by allowing the driver to alternate between the two. Failure to do either triggers an emergency brake.

It turns out that this driver was fat enough that when he had a heart attack, his leg was able to depress the pedal enough to hold the emergency system off. Thus, the dead-man's system never triggered with a whole lot of dead man in the driver's seat.

I can't quite remember the specifics of the system at Waterfall, but one method to combat this is to require the pedal to be held halfway between released and fully depressed. The idea being that a dead leg would fully depress the pedal so that would trigger a brake, and a fully released pedal would also trigger a brake. I don't know if they had that system but certainly that's one approach used in rail.

Either way, the problem is equally possible in cars. If you lose consciousness and your foot goes limp, a heavy enough leg will be able to hold the pedal down a bit depending on where it's positioned relative to the pedal and the leverage it has on the floor.

The other major system I'm familiar with for ensuring drivers are alive at the helm is called 'vigilance'. The way it works is that periodically, a light starts flashing on the dash and the driver has to acknowledge that. If they do not, a buzzer alarm starts sounding. If they still don't acknowledge it, the train brakes apply and the driver is assumed incapacitated. Let me tell you some stories of my involvement in it.

When we first started, we had a simple vigi system. Every 30 seconds or so (for example), the driver would press a button. Ok cool. Except that then drivers became so hard-wired to pressing the button every 30 seconds that we were having instances of drivers falling asleep/dozing off and still pressing the button right on every 30 seconds because it was so ingrained into them that it was literally a subconscious action.

So we introduced random-timing vigilance, where the time varies 30-60 seconds (for example) and you could only acknowledge it within a small period of time once the light started flashing. Again, drivers started falling asleep/semi asleep and would hit it as soon as the alarm buzzed, each and every time.

So we introduced random-timing, task-linked vigilance and that finally broke the back of the problem. Now, the driver has to press a button, or turn a knob, or do a number of different activities and they must do that randomly-chosen activity, at a randomly-chosen time, for them to acknowledge their consciousness. It was only at that point that we finally nailed out driver alertness.

See also.

Prestige

Curious why he would need to move to a more prestigious position? Most people realize by their 30s that prestige is a sucker's game; it's a way of inducing people to do things that aren't much fun and they wouldn't really want to do on their own, by lauding them with accolades from people they don't really care about.

Why is FedEx based in Memphis?

. . . we noticed that we also needed:
(1) A suitable, existing airport at the hub location.
(2) Good weather at the hub location, e.g., relatively little snow, fog, or rain.
(3) Access to good ramp space, that is, where to park and service the airplanes and sort the packages.
(4) Good labor supply, e.g., for the sort center.
(5) Relatively low cost of living to keep down prices.
(6) Friendly regulatory environment.
(7) Candidate airport not too busy, e.g., don't want arriving planes to have to circle a long time before being able to land.
(8) Airport with relatively little in cross winds and with more than one runway to pick from in case of winds.
(9) Runway altitude not too high, e.g., not high enough to restrict maximum total gross take off weight, e.g., rule out Denver.
(10) No tall obstacles, e.g., mountains, near the ends of the runways.
(11) Good supplies of jet fuel.
(12) Good access to roads for 18 wheel trucks for exchange of packages between trucks and planes, e.g., so that some parts could be trucked to the hub and stored there and shipped directly via the planes to customers that place orders, say, as late as 11 PM for delivery before 10 AM.
So, there were about three candidate locations, Memphis and, as I recall, Cincinnati and Kansas City.
The Memphis airport had some old WWII hangers next to the runway that FedEx could use for the sort center, aircraft maintenance, and HQ office space. Deal done -- it was Memphis.

Why etherpad joined Wave, and why it didn’t work out as expected

The decision to sell to Google was one of the toughest decisions I and my cofounders ever had to wrestle with in our lives. We were excited by the Wave vision though we saw the flaws in the product. The Wave team told us about how they wanted our help making wave simpler and more like etherpad, and we thought we could help with that, though in the end we were unsuccessful at making wave simpler. We were scared of Google as a competitor: they had more engineers and more money behind this project, yet they were running it much more like an independent startup than a normal big-company department. The Wave office was in Australia and had almost total autonomy. And finally, after 1.5 years of being on the brink of failure with AppJet, it was tempting to be able to declare our endeavor a success and provide a decent return to all our investors who had risked their money on us.

In the end, our decision to join Wave did not work out as we had hoped. The biggest lessons learned were that having more engineers and money behind a project can actually be more harmful than helpful, so we were wrong to be scared of Wave as a competitor for this reason. It seems obvious in hindsight, but at the time it wasn't. Second, I totally underestimated how hard it would be to iterate on the Wave codebase. I was used to rewriting major portions of software in a single all-nighter. Because of the software development process Wave was using, it was practically impossible to iterate on the product. I should have done more diligence on their specific software engineering processes, but instead I assumed because they seemed to be operating like a startup, that they would be able to iterate like a startup. A lot of the product problems were known to the whole Wave team, but we were crippled by a large complex codebase built on poor technical choices and a cumbersome engineering process that prevented fast iteration.

The accuracy of tech news

When I've had inside information about a story that later breaks in the tech press, I'm always shocked at how differently it's perceived by readers of the article vs. how I experienced it. Among startups & major feature launches I've been party to, I've seen: executives that flat-out say that they're not working on a product category when there's been a whole department devoted to it for a year; startups that were founded 1.5 years before the dates listed in Crunchbase/Wikipedia; reporters that count the number of people they meet in a visit and report that as the "team size", because the company refuses to release that info; funding rounds that never make it to the press; acquisitions that are reported as "for an undisclosed sum" but actually are less than the founders would've made if they'd taken a salaried job at the company; project start dates that are actually when the project was staffed up to its current size and ignore the year or so that a small team spent working on the problem (or the 3-4 years that other small teams spent working on the problem); and algorithms or other technologies that are widely reported as being the core of the company's success, but actually aren't even used by the company.

Self-destructing speakers from Dell

As the main developer of VLC, we know about this story since a long time, and this is just Dell putting crap components on their machine and blaming others. Any discussion was impossible with them. So let me explain a bit...

In this case, VLC just uses the Windows APIs (DirectSound), and sends signed integers of 16bits (s16) to the Windows Kernel.

VLC allows amplification of the INPUT above the sound that was decoded. This is just like replay gain, broken codecs, badly recorded files or post-amplification and can lead to saturation.

But this is exactly the same if you put your mp3 file through Audacity and increase it and play with WMP, or if you put a DirectShow filter that amplifies the volume after your codec output. For example, for a long time, VLC ac3 and mp3 codecs were too low (-6dB) compared to the reference output.

At worse, this will reduce the dynamics and saturate a lot, but this is not going to break your hardware.

VLC does not (and cannot) modify the OUTPUT volume to destroy the speakers. VLC is a Software using the OFFICIAL platforms APIs.

The issue here is that Dell sound cards output power (that can be approached by a factor of the quadratic of the amplitude) that Dell speakers cannot handle. Simply said, the sound card outputs at max 10W, and the speakers only can take 6W in, and neither their BIOS or drivers block this.

And as VLC is present on a lot of machines, it's simple to blame VLC. "Correlation does not mean causation" is something that seems too complex for cheap Dell support…

Learning on the job, startups vs. big companies

Working for someone else's startup, I learned how to quickly cobble solutions together. I learned about uncertainty and picking a direction regardless of whether you're sure it'll work. I learned that most startups fail, and that when they fail, the people who end up doing well are the ones who were looking out for their own interests all along. I learned a lot of basic technical skills, how to write code quickly and learn new APIs quickly and deploy software to multiple machines. I learned how quickly problems of scaling a development team crop up, and how early you should start investing in automation.

Working for Google, I learned how to fix problems once and for all and build that culture into the organization. I learned that even in successful companies, everything is temporary, and that great products are usually built through a lot of hard work by many people rather than great ah-ha insights. I learned how to architect systems for scale, and a lot of practices used for robust, high-availability, frequently-deployed systems. I learned the value of research and of spending a lot of time on a single important problem: many startups take a scattershot approach, trying one weekend hackathon after another and finding nobody wants any of them, while oftentimes there are opportunities that nobody has solved because nobody wants to put in the work. I learned how to work in teams and try to understand what other people want. I learned what problems are really painful for big organizations. I learned how to rigorously research the market and use data to make product decisions, rather than making decisions based on what seems best to one person.

We failed this person, what are we going to do differently?

Having been in on the company's leadership meetings where departures were noted with a simple 'regret yes/no' flag it was my experience that no single departure had any effect. Mass departures did, trends did, but one person never did, even when that person was a founder.

The rationalizations always put the issue back on the departing employee, "They were burned out", "They had lost their ability to be effective", "They have moved on", "They just haven't grown with the company" never was it "We failed this person, what are we going to do differently?"

AWS’s origin story

Anyway, the SOA effort was in full swing when I was there. It was a pain, and it was a mess because every team did things differently and every API was different and based on different assumptions and written in a different language.

But I want to correct the misperception that this led to AWS. It didn't. S3 was written by its own team, from scratch. At the time I was at Amazon, working on the retail site, none of Amazon.com was running on AWS. I know, when AWS was announced, with great fanfare, they said "the services that power Amazon.com can now power your business!" or words to that effect. This was a flat out lie. The only thing they shared was data centers and a standard hardware configuration. Even by the time I left, when AWS was running full steam ahead (and probably running Reddit already), none of Amazon.com was running on AWS, except for a few, small, experimental and relatively new projects. I'm sure more of it has been adopted now, but AWS was always a separate team (and a better managed one, from what I could see.)

Why is Windows so slow?

I (and others) have put a lot of effort into making the Linux Chrome build fast. Some examples are multiple new implementations of the build system (http://neugierig.org/software/chromium/notes/2011/02/ninja.h... ), experimentation with the gold linker (e.g. measuring and adjusting the still off-by-default thread flags https://groups.google.com/a/chromium.org/group/chromium-dev/... ) as well as digging into bugs in it, and other underdocumented things like 'thin' ar archives.

But it's also true that people who are more of Windows wizards than I am a Linux apprentice have worked on Chrome's Windows build. If you asked me the original question, I'd say the underlying problem is that on Windows all you have is what Microsoft gives you and you can't typically do better than that. For example, migrating the Chrome build off of Visual Studio would be a large undertaking, large enough that it's rarely considered. (Another way of phrasing this is it's the IDE problem: you get all of the IDE or you get nothing.)

When addressing the poor Windows performance people first bought SSDs, something that never even occurred to me ("your system has enough RAM that the kernel cache of the file system should be in memory anyway!"). But for whatever reason on the Linux side some Googlers saw fit to rewrite the Linux linker to make it twice as fast (this effort predated Chrome), and all Linux developers now get to benefit from that. Perhaps the difference is that when people write awesome tools for Windows or Mac they try to sell them rather than give them away.

Why is Windows so slow, an insider view

I'm a developer in Windows and contribute to the NT kernel. (Proof: the SHA1 hash of revision #102 of [Edit: filename redacted] is [Edit: hash redacted].) I'm posting through Tor for obvious reasons.

Windows is indeed slower than other operating systems in many scenarios, and the gap is worsening. The cause of the problem is social. There's almost none of the improvement for its own sake, for the sake of glory, that you see in the Linux world.

Granted, occasionally one sees naive people try to make things better. These people almost always fail. We can and do improve performance for specific scenarios that people with the ability to allocate resources believe impact business goals, but this work is Sisyphean. There's no formal or informal program of systemic performance improvement. We started caring about security because pre-SP3 Windows XP was an existential threat to the business. Our low performance is not an existential threat to the business.

See, component owners are generally openly hostile to outside patches: if you're a dev, accepting an outside patch makes your lead angry (due to the need to maintain this patch and to justify it in shiproom the unplanned design change), makes test angry (because test is on the hook for making sure the change doesn't break anything, and you just made work for them), and PM is angry (due to the schedule implications of code churn). There's just no incentive to accept changes from outside your own team. You can always find a reason to say "no", and you have very little incentive to say "yes".

What’s the probability of a successful exit by city?

See link for giant table :-).

The hiring crunch

Broken record: startups are also probably rejecting a lot of engineering candidates that would perform as well or better than anyone on their existing team, because tech industry hiring processes are folkloric and irrational.

Too long to excerpt. See the link!

Should you leave a bad job?

I am a 42-year-old very successful programmer who has been through a lot of situations in my career so far, many of them highly demotivating. And the best advice I have for you is to get out of what you are doing. Really. Even though you state that you are not in a position to do that, you really are. It is okay. You are free. Okay, you are helping your boyfriend's startup but what is the appropriate cost for this? Would he have you do it if he knew it was crushing your soul?

I don't use the phrase "crushing your soul" lightly. When it happens slowly, as it does in these cases, it is hard to see the scale of what is happening. But this is a very serious situation and if left unchecked it may damage the potential for you to do good work for the rest of your life.

The commenters who are warning about burnout are right. Burnout is a very serious situation. If you burn yourself out hard, it will be difficult to be effective at any future job you go to, even if it is ostensibly a wonderful job. Treat burnout like a physical injury. I burned myself out once and it took at least 12 years to regain full productivity. Don't do it.

  • More broadly, the best and most creative work comes from a root of joy and excitement. If you lose your ability to feel joy and excitement about programming-related things, you'll be unable to do the best work. Note that this issue is separate from and parallel to burnout! If you are burned out, you might still be able to feel the joy and excitement briefly at the start of a project/idea, but they will fade quickly as the reality of day-to-day work sets in. Alternatively, if you are not burned out but also do not have a sense of wonder, it is likely you will never get yourself started on the good work.

  • The earlier in your career it is now, the more important this time is for your development. Programmers learn by doing. If you put yourself into an environment where you are constantly challenged and are working at the top threshold of your ability, then after a few years have gone by, your skills will have increased tremendously. It is like going to intensively learn kung fu for a few years, or going into Navy SEAL training or something. But this isn't just a one-time constant increase. The faster you get things done, and the more thorough and error-free they are, the more ideas you can execute on, which means you will learn faster in the future too. Over the long term, programming skill is like compound interest. More now means a LOT more later. Less now means a LOT less later.

So if you are putting yourself into a position that is not really challenging, that is a bummer day in and day out, and you get things done slowly, you aren't just having a slow time now. You are bringing down that compound interest curve for the rest of your career. It is a serious problem. If I could go back to my early career I would mercilessly cut out all the shitty jobs I did (and there were many of them).

Creating change when politically unpopular

A small anecdote. An acquaintance related a story of fixing the 'drainage' in their back yard. They were trying to grow some plants that were sensitive to excessive moisture, and the plants were dying. Not watering them, watering them a little, didn't seem to change. They died. A professional gardener suggested that their problem was drainage. So they dug down about 3' (where the soil was very very wet) and tried to build in better drainage. As they were on the side of a hill, water table issues were not considered. It turned out their "problem" was that the water main that fed their house and the houses up the hill, was so pressurized at their property (because it had to maintain pressure at the top of the hill too) that the pipe seams were leaking and it was pumping gallons of water into the ground underneath their property. The problem wasn't their garden, the problem was that the city water supply was poorly designed.

While I have never been asked if I was an engineer on the phone, I have experienced similar things to Rachel in meetings and with regard to suggestions. Co-workers will create an internal assessment of your value and then respond based on that assessment. If they have written you off they will ignore you, if you prove their assessment wrong in a public forum they will attack you. These are management issues, and something which was sorely lacking in the stories.

If you are the "owner" of a meeting, and someone is trying to be heard and isn't, it is incumbent on you to let them be heard. By your position power as "the boss" you can naturally interrupt a discussion to collect more data from other members. It's also important to ask questions like "does anyone have any concerns?" to draw out people who have valid input but are too timid to share it.

In a highly political environment there are two ways to create change, one is through overt manipulation, which is to collect political power to yourself and then exert it to enact change, and the other is covert manipulation, which is to enact change subtly enough that the political organism doesn't react. (sometimes called "triggering the antibodies").

The problem with the latter is that if you help make positive change while keeping everyone not pissed off, no one attributes it to you (which is good for the change agent because if they knew the anti-bodies would react, but bad if your manager doesn't recognize it). I asked my manager what change he wanted to be 'true' yet he (or others) had been unsuccessful making true, he gave me one, and 18 months later that change was in place. He didn't believe that I was the one who had made the change. I suggested he pick a change he wanted to happen and not tell me, then in 18 months we could see if that one happened :-). But he also didn't understand enough about organizational dynamics to know that making change without having the source of that change point back at you was even possible.

How to get tech support from Google

Heavily relying on Google product? ✓
Hitting a dead-end with Google's customer service? ✓
Have an existing audience you can leverage to get some random Google employee's attention? ✓
Reach front page of Hacker News? ✓
Good news! You should have your problem fixed in 2-5 business days. The rest of us suckers relying on google services get to stare at our inboxes helplessly, waiting for a response to our support ticket (which will never come). I feel like it's almost a [rite] of passage these days to rely heavily on a Google service, only to have something go wrong and be left out in the cold.

Taking funding

IIRC PayPal was very similar - it was sold for $1.5B, but Max Levchin's share was only about $30M, and Elon Musk's was only about $100M. By comparison, many early Web 2.0 darlings (Del.icio.us, Blogger, Flickr) sold for only $20-40M, but their founders had only taken small seed rounds, and so the vast majority of the purchase price went to the founders. 75% of a $40M acquisition = 3% of a $1B acquisition.

Something for founders to think about when they're taking funding. If you look at the gigantic tech fortunes - Gates, Page/Brin, Omidyar, Bezos, Zuckerburg, Hewlett/Packard - they usually came from having a company that was already profitable or was already well down the hockey-stick user growth curve and had a clear path to monetization by the time they sought investment. Companies that fight tooth & nail for customers and need lots of outside capital to do it usually have much worse financial outcomes.

StackOverflow vs. Experts-Exchange

A lot of the people who were involved in some way in Experts-Exchange don't understand Stack Overflow.

The basic value flow of EE is that "experts" provide valuable "answers" for novices with questions. In that equation there's one person asking a question and one person writing an answer.

Stack Overflow recognizes that for every person who asks a question, 100 - 10,000 people will type that same question into Google and find an answer that has already been written. In our equation, we are a community of people writing answers that will be read by hundreds or thousands of people. Ours is a project more like wikipedia -- collaboratively creating a resource for the Internet at large.

Because that resource is provided by the community, it belongs to the community. That's why our data is freely available and licensed under creative commons. We did this specifically because of the negative experience we had with EE taking a community-generated resource and deciding to slap a paywall around it.

The attitude of many EE contributors, like Greg Young who calculates that he "worked" for half a year for free, is not shared by the 60,000 people who write answers on SO every month. When you talk to them you realize that on Stack Overflow, answering questions is about learning. It's about creating a permanent artifact to make the Internet better. It's about helping someone solve a problem in five minutes that would have taken them hours to solve on their own. It's not about working for free.

As soon as EE introduced the concept of money they forced everybody to think of their work on EE as just that -- work.

Making money from amazon bots

I saw that one of my old textbooks was selling for a nice price, so I listed it along with two other used copies. I priced it $1 cheaper than the lowest price offered, but within an hour both sellers had changed their prices to $.01 and $.02 cheaper than mine. I reduced it two times more by $1, and each time they beat my price by a cent or two. So what I did was reduce my price by a few dollars every hour for one day until everybody was priced under $5. Then I bought their books and changed my price back.

What running a business is like

While I like the sentiment here, I think the danger is that engineers might come to the mistaken conclusion that making pizzas is the primary limiting reagent to running a successful pizzeria. Running a successful pizzeria is more about schlepping to local hotels and leaving them 50 copies of your menu to put at the front desk, hiring drivers who will both deliver pizzas in a timely fashion and not embezzle your (razor-thin) profits while also costing next-to-nothing to employ, maintaining a kitchen in sufficient order to pass your local health inspector's annual visit (and dealing with 47 different pieces of paper related to that), being able to juggle priorities like "Do I take out a bank loan to build a new brick-oven, which will make the pizza taste better, in the knowledge that this will commit $3,000 of my cash flow every month for the next 3 years, or do I hire an extra cook?", sourcing ingredients such that they're available in quantity and quality every day for a fairly consistent price, setting prices such that they're locally competitive for your chosen clientele but generate a healthy gross margin for the business, understanding why a healthy gross margin really doesn't imply a healthy net margin and that the rent still needs to get paid, keeping good-enough records such that you know whether your business is dying before you can't make payroll and such that you can provide a reasonably accurate picture of accounts for the taxation authorities every year, balancing 50% off medium pizza promotions with the desire to not cannibalize the business of your regulars, etc etc, and by the way tomato sauce should be tangy but not sour and cheese should melt with just the faintest whisp of a crust on it.

Do you want to write software for a living? Google is hiring. Do you want to run a software business? Godspeed. Software is now 10% of your working life.

How to handle mismanagement?

The way I prefer to think of it is: it is not your job to protect people (particularly senior management) from the consequences of their decisions. Make your decisions in your own best interest; it is up to the organization to make sure that your interest aligns with theirs.

Google used to have a severe problem where code refactoring & maintenance was not rewarded in performance reviews while launches were highly regarded, which led to the effect of everybody trying to launch things as fast as possible and nobody cleaning up the messes left behind. Eventually launches started getting slowed down, Larry started asking "Why can't we have nice things?", and everybody responded "Because you've been paying us to rack up technical debt." As a result, teams were formed with the express purpose of code health & maintenance, those teams that were already working on those goals got more visibility, and refactoring contributions started counting for something in perf. Moreover, many ex-Googlers who were fed up with the situation went to Facebook and, I've heard, instituted a culture there where grungy engineering maintenance is valued by your peers.

None of this would've happened if people had just heroically fallen on their own sword and burnt out doing work nobody cared about. Sometimes it takes highly visible consequences before people with decision-making power realize there's a problem and start correcting it. If those consequences never happen, they'll keep believing it's not a problem and won't pay much attention to it.

Some downsides of immutability

Taking responsibility

The thing my grandfather taught me was that you live with all of your decisions for the rest of your life. When you make decisions which put other people at risk, you take on the risk that you are going to make someone's life harder, possibly much harder. What is perhaps even more important is that no amount of "I'm so sorry I did that ..." will ever undo it. Sometimes it's little things, like taking the last serving because you thought everyone had eaten; sometimes it's big things, like deciding that home is close enough and you're sober enough to get there safely. They are all decisions we make every day. And as I've gotten older the weight of ones I wish I had made differently doesn't get any lighter. You can lie to yourself about your choices, rationalize them, but that doesn't change them either.

I didn't understand any of that when I was younger.

People who aren’t exactly lying

It took me too long to figure this out. There are some people who truly, and passionately, believe something they say to you, but realistically they personally can't make it happen, so you can't really bank on that 'promise.'

I used to think those people were lying to take advantage, but as I've gotten older I have come to recognize that these 'yes' people get promoted a lot. And for some of them, they really do believe what they are saying.

As an engineer I've found that once I can 'calibrate' someone's 'yes-ness' I can then work with them, understanding that they only make 'wishful' commitments rather than 'reasoned' commitments.

So when someone, like Steve Jobs, says "we're going to make it an open standard!", my first question then is "Great, I've got your support in making this an open standard so I can count on you to wield your positional influence to aid me when folks line up against that effort, right?" If the answer to that question is no, then they were lying.

The difference is subtle of course but important. Steve clearly doesn't go to standards meetings and vote etc, but if Manager Bob gets push back from accounting that he's going to exceed his travel budget by sending 5 guys to the Open Video Chat Working Group which is championing the Facetime protocol as an open standard, then Manager Bob goes to Steve and says "I need your help here, these 5 guys are needed to argue this standard and keep it from being turned into a turd by the 5 guys from Google who are going to attend." and then Steve whips off a one liner to accounting that says "Get off this guy's back, we need this." Then it's all good. If on the other hand he says "We gotta save money, send one guy." well in that case I'm more sympathetic to the accusation of prevarication.

What makes engineers productive?

For those who work inside Google, it's well worth it to look at Jeff & Sanjay's commit history and code review dashboard. They aren't actually all that much more productive in terms of code written than a decent SWE3 who knows his codebase.

The reason they have a reputation as rockstars is that they can apply this productivity to things that really matter; they're able to pick out the really important parts of the problem and then focus their efforts there, so that the end result ends up being much more impactful than what the SWE3 wrote. The SWE3 may spend his time writing a bunch of unit tests that catch bugs that wouldn't really have happened anyway, or migrating from one system to another that isn't really a large improvement, or going down an architectural dead end that'll just have to be rewritten later. Jeff or Sanjay (or any of the other folks operating at that level) will spend their time running a proposed API by clients to ensure it meets their needs, or measuring the performance of subsystems so they fully understand their building blocks, or mentally simulating the operation of the system before building it so they can rapidly test out alternatives. They don't actually write more code than a junior developer (oftentimes, they write less), but the code they do write gives them more information, which helps ensure that they write the right code.

I feel like this point needs to be stressed a whole lot more than it is, as there's a whole mythology that's grown up around 10x developers that's not all that helpful. In particular, people need to realize that these developers rapidly become 1x developers (or worse) if you don't let them make their own architectural choices - the reason they're excellent in the first place is because they know how to determine if certain work is going to be useless and avoid doing it in the first place. If you dictate that they do it anyway, they're going to be just as slow as any other developer

Do the work, be a hero

I got the hero speech too, once. If anyone ever mentions the word "heroic" again and there isn't a burning building involved, I will start looking for new employment immediately. It seems that in our industry it is universally a code word for "We're about to exploit you because the project is understaffed and under budgeted for time and that is exactly as we planned it so you'd better cowboy up."

Maybe it is different if you're writing Quake, but I guarantee you the 43rd best selling game that year also had programmers "encouraged onwards" by tales of the glory that awaited after the death march.

Learning English from watching movies

I was once speaking to a good friend of mine here, in English.
"Do you want to go out for yakitori?"
"Go fuck yourself!"
"... switches to Japanese Have I recently done anything very major to offend you?"
"No, of course not."
"Oh, OK, I was worried. So that phrase, that's something you would only say under extreme distress when you had maximal desire to offend me, or I suppose you could use it jokingly between friends, but neither you nor I generally talk that way."
"I learned it from a movie. I thought it meant ‘No.’"

Being smart and getting things done

True story: I went to a talk given by one of the 'engineering elders' (these were low Emp# engineers who were considered quite successful and were to be emulated by the workers :-) This person stated that when they came to work at Google, they were given the XYZ system to work on (sadly I'm prevented from disclosing the actual system). They remarked how they spent a couple of days looking over the system, which was complicated and creaky; they couldn't figure it out, so they wrote a new system. Yup, and they committed that. This person is a coding God, are they not? (sarcasm) I asked what happened to the old system (I knew, but was interested in their perspective) and they said it was still around because a few things still used it, but (quite proudly) nearly everything else had moved to their new system.

So if you were reading carefully, this person created a new system to 'replace' an existing system which they didn't understand and got nearly everyone to move to the new system. That made them uber because they got something big to put on their internal resume, and a whole crapload of folks had to write new code to adapt from the old system to this new system, which imperfectly recreated the old system (remember they didn't understand the original), such that those parts of the system that relied on the more obscure bits had yet to be converted (because nobody understood either the dependent code or the old system apparently).

Was this person smart? Blindingly brilliant according to some of their peers. Did they get things done? Hell yes, they wrote the replacement for the XYZ system from scratch! One person? Can you imagine? Would I hire them? Not unless they were the last qualified person in my pool and I was out of time.

That anecdote encapsulates the dangerous side of smart people who get things done.

Public speaking tips

Some kids grow up on football. I grew up on public speaking (as behavioral therapy for a speech impediment, actually). If you want to get radically better in a hurry:

Too long to excerpt. See the link.

A reason a company can be a bad fit

I can relate to this, but I can also relate to the other side of the question. Sometimes it isn't me, it's you. Take someone who gets things done, and suddenly in your organization they aren't delivering. Could be them, but it could also be you.

I had this experience working at Google. I had a horrible time getting anything done there. Now I spent a bit of time evaluating that since it had never been the case in my career, up to that point, where I was unable to move the ball forward and I really wanted to understand that. The short answer was that Google had developed a number of people who spent much, if not all, of their time preventing change. It took me a while to figure out what motivated someone to be anti-change.

The fear was risk and safety. Folks moved around a lot and so you had people in charge of systems they didn't build, didn't understand all the moving parts of, and were apt to get a poor rating if they broke. When dealing with people in that situation one could either educate them and bring them along, or steam roll over them. Education takes time, and during that time the 'teacher' doesn't get anything done. This favors steamrolling evolutionarily :-)

So you can hire someone who gets stuff done, but if getting stuff done in your organization requires them to be an asshole, and they aren't up for that, well they aren't going to be nearly as successful as you would like them to be.

What working at Google is like

I can tell that this was written by an outsider, because it focuses on the perks and rehashes several cliches that have made their way into the popular media but aren't all that accurate.

Most Googlers will tell you that the best thing about working there is having the ability to work on really hard problems, with really smart coworkers, and lots of resources at your disposal. I remember asking my interviewer whether I could use things like Google's index if I had a cool 20% idea, and he was like "Sure. That's encouraged. Oftentimes I'll just grab 4000 or so machines and run a MapReduce to test out some hypothesis." My phone screener, when I asked him what it was like to work there, said "It's a place where really smart people go to be average," which has turned out to be both true and honestly one of the best things that I've gained from working there.

NSA vs. Black Hat

This entire event was a staged press op. Keith Alexander is a ~30 year veteran of SIGINT, electronic warfare, and intelligence, and a Four-Star US Army General --- which is a bigger deal than you probably think it is. He's a spy chief in the truest sense and a master politician. Anyone who thinks he walked into that conference hall in Caesars without a near perfect forecast of the outcome of the speech is kidding themselves.

Heckling Alexander played right into the strategy. It gave him an opportunity to look reasonable compared to his detractors, and, more generally (and alarmingly), to have the NSA look more reasonable compared to opponents of NSA surveillance. It allowed him to "split the vote" with audience reactions, getting people who probably have serious misgivings about NSA programs to applaud his calm and graceful handling of shouted insults; many of those people probably applauded simply to protest the hecklers, who after all were making it harder for them to follow what Alexander was trying to say.

There was no serious Q&A on offer at the keynote. The questions were pre-screened; all attendees could do was vote on them. There was no possibility that anything would come of this speech other than an effectively unchallenged full-throated defense of the NSA's programs.

Are deadlines necessary?

Interestingly one of the things that I found most amazing when I was working for Google was a nearly total inability to grasp the concept of 'deadline.' For so many years the company just shipped it by committing it to the release branch and having the code deploy over the course of a small number of weeks to the 'fleet'.

Sure there were 'processes', like "Canary it in some cluster and watch the results for a few weeks before turning it loose on the world." but being completely vertically integrated is a unique sort of situation.

Debugging on Windows vs. Linux

Being a very experienced game developer who tried to switch to Linux, I have posted about this before (and gotten flamed heavily by reactionary Linux people).

The main reason is that debugging is terrible on Linux. gdb is just bad to use, and all these IDEs that try to interface with gdb to "improve" it do it badly (mainly because gdb itself is not good at being interfaced with). Someone needs to nuke this site from orbit and build a new debugger from scratch, and provide a library-style API that IDEs can use to inspect executables in rich and subtle ways.

Productivity is crucial. If the lack of a reasonable debugging environment costs me even 5% of my productivity, that is too much, because games take so much work to make. At the end of a project, I just don't have 5% effort left any more. It requires everything. (But the current Linux situation is way more than a 5% productivity drain. I don't know exactly what it is, but if I were to guess, I would say it is something like 20%.)

What happens when you become rich?

What is interesting is that people don't even know they have a complex about money until they get "rich." I've watched many people, perhaps a hundred, go from "working to pay the bills" to "holy crap I can pay all my current and possibly my future bills with the money I now have." That doesn't include the guy who lived in our neighborhood and won the CA lottery one year.

It affects people in ways they don't expect. If it's sudden (like a lottery win or a sudden IPO surge), it can be difficult to process. But it is an important thing to realize that one is processing an exceptional event. Like having a loved one die or a spouse suddenly divorcing you.

Not everyone feels "guilty", not everyone feels "smug." A lot of millionaires and billionaires in the Bay Area are outwardly unchanged. But the bottom line is that the emotion comes from the cognitive dissonance between values and reality. What do you value? What is reality?

One woman I knew at Google was massively conflicted when she started work at Google. She always felt that she would help the homeless folks she saw, if she had more money than she needed. Upon becoming rich (on Google stock value), she found that she wanted to save the money she had for her future kids' education and needs. Was she a bad person? Before? After? Do your kids hate you if you give away their college education to the local foodbank? Do your peers hate you because you could close the current food gap at the foodbank and you don't?

Microsoft’s Skype acquisition

This is Microsoft's ICQ moment. Overpaying for a company at the moment when its core competency is becoming a commodity. Does anyone have the slightest bit of loyalty to Skype? Of course not. They're going to use whichever video chat comes built into their SmartPhone, tablet, computer, etc. They're going to use FaceBook's eventual video chat service or something Google offers. No one is going to actively seek out Skype when so many alternatives exist and are deeply integrated into the products/services they already use. Certainly no one is going to buy a Microsoft product simply because it has Skype integration. Who cares if it's FaceTime, FaceBook Video Chat, Google Video Chat? It's all the same to the user.

With $7B they should have just given away about 15 million Windows Mobile phones in the form of an epic PR stunt. It's not a bad product -- they just need to make people realize it exists. If they want to flush money down the toilet they might as well engage users in the process right?

What happened to Google Fiber?

I worked briefly on the Fiber team when it was very young (basically from 2 weeks before to 2 weeks after launch - I was on loan from Search specifically so that they could hit their launch goals). The bottleneck when I was there was local government regulations, and in fact Kansas City was chosen because it had a unified city/county/utility regulatory authority that was very favorable to Google. To lay fiber to the home, you either need rights-of-way on the utility poles (which are owned by Google's competitors) or you need permission to dig up streets (which requires a mess of permitting from the city government). In either case, the cable & phone companies were in very tight with local regulators, and so you had hostile gatekeepers whose approval you absolutely needed.

The technology was awesome (1G Internet and HDTV!), the software all worked great, and the economics of hiring contractors to lay the fiber itself actually worked out. The big problem was regulatory capture.

With Uber & AirBnB's success in hindsight, I'd say that the way to crack the ISP business is to provide your customers with the tools to break the law en masse. For example, you could imagine an ISP startup that basically says "Here's a box, a wire, and a map of other customers' locations. Plug into their jack, and if you can convince others to plug into yours, we'll give you a discount on your monthly bill based on how many you sign up." But Google in general is not willing to break laws - they'll go right up to the boundary of what the law allows, but if a regulatory agency says "No, you can't do that", they won't do it rather than fight the agency.

Indeed, Fiber is being phased out in favor of Google's acquisition of WebPass, which does basically exactly that but with wireless instead of fiber. WebPass only requires the building owner's consent, and leaves the city out of it.

What it's like to talk at Microsoft's TechEd

I've spoken at TechEds in the US and Europe, and been in the top 10 for attendee feedback twice.

I'd never speak at TechEd again, and I told Microsoft the same thing, same reasons. The event staff is overly demanding and inconsiderate of speaker time. They repeatedly dragged me into mandatory virtual and in-person meetings to cover inane details that should have been covered via email. They mandated the color of pants speakers wore. Just ridiculously micromanaged.

Why did Hertz suddenly become so flaky?

Hertz laid off nearly the entirety of their rank and file IT staff earlier this year.

In order to receive our severance, we were forced to train our IBM replacements, who were in India. Hertz's strategy of IBM and Austerity is the new SMT's solution for a balance sheet that's in shambles, yet they have rewarded themselves by increasing executive compensation 35% over the prior year, including a $6 million bonus to the CIO.

I personally landed in an Alphabet company, received a giant raise, and now I get to work on really amazing stuff, so I'm doing fine. But to this day I'm sad to think how our once-amazing Hertz team, staffed with really smart people, led by the best boss I ever had, and really driving the innovation at Hertz, was just thrown away like yesterday's garbage.

Before startups put clauses in contracts forbidding sales, they sometimes blocked sales via backchannel communications

Don't count on definitely being able to sell the stock to finance the taxes. I left after seven years in very good standing (I believed), but when I went to sell, the deal was shut down [1]. Luckily I had a backup plan and I was ok [2].

[1] Had a handshake deal with an investor in the company, then the investor went silent on me. When I followed up he said the deal was "just much too small." I reached out to the company for help, and they said they'd actually told him not to buy from me. I never would have known if they hadn't decided to tell me for some reason. The takeaway is that the markets for private company stock tend to be small, and the buyers care more about their relationships with the company than they do about having your shares, even if the stock terms allow them to buy (and they might not).

An Amazon pilot program designed to reduce the cost of interviewing

I took the first test just like the OP, the logical reasoning part seemed kind of irrelevant and a waste of time for me. That was nothing compared to the second online test.

The environment of the second test was like a scenario out of Black Mirror. Not only did they want to have the webcam and microphone on the entire time, I also had to install their custom software so the proctors could monitor my screen and control my computer. They opened up the macOS system preferences so they could disable all shortcuts to take screenshots, and they also manually closed all the background services I had running (even f.lux!).

Then they asked me to pick up my laptop and show them around my room with the webcam. They specifically asked to see the contents of my desk and the walls and ceiling of my room. I had some pencil and paper on my desk to use as scratch paper for the obvious reasons and they told me that wasn't allowed. Obviously that made me a little upset because I use it to sketch out examples and concepts. They also saw my phone on the desk and asked me to put it out of arm's reach.

After that they told me I couldn't leave the room until the 5 minute bathroom break allowed half-way through the test. I had forgotten to tell my roommate I was taking this test and he was making a bit of a ruckus playing L4D2 online (obviously a bit distracting). I asked the proctor if I could briefly leave the room to ask him to quiet down. They said I couldn't leave until the bathroom break so there was nothing I could do. Later on, I was busy thinking about a problem and had adjusted how I was sitting in my chair and moved my face slightly out of the camera's view. The proctor messaged me again telling me to move so they could see my entire face.

Amazon interviews, part 2

The first part of the interview was exactly like the linked experience. No coding questions, just reasoning. For the second part I had to use ProctorU instead of Proctorio. Personally I thought the experience was super weird but understandable (I'll get to that later): somebody watched me through my webcam the entire time with my microphone on. They needed to check my ID before the test. They needed me to show them the entire room I was in (which was my bedroom). My desktop computer was on behind my laptop so I turned off my computer (I don't remember if I offered to or if they asked me to) but they also asked me to cover my monitors up with something, which I thought was silly after I turned them off, so I covered them with a towel. They then used LogMeIn to remote into my machine so they could check running programs. I quit all my personal chat programs and pretty much only had the Chrome window running.

...

I didn't talk to a real person who actually worked at Amazon (by email or through webcam) until I received an offer.

What's getting acquired by Oracle like?

[M]y company got acquired by Oracle. We thought things would be OK. Nothing changed immediately. Slowly but surely they turned the screws. 5 year laptop replacement policy. You get the corporate standard laptop and you'll like it. Sales? Oh those guys can buy new Macs every two years, they get whatever they want. Then you understand where Software Engineers rank in the company hierarchy. Oracle took the average price of our product from $100k to $5 million for the same size deals. Our sales went from $5-7m to more than $40m with no increase in engineering headcount (team of 15). Didn't matter when bonus time came, we all got stack-ranked and some people got nothing. As a top performer I got a few options, worth maybe $5k.

Oracle exists to extract the maximum amount of money possible from the Fortune 1000. Everyone else can fuck off. Your impotent internet rage is meaningless. If it doesn't piss off the CTO of $X then it doesn't matter. If it gets that CTO to cut a bigger check then it will be embraced with extreme enthusiasm.

The culture wears down a lot (but not all) of the good people, who then leave. What's left is a lot of mediocrity and architecture astronauts. The more complex the product the better - it means extra consulting dollars!

My relative works at a business dependent on Micros. When Oracle announced the acquisition I told them to start on the backup plan immediately because Oracle was going to screw them sooner or later. A few years on and that is proving true: Oracle is slowly excising the Micros dealers and ISVs out of the picture, gobbling up all the revenue while hiking prices.

How do you avoid hiring developers who do negative work?

In practice, we have to face that our quest for more stringent hiring standards is not really selecting the best, but just selecting fewer people, in ways that might, or might not, have anything to do with being good at a job. Let's go through a few examples in my career:

A guy that was the most prolific developer I have ever seen: He'd rewrite entire subsystems over a weekend. The problem is that said subsystems were not necessarily better than when they started, trading bugs for bugs, and anyone that wanted to work on them would have to relearn that programmer's idiosyncrasies of the week. He easily cost his project 12 man/months of work in 4 months, the length of time it took for management to realize that he had to be let go.

A company's big UI framework was quite broken, and a new developer came in and fixed it. Great, right? Well, he was handed code review veto over changes to the framework, and his standards and his demeanor made people stop contributing after two or three attempts. In practice, the framework died as people found it antiquated, and they decided to build a new one: Well, the same developer was tasked with building the new framework, which was made mandatory for 200+ developers to use. Total contribution was clearly negative.

A developer that was very fast, and wrote working code, had been managing a rather large 500K line codebase, and received some developers as help. He didn't believe in internal documentation or on keeping interfaces stable. He also didn't believe in writing code that wasn't brittle, or in unit tests: Code changes from the new developers often broke things, the veteran would come in, fix everything in the middle of the emergency, and look absolutely great, while all the other developers looked to management as if they were incompetent. They were not, however: they were quite successful when moved to other teams. It just happens that the original developer made sure nobody else could touch anything. Eventually, the experiment was retried after the original developer was sent to do other things. It took a few months, but the new replacement team managed to modularize the code, and new people could actually modify the codebase productively.

All of those negative value developers could probably be very valuable in very specific conditions, and they'd look just fine in a tough job interview. They were still terrible hires. In my experience, if anything, a harder process that demands that people appear smarter or work faster in an interview has the opposite effect of what I'd want: It ends up selecting for people that think less and do more quickly, building debt faster.

My favorite developers ever all do badly in your typical stringent Silicon Valley interview. They work slower, do more thinking, and consider every line of code they write technical debt. They won't have a million algorithms memorized: They'll go look at sources more often than not, and will spend a lot of time on tests that might as well be documentation. Very few of those traits are positive in an interview, but I think they are vital in creating good teams, yet few select for them at all.

Linux and the demise of Solaris

I worked on Solaris for over a decade, and for a while it was usually a better choice than Linux, especially due to price/performance (which includes how many instances it takes to run a given workload). It was worth fighting for, and I fought hard. But Linux has now become technically better in just about every way. Out-of-box performance, tuned performance, observability tools, reliability (on patched LTS), scheduling, networking (including TCP feature support), driver support, application support, processor support, debuggers, syscall features, etc. Last I checked, ZFS worked better on Solaris than Linux, but it's an area where Linux has been catching up. I have little hope that Solaris will ever catch up to Linux, and I have even less hope for illumos: Linux now has around 1,000 monthly contributors, whereas illumos has about 15.

In addition to technology advantages, Linux has a community and workforce that's orders of magnitude larger, staff with invested skills (re-education is part of a TCO calculation), companies with invested infrastructure (rewriting automation scripts is also part of TCO), and also much better future employment prospects (a factor that can influence people wanting to work at your company on that OS). Even with my considerable and well-known Solaris expertise, the employment prospects with Solaris are bleak and getting worse every year. With my Linux skills, I can work at awesome companies like Netflix (which I highly recommend), Facebook, Google, SpaceX, etc.

Large technology-focused companies, like Netflix, Facebook, and Google, have the expertise and appetite to make a technology-based OS decision. We have dedicated teams for the OS and kernel with deep expertise. On Netflix's OS team, there are three staff who previously worked at Sun Microsystems and have more Solaris expertise than they do Linux expertise, and I believe you'll find similar people at Facebook and Google as well. And we are choosing Linux.

The choice of an OS includes many factors. If an OS came along that was better, we'd start with a thorough internal investigation, involving microbenchmarks (including an automated suite I wrote), macrobenchmarks (depending on the expected gains), and production testing using canaries. We'd be able to come up with a rough estimate of the cost savings based on price/performance. Most microservices we have run hot in user-level applications (think 99% user time), not the kernel, so it's difficult to find large gains from the OS or kernel. Gains are more likely to come from off-CPU activities, like task scheduling and TCP congestion, and indirect, like NUMA memory placement: all areas where Linux is leading. It would be very difficult to find a large gain by changing the kernel from Linux to something else. Just based on CPU cycles, the target that should have the most attention is Java, not the OS. But let's say that somehow we did find an OS with a significant enough gain: we'd then look at the cost to switch, including retraining staff, rewriting automation software, and how quickly we could find help to resolve issues as they came up. Linux is so widely used that there's a good chance someone else has found an issue, had it fixed in a certain version or documented a workaround.

What's left where Solaris/SmartOS/illumos is better? 1. There's more marketing of the features and people. Linux develops great technologies and has some highly skilled kernel engineers, but I haven't seen any serious effort to market these. Why does Linux need to? And 2. Enterprise support. Large enterprise companies where technology is not their focus (e.g., a breakfast cereal company) and who want to outsource these decisions to companies like Oracle and IBM. Oracle still has Solaris enterprise support that I believe is very competitive compared to Linux offerings.

Why wasn't RethinkDB more successful?

I'd argue that where RethinkDB fell down is on a step you don't list, "Understand the context of the problem", which you'd ideally do before figuring out how many people it's a problem for. Their initial idea was a MySQL storage engine for SSDs - the environmental change was that SSD prices were falling rapidly, SSDs have wildly different performance characteristics from disk, and so they figured there was an opportunity to catch the next wave. Only problem is that the biggest corporate buyers of SSDs are gigantic tech companies (eg. Google, Amazon) with large amounts of proprietary software, and so a generic MySQL storage engine isn't going to be useful to them anyway.

Unfortunately they'd already taken funding, built a team, and written a lot of code by the time they found that out, and there's only so far you can pivot when you have an ecosystem like that.

On falsehoods programmers believe about X

This unfortunately follows the conventions of the genre called "Falsehood programmers believe about X": ...

I honestly think this genre is horrible and counterproductive, even though the writer's intentions are good. It gives no examples, no explanations, no guidelines for proper implementations - just a list of condescending gotchas, showing off the superior intellect and perception of the author.

What does it mean if a company rescinds an offer because you tried to negotiate?

It happens sometimes. Usually it's because of one of two situations:

1) The company was on the fence about wanting you anyway, and negotiating takes you from the "maybe kinda sorta want to work with" to the "don't want to work with" pile.

2) The company is looking for people who don't question authority and don't stick up for their own interests.

Both of these are red flags. It's not really a matter of ethics - they're completely within their rights to withdraw an offer for any reason - but it's a matter of "Would you really want to work there anyway?" For both corporations and individuals, it usually leads to a smoother life if you only surround yourself with people who really value you.

HN comments

I feel like this is every HN discussion about "rates---comma---raising them": a mean-spirited attempt to convince the audience on the site that high rates aren't really possible, because if they were, the person telling you they're possible would be wealthy beyond the dreams of avarice. Once again: Patrick is just offering a more refined and savvy version of advice that me and my Matasano friends gave him, and our outcomes are part of the record of a reasonably large public company.

This, by the way, is why I'll never write this kind of end-of-year wrap-up post (and, for the same reasons, why I'll never open source code unless I absolutely have to). It's also a big part of what I'm trying to get my hands around for the Starfighter wrap-up post. When we started Starfighter, everyone said "you're going to have such an amazing time because of all the HN credibility you have". But pretty much every time Starfighter actually came up on HN, I just wanted to hide under a rock. Even when the site is civil, it's still committed to grinding away any joy you take either in accomplishing something neat or even in just sharing something interesting you learned. You could sort of understand an atavistic urge to shit all over someone sharing an interesting experience that was pleasant or impressive. There's a bad Morrissey song about that. But look what happens when you share an interesting story that obviously involved significant unpleasantness and an honest accounting of one's limitations: a giant thread full of people piling on to question your motives and life choices. You can't win.

On the journalistic integrity of Quartz

I was the first person to be interviewed by this journalist (Michael Thomas @curious_founder). He approached me on Twitter to ask questions about digital nomad and remote work life (as I founded Nomad List and have been doing it for years).

I told him it'd be great to see more honest depictions as most articles are heavily idealized making it sound all great, when it's not necessarily. It's ups and downs (just like regular life really).

What happened next may surprise you. He wrote a hit piece on me, changing the entire story I told him over Skype into a clickbait article about how digital nomadism doesn't work and how one of the main people doing it for a while (in public) even settled down and gave up altogether.

I didn't settle down. I spent the summer in Amsterdam. Cause you know, it's a nice place! But he needed to say this to make a polarized hit piece with an angle. And that piece became viral. Resulting in me having to tell people daily that I didn't and getting lots of flack. You may understand it doesn't help if your entire startup is about something and a journalist writes a viral piece how you yourself don't even believe in that anymore. I contacted the journalist and Quartz but they didn't change a thing.

It's great this meant his journalistic breakthrough but it hurt me in the process.

I'd argue journalists like this are the whole problem we have these days. The articles they write can't be balanced because they need to get pageviews. Every potential to write something interesting quickly turns into clickbait. It turned me off from being interviewed ever again. Doing my own PR by posting in comment sections of Hacker News or Reddit seems like a better idea (also see how Elon Musk does exactly this, seems smarter).

How did Click and Clack always manage to solve the problem?

Hope this doesn't ruin it for you, but I knew someone who had a problem presented on the show. She called in and reached an answering machine. Someone called her and qualified the problem. Then one of the brothers called and talked to her for a while. Then a few weeks later (there might have been some more calls, I don't know) both brothers called her and talked to her for a while. Her parts of that last call were edited into the radio show so it sounded like she had called and they just figured out the answer on the spot.

Why are so many people down on blockchain?

Blockchain is the world's worst database, created entirely to maintain the reputations of venture capital firms who injected hundreds of millions of dollars into a technology whose core defining insight was "You can improve on a Ponzi scam by making it self-organizing and distributed; that gets vastly more distribution, reduces the single point of failure, and makes it censorship-resistant."

That's more robust than I usually phrase things on HN, but you did ask. In slightly more detail:

Databases are wonderful things. We have a number which are actually employed in production, at a variety of institutions. They run the world. Meaningful applications run on top of Postgres, MySQL, Oracle, etc etc.

No meaningful applications run on top of "blockchain", because it is a marketing term. You cannot install blockchain just like you cannot install database. (Database sounds much cooler without the definite article, too.) If you pick a particular instantiation of a blockchain-style database, it is a horrible, horrible database.

Can I pick on Bitcoin? Let me pick on Bitcoin. Bitcoin is claimed to be a global financial network and ready for production right now. Bitcoin cannot sustain 5 transactions per second, worldwide.

You might be sensibly interested in Bitcoin governance if, for some reason, you wanted to use Bitcoin. Bitcoin is a software artifact; it matters to users who makes changes to it and by what process. (Bitcoin is a software artifact, not a protocol, even though the Bitcoin community will tell you differently. There is a single C++ codebase which matters. It is essentially impossible to interoperate with Bitcoin without bugs-and-all replicating that codebase.) Bitcoin governance is captured by approximately ~5 people. This is a robust claim and requires extraordinary evidence.

Ordinary evidence would be pointing you, in a handwavy fashion, to the depth of acrimony with regard to raising the block size, which would let Bitcoin scale to the commanding heights of 10 or, nay, 100 transactions per second worldwide.

Extraordinary evidence might be pointing you to the time where the entire Bitcoin network was de-facto shut down based on the consensus of N people in an IRC channel. c.f. https://news.ycombinator.com/item?id=9320989 This was back in 2013. Long story short: a software update went awry so they rolled back global state by a few hours by getting the right two people to agree to it on a Skype call.

But let's get back to discussing that sole technical artifact. Bitcoin has a higher cost-to-value ratio than almost any technology conceivable; the cost to date is the market capitalization of Bitcoin. Because Bitcoin enters circulation through a seigniorage mechanism, every Bitcoin existing was minted as compensation for "securing the integrity of the blockchain" (by doing computationally expensive makework).

This cost is high. Today, routine maintenance of the Bitcoin network will cost the network approximately $1.5 million. That's on the order of $3 per write on a maximum committed capacity basis. It will cost another $1.5 million tomorrow, exchange rate depending.

(Bitcoin has successfully shifted much of the cost of operating its database to speculators rather than people who actually use Bitcoin for transaction processing. That game of musical chairs has gone on for a while.)

Bitcoin has some properties which one does not associate with many databases. One is that write acknowledgments average 5 minutes. Another is that they can stop, non-deterministically, for more than an hour at a time, worldwide, for all users simultaneously. This behavior is by design.
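
As a quick sanity check on the cost-per-write figure a few paragraphs up, here is the arithmetic using the comment's own numbers (roughly 5 transactions per second of capacity and about $1.5 million per day paid out to miners); both numbers move around with the exchange rate, so treat this as an order-of-magnitude sketch:

```python
# Order-of-magnitude check using the figures quoted in the comment above
# (~5 tx/s of capacity, ~$1.5M/day paid out via new issuance).
tx_per_second = 5
daily_security_spend = 1_500_000  # USD per day

max_writes_per_day = tx_per_second * 60 * 60 * 24  # ~432,000 writes/day
cost_per_write = daily_security_spend / max_writes_per_day

print(f"{max_writes_per_day:,} writes/day at most, "
      f"~${cost_per_write:.2f} per write at full capacity")
# -> 432,000 writes/day at most, ~$3.47 per write at full capacity
```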

How big is the proprietary database market?

  1. The database market is NOT closed. In fact, we are in a database boom. Since 2009 (the year RethinkDB was founded), there have been over 100 production grade databases released in the market. These span document stores, Key/Value, time series, MPP, relational, in-memory, and the ever increasing "multi model databases."

  2. Since 2009, over $600 MILLION dollars (publicly announced) has been invested in these database companies (RethinkDB represents 12.2M or about 2%). That's aside from money invested in the bigger established databases.

  3. Almost all of the companies that have raised funding in this period generate revenue from one or more of the following areas:

a) exclusive hosting (meaning AWS et al. do not offer this product) b) multi-node/cluster support c) product enhancements d) enterprise support

Looking at each of the above revenue paths as executed by RethinkDB:

a) RethinkDB never offered a hosted solution. Compose offered a hosted solution in October of 2014. b) RethinkDB didn't support true high availability until the 2.1 release in August 2015. It was released as open source and to my knowledge was not monetized. c/d) I've heard that an enterprise version of RethinkDB was offered near the end. Enterprise Support is, empirically, a bad approach for a venture backed company. I don't know that RethinkDB ever took this avenue seriously. Correct me if I am wrong.

A model that is not popular among RECENT databases but is popular among traditional databases is a standard licensing model (e.g. Oracle, Microsoft SQL Server). Even these are becoming more rare with the advent of A, but never underestimate the licensing market.

Again, this is complete conjecture, but I believe RethinkDB failed for a few reasons:

1) not pursuing one of the above revenue models early enough. This has serious effects on the order of the feature enhancements (for instance, the HA released in 2015 could have been released earlier at a premium or to help facilitate a hosted solution).

2) incorrect priority of enhancements:

2a) general database performance never reached the point it needed to. RethinkDB struggled with both write and read performance well into 2015. There was no clear value add in this area compared to many write or read focused databases released around this time.

2b) lack of (proper) High Availability for too long.

2c) ReQL was not necessary - most developers use ORMs when interacting with SQL. When you venture into analytical queries, we actually seem to make great effort to provide SQL: look at the number of projects or companies that exist to bring SQL to databases and filesystems that don't support it (Hive, Pig, Slam Data, etc).

2d) push notifications. This has not been demonstrated to be a clear market need yet. There are a small handful of companies promoting development stacks around this, but no database company is doing the same.

2e) lack of focus. What was RethinkDB REALLY good at? It pushed ReQL and joins at first, but it lacked HA until 2015 and struggled with high write or read loads into 2015. It then started to focus on real time notifications. Again, there just aren't many databases focusing on these areas.

My final thought is that RethinkDB didn't raise enough capital. Perhaps this is because of previous points, but without capital, the above can't be corrected. RethinkDB actually raised far less money than basically any other venture backed company in this space during this time.

Again, I've never run a database company so my thoughts are just those of an outsider. However, I am the founder of a company that provides database integration products so I monitor this industry like a hawk. I simply don't agree that the database market has been "captured."

I expect to see even bigger growth in databases in the future. I'm happy to share my thoughts about what types of databases are working and where the market needs solutions. Additionally, companies are increasingly relying on third-party cloud services for data they previously captured themselves. Anything from payment processing, order fulfillment, traffic analytics, etc. is now being handled by someone else.

A Google Maps employee's opinion on the Google Maps pricing change

I was a googler working on Google maps at the time of the API self immolation.

There were strong complaints from within about the price changes. Obviously everyone couldn't believe what was being planned, and there were countless spreadsheets and reports and SQL queries showing how this was going to shit all over a lot of customers that we'd be guaranteed to lose to a competitor.

Management didn't give a shit.

I don't know what the rationale was apart from some vague claim about "charging for value". A lot of users of the API apparently were basically under the free limits or only spending less than 100 USD on API usage, so I can kind of understand the line of thought, but I still think they went way too far.

I don't know what happened to the architects of the plan. I presume promo.

Edit: I should add that this was not a knee-jerk thing where some exec just woke up one day with an idea in their dreams. It was a planned change that took many months to prepare for, with endless reporting and so on.

???

How did HN get the commenter base that it has? If you read HN, on any given week, there are at least as many good, substantial comments as there are posts. This is different from every other modern public news aggregator I can find out there, and I don’t really know what the ingredients are that make HN successful.

For the last couple years (ish?), the moderation regime has been really active in trying to get a good mix of stories on the front page and in tamping down on gratuitously mean comments. But there was a period of years where the moderation could be described as sparse, arbitrary, and capricious, and while there are fewer “bad” comments now, it doesn’t seem like good moderation actually generates more “good” comments.

The ranking scheme seems to penalize posts that have a lot of comments on the theory that flamebait topics will draw a lot of comments. That sometimes prematurely buries stories with good discussion, but much more often, it buries stories that draw pointless flamewars. If you just read HN, it’s hard to see the effect, but if you look at forums that use comments as a positive factor in ranking, the difference is dramatic -- those other forums that boost topics with many comments (presumably on theory that vigorous discussion should be highlighted) often have content-free flame wars pinned at the top for long periods of time.

Something else that HN does that’s different from most forums is that user flags are weighted very heavily. On reddit, a downvote only cancels out an upvote, which means that flamebait topics that draw a lot of upvotes, like “platform X is cancer” or “Y is doing some horrible thing”, often get pinned to the top of r/programming for an entire day, since the number of people who don’t want to see that is drowned out by the number of people who upvote outrageous stories. If you read the comments for one of the "X is cancer" posts on r/programming, the top comment will almost inevitably be that the post has no content, that the author of the post is a troll who never posts anything with content, and that we'd be better off with less flamebait by the author at the top of r/programming. But the people who will upvote outrage porn outnumber the people who will downvote it, so that kind of stuff dominates aggregators that use raw votes for ranking. Having flamebait drop off the front page quickly is significant, but it doesn’t seem sufficient to explain why there are so many more well-informed comments on HN than on other forums with roughly similar traffic.
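
As a concrete illustration of those two mechanisms, here's a toy scoring function. This is not HN's actual ranking code (which isn't public in this form); the gravity exponent, the shape of the comment penalty, and the flag weight are all assumptions. The only point is the shape: comment-heavy stories get damped, and a flag costs a story far more than a single opposing vote would.

```python
# Toy front-page score, NOT HN's real algorithm. It only illustrates the two
# mechanisms described above: damping comment-heavy stories and weighting
# user flags much more heavily than ordinary votes. All constants are made up.

def toy_rank(points, comments, flags, age_hours, gravity=1.8, flag_weight=5.0):
    decay = (age_hours + 2) ** gravity
    score = (points - 1) / decay
    if comments > points:
        # Rough proxy for "flamebait draws arguments rather than upvotes".
        score *= points / comments
    # A flag hurts far more than a single opposing vote would.
    score -= flag_weight * flags / decay
    return score

# A comment-heavy, flagged story sinks well below a quieter one with the
# same number of upvotes.
print(toy_rank(points=120, comments=400, flags=10, age_hours=3))  # ~ -0.8
print(toy_rank(points=120, comments=60, flags=0, age_hours=3))    # ~ 6.6
```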

Maybe the answer is that people come to HN for the same reason people come to Silicon Valley -- despite all the downsides, there’s a relatively large concentration of experts there across a wide variety of CS-related disciplines. If that’s true, and it’s a combination of path dependence and network effects, that’s pretty depressing since that’s not replicable.

If you liked this curated list of comments, you'll probably also like this list of books and this list of blogs.

This is part of an experiment where I write up thoughts quickly, without proofing or editing. Apologies if this is less clear than a normal post. This is probably going to be the last post like this, for now, since, by quickly writing up a post whenever I have something that can be written up quickly, I'm building up a backlog of post ideas that require re-reading the literature in an area or running experiments.

P.S. Please suggest other good comments! By their nature, HN comments are much less discoverable than stories, so there are a lot of great comments that I haven't seen.


  1. if you’re one of those people, you’ve probably already thought of this, but maybe consider, at the margin, blogging more and commenting on HN less? As a result of writing this post, I looked through my old HN comments and noticed that I wrote this comment three years ago, which is another way of stating the second half of this post I wrote recently. Comparing the two, I think the HN comment is substantially better written. But, like most HN comments, it got some traffic while the story was still current and is now buried, and AFAICT, nothing really happened as a result of the comment. The blog post, despite being “worse”, has gotten some people to contact me personally, and I’ve had some good discussions about that and other topics as a result. Additionally, people occasionally contact me about older posts I’ve written; I continue to get interesting stuff in my inbox as a result of having written posts years ago. Writing your comment up as a blog post will almost certainly provide more value to you, and if it gets posted to HN, it will probably provide no less value to HN.

    Steve Yegge has a pretty good list of reasons why you should blog that I won’t recapitulate here. And if you’re writing substantial comments on HN, you’re already doing basically everything you’d need to do to write a blog except that you’re putting the text into a little box on HN instead of into a static site generator or some hosted blogging service. BTW, I’m not just saying this for your benefit: my selfish reason for writing this appeal is that I really want to read the Nathan Kurz blog on low-level optimizations, the Jonathan Tang blog on what it’s like to work at startups vs. big companies, etc.


Programming book recommendations and anti-recommendations

2016-10-16 16:06:34

There are a lot of “12 CS books every programmer must read” lists floating around out there. That's nonsense. The field is too broad for almost any topic to be required reading for all programmers, and even if a topic is that important, people's learning preferences differ too much for any book on that topic to be the best book on the topic for all people.

This is a list of topics and books where I've read the book, am familiar enough with the topic to say what you might get out of learning more about the topic, and have read other books and can say why you'd want to read one book over another.

Algorithms / Data Structures / Complexity

Why should you care? Well, there's the pragmatic argument: even if you never use this stuff in your job, most of the best paying companies will quiz you on this stuff in interviews. On the non-bullshit side of things, I find algorithms to be useful in the same way I find math to be useful. The probability of any particular algorithm being useful for any particular problem is low, but having a general picture of what kinds of problems are solved problems, what kinds of problems are intractable, and when approximations will be effective, is often useful.

McDowell; Cracking the Coding Interview

Some problems and solutions, with explanations, matching the level of questions you see in entry-level interviews at Google, Facebook, Microsoft, etc. I usually recommend this book to people who want to pass interviews but not really learn about algorithms. It has just enough to get by, but doesn't really teach you the why behind anything. If you want to actually learn about algorithms and data structures, see below.

Dasgupta, Papadimitriou, and Vazirani; Algorithms

Everything about this book seems perfect to me. It breaks up algorithms into classes (e.g., divide and conquer or greedy), and teaches you how to recognize what kind of algorithm should be used to solve a particular problem. It has a good selection of topics for an intro book, it's the right length to read over a few weekends, and it has exercises that are appropriate for an intro book. Additionally, it has sub-questions in the middle of chapters to make you reflect on non-obvious ideas to make sure you don't miss anything.

I know some folks don't like it because it's relatively math-y/proof focused. If that's you, you'll probably prefer Skiena.

Skiena; The Algorithm Design Manual

The longer, more comprehensive, more practical, less math-y version of Dasgupta. It's similar in that it attempts to teach you how to identify problems, use the correct algorithm, and give a clear explanation of the algorithm. The book is well motivated with “war stories” that show the impact of algorithms in real world programming.

CLRS; Introduction to Algorithms

This book somehow manages to make it into half of these “N books all programmers must read” lists despite being so comprehensive and rigorous that almost no practitioners actually read the entire thing. It's great as a textbook for an algorithms class, where you get a selection of topics. As a class textbook, it's a nice bonus that it has exercises that are hard enough that they can be used for graduate level classes (about half the exercises from my grad level algorithms class were pulled from CLRS, and the other half were from Kleinberg & Tardos), but this is wildly impractical as a standalone introduction for most people.

Just for example, there's an entire chapter on Van Emde Boas trees. They're really neat -- it's a little surprising that a balanced-tree-like structure with O(lg lg n) insert, delete, as well as find, successor, and predecessor is possible, but a first introduction to algorithms shouldn't include Van Emde Boas trees.
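
For the curious, here's a minimal sketch of the idea in Python (insert, membership, and successor only; delete, which is the fiddliest operation, is omitted). It follows the standard CLRS construction, assumes the universe size is a power of two, and isn't remotely tuned for practical use:

```python
# Minimal van Emde Boas tree sketch over the universe {0, ..., u-1}, following
# the standard CLRS construction. Assumes u is a power of two. Insert, member,
# and successor only; delete is omitted to keep the sketch short.

class VEB:
    def __init__(self, u):
        self.u = u
        self.min = None  # the minimum lives here and is not stored in a cluster
        self.max = None
        if u > 2:
            bits = u.bit_length() - 1
            self.lo_sqrt = 1 << (bits // 2)         # size of each cluster
            self.hi_sqrt = 1 << (bits - bits // 2)  # number of clusters
            self.summary = None                     # VEB(hi_sqrt), built lazily
            self.clusters = {}                      # index -> VEB(lo_sqrt), lazy

    def _high(self, x): return x // self.lo_sqrt
    def _low(self, x): return x % self.lo_sqrt
    def _index(self, h, l): return h * self.lo_sqrt + l

    def insert(self, x):
        if self.min is None:
            self.min = self.max = x
            return
        if x < self.min:
            x, self.min = self.min, x  # push the old min down into a cluster
        if self.u > 2:
            h, l = self._high(x), self._low(x)
            cluster = self.clusters.get(h)
            if cluster is None:
                cluster = self.clusters[h] = VEB(self.lo_sqrt)
            if cluster.min is None:
                if self.summary is None:
                    self.summary = VEB(self.hi_sqrt)
                self.summary.insert(h)  # only one non-trivial recursion per level
                cluster.min = cluster.max = l
            else:
                cluster.insert(l)
        if x > self.max:
            self.max = x

    def member(self, x):
        if x == self.min or x == self.max:
            return True
        if self.u == 2:
            return False
        cluster = self.clusters.get(self._high(x))
        return cluster is not None and cluster.member(self._low(x))

    def successor(self, x):
        if self.u == 2:
            return 1 if x == 0 and self.max == 1 else None
        if self.min is not None and x < self.min:
            return self.min
        h, l = self._high(x), self._low(x)
        cluster = self.clusters.get(h)
        if cluster is not None and cluster.max is not None and l < cluster.max:
            return self._index(h, cluster.successor(l))
        if self.summary is None:
            return None
        nxt = self.summary.successor(h)
        if nxt is None:
            return None
        return self._index(nxt, self.clusters[nxt].min)

v = VEB(1 << 16)
for k in (3, 9, 1000, 40000):
    v.insert(k)
print(v.member(9), v.successor(9), v.successor(1000))  # True 1000 40000
```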

Kleinberg & Tardos; Algorithm Design

Same comments as for CLRS -- it's widely recommended as an introductory book even though it doesn't make sense as an introductory book. Personally, I found the exposition in Kleinberg to be much easier to follow than in CLRS, but plenty of people find the opposite.

Demaine; Advanced Data Structures

This is a set of lectures and notes and not a book, but if you want a coherent (but not intractably comprehensive) set of material on data structures that you're unlikely to see in most undergraduate courses, this is great. The notes aren't designed to be standalone, so you'll want to watch the videos if you haven't already seen this material.

Okasaki; Purely Functional Data Structures

Fun to work through, but, unlike the other algorithms and data structures books, I've yet to be able to apply anything from this book to a problem domain where performance really matters.

For a couple years after I read this, when someone would tell me that it's not that hard to reason about the performance of purely functional lazy data structures, I'd ask them about part of a proof that stumped me in this book. I'm not talking about some obscure super hard exercise, either. I'm talking about something that's in the main body of the text that was considered too obvious to the author to explain. No one could explain it. Reasoning about this kind of thing is harder than people often claim.

Dominus; Higher Order Perl

A gentle introduction to functional programming that happens to use Perl. You could probably work through this book just as easily in Python or Ruby.

If you keep up with what's trendy, this book might seem a bit dated today, but only because so many of the ideas have become mainstream. If you're wondering why you should care about this "functional programming" thing people keep talking about, and some of the slogans you hear don't speak to you or are even off-putting (types are propositions, it's great because it's math, etc.), give this book a chance.

Levitin; Algorithms

I ordered this off Amazon after seeing these two blurbs: “Other learning-enhancement features include chapter summaries, hints to the exercises, and a detailed solution manual.” and “Student learning is further supported by exercise hints and chapter summaries.” One of these blurbs is even printed on the book itself, but after getting the book, the only self-study resources I could find were some Yahoo Answers posts asking where you could find hints or solutions.

I ended up picking up Dasgupta instead, which was available off an author's website for free.

Mitzenmacher & Upfal; Probability and Computing: Randomized Algorithms and Probabilistic Analysis

I've probably gotten more mileage out of this than out of any other algorithms book. A lot of randomized algorithms are trivial to port to other applications and can simplify things a lot.

The text has enough of an intro to probability that you don't need to have any probability background. Also, the material on tail bounds (e.g., Chernoff bounds) is useful for a lot of CS theory proofs and isn't covered in the intro probability texts I've seen.
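
For reference, one common form of the bound (quoted from memory, so check the text for the exact constants): if X is a sum of independent 0/1 random variables with expectation μ, then for any 0 < δ ≤ 1,

```latex
\Pr[X \ge (1+\delta)\mu] \le e^{-\delta^{2}\mu/3},
\qquad
\Pr[X \le (1-\delta)\mu] \le e^{-\delta^{2}\mu/2}.
```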

Sipser; Introduction to the Theory of Computation

Classic intro to theory of computation. Turing machines, etc. Proofs are often given at an intuitive, “proof sketch”, level of detail. A lot of important results (e.g., Rice's Theorem) are pushed into the exercises, so you really have to do the key exercises. Unfortunately, most of the key exercises don't have solutions, so you can't check your work.

For something with a more modern topic selection, maybe see Arora & Barak.

Bernhardt; Computation

Covers a few theory of computation highlights. The explanations are delightful and I've watched some of the videos more than once just to watch Bernhardt explain things. Targeted at a general programmer audience with no background in CS.

Kearns & Vazirani; An Introduction to Computational Learning Theory

Classic, but dated and riddled with errors, with no errata available. When I wanted to learn this material, I ended up cobbling together notes from a couple of courses, one by Klivans and one by Blum.

Operating Systems

Why should you care? Having a bit of knowledge about operating systems can save days or weeks of debugging time. This is a regular theme on Julia Evans's blog, and I've found the same thing to be true of my experience. I'm hard pressed to think of anyone who builds practical systems and knows a bit about operating systems who hasn't found their operating systems knowledge to be a time saver. However, there's a bias in who reads operating systems books -- it tends to be people who do related work! It's possible you won't get the same thing out of reading these if you do really high-level stuff.

Silberschatz, Galvin, and Gagne; Operating System Concepts

This was what we used at Wisconsin before the comet book became standard. I guess it's ok. It covers concepts at a high level and hits the major points, but it's lacking in technical depth, details on how things work, advanced topics, and clear exposition.

Cox, Kaashoek, and Morris; xv6

This book is great! It explains how you can actually implement things in a real system, and it comes with its own implementation of an OS that you can play with. By design, the authors favor simple implementations over optimized ones, so the algorithms and data structures used are often quite different than what you see in production systems.

This book goes well when paired with a book that talks about how more modern operating systems work, like Love's Linux Kernel Development or Russinovich's Windows Internals.

Arpaci-Dusseau and Arpaci-Dusseau; Operating Systems: Three Easy Pieces

Nice explanation of a variety of OS topics. Goes into much more detail than any other intro OS book I know of. For example, the chapters on file systems describe the details of multiple real filesystems and discuss the major implementation features of ext4. If I have one criticism about the book, it's that it's very *nix focused. Many things that are described are simply how things are done in *nix and not inherent, but the text mostly doesn't say when something is inherent vs. when it's a *nix implementation detail.

Love; Linux Kernel Development

The title can be a bit misleading -- this is basically a book about how the Linux kernel works: how things fit together, what algorithms and data structures are used, etc. I read the 2nd edition, which is now quite dated. The 3rd edition has some updates, but introduced some errors and inconsistencies, and is still dated (it was published in 2010, and covers 2.6.34). Even so, it's a nice introduction into how a relatively modern operating system works.

The other downside of this book is that the author loses all objectivity any time Linux and Windows are compared. Basically every time they're compared, the author says that Linux has clearly and incontrovertibly made the right choice and that Windows is doing something stupid. On balance, I prefer Linux to Windows, but there are a number of areas where Windows is superior, as well as areas where there's parity but Windows was ahead for years. You'll never find out what they are from this book, though.

Russinovich, Solomon, and Ionescu; Windows Internals

The most comprehensive book about how a modern operating system works. It just happens to be about Windows. Coming from a *nix background, I found this interesting to read just to see the differences.

This is definitely not an intro book, and you should have some knowledge of operating systems before reading this. If you're going to buy a physical copy of this book, you might want to wait until the 7th edition is released (early in 2017).

Downey; The Little Book of Semaphores

Takes a topic that's normally one or two sections in an operating systems textbook and turns it into its own 300-page book. The book is a series of exercises, a bit like The Little Schemer, but with more exposition. It starts by explaining what a semaphore is, and then has a series of exercises that build up higher-level concurrency primitives.
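
To give a flavor of the exercises, here's a sketch of the book's early "rendezvous" puzzle (using Python's threading.Semaphore rather than the book's pseudocode): neither thread may proceed past the rendezvous point until the other has arrived.

```python
import threading

# Both semaphores start at 0: each thread signals its own arrival and
# then waits for the other thread's signal.
a_arrived = threading.Semaphore(0)
b_arrived = threading.Semaphore(0)

def thread_a():
    print("a1")
    a_arrived.release()  # signal that A has reached the rendezvous
    b_arrived.acquire()  # wait for B
    print("a2")          # guaranteed to happen after b1

def thread_b():
    print("b1")
    b_arrived.release()  # signal that B has reached the rendezvous
    a_arrived.acquire()  # wait for A
    print("b2")          # guaranteed to happen after a1

ta = threading.Thread(target=thread_a)
tb = threading.Thread(target=thread_b)
ta.start(); tb.start()
ta.join(); tb.join()
```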

This book was very helpful when I first started to write threading/concurrency code. I subscribe to the Butler Lampson school of concurrency, which is to say that I prefer to have all the concurrency-related code stuffed into a black box that someone else writes. But sometimes you're stuck writing the black box, and if so, this book has a nice introduction to the style of thinking required to write maybe possibly not totally wrong concurrent code.

I wish someone would write a book in this style, but both lower level and higher level. I'd love to see exercises like this, but starting with instruction-level primitives for a couple different architectures with different memory models (say, x86 and Alpha) instead of semaphores. If I'm writing grungy low-level threading code today, I'm overwhelmingly likely to be using c++11 threading primitives, so I'd like something that uses those instead of semaphores, which I might have used if I was writing threading code against the Win32 API. But since that book doesn't exist, this seems like the next best thing.

I've heard that Doug Lea's Concurrent Programming in Java is also quite good, but I've only taken a quick look at it.

Computer architecture

Why should you care? The specific facts and trivia you'll learn will be useful when you're doing low-level performance optimizations, but the real value is learning how to reason about tradeoffs between performance and other factors, whether that's power, cost, size, weight, or something else.

In theory, that kind of reasoning should be taught regardless of specialization, but my experience is that comp arch folks are much more likely to “get” that kind of reasoning and do back of the envelope calculations that will save them from throwing away a 2x or 10x (or 100x) factor in performance for no reason. This sounds obvious, but I can think of multiple production systems at large companies that are giving up 10x to 100x in performance which are operating at a scale where even a 2x difference in performance could pay a VP's salary -- all because people didn't think through the performance implications of their design.
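
To make that concrete with entirely made-up numbers (my illustration, not a real system): if a service runs on 2,000 machines at roughly $5,000 per machine-year all-in, a 2x performance improvement that lets you halve the fleet is worth about

```latex
2000 \times \$5000 \times \left(1 - \tfrac{1}{2}\right) = \$5\text{M/year},
```

which is the kind of number a ten-minute envelope calculation can surface before a design gets locked in.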

Hennessy & Patterson; Computer Architecture: A Quantitative Approach

This book teaches you how to do systems design with multiple constraints (e.g., performance, TCO, and power) and how to reason about tradeoffs. It happens to mostly do so using microprocessors and supercomputers as examples.

New editions of this book have substantive additions, so you really want the latest version. For example, the latest version added, among other things, a chapter on data center design that answers questions like: how much opex/capex is spent on power, power distribution, and cooling vs. support staff and machines? What's the effect of using lower-power machines on tail latency and result quality (Bing search results are used as an example)? What other factors should you consider when designing a data center?

Assumes some background, but that background is presented in the appendices (which are available online for free).

Shen & Lipasti; Modern Processor Design

Presents most of what you need to know to architect a high performance Pentium Pro (1995) era microprocessor. That's no mean feat, considering the complexity involved in such a processor. Additionally, presents some more advanced ideas and bounds on how much parallelism can be extracted from various workloads (and how you might go about doing such a calculation). Has an unusually large section on value prediction, because the authors invented the concept and it was still hot when the first edition was published.

For pure CPU architecture, this is probably the best book available.

Hill, Jouppi, and Sohi; Readings in Computer Architecture

Read for historical reasons and to see how much better we've gotten at explaining things. For example, compare Amdahl's paper on Amdahl's law (two pages, with a single non-obvious graph presented, and no formulas), vs. the presentation in a modern textbook (one paragraph, one formula, and maybe one graph to clarify, although it's usually clear enough that no extra graph is needed).
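
For reference, the modern one-formula presentation of Amdahl's law: if a fraction p of the work is sped up by a factor s, the overall speedup is

```latex
\text{speedup} = \frac{1}{(1 - p) + \frac{p}{s}},
```

so, e.g., speeding up 80% of the work by 10x only yields about a 3.6x overall speedup.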

This seems to be worse the further back you go; since comp arch is a relatively young field, nothing here is really hard to understand. If you want to see a dramatic example of how we've gotten better at explaining things, compare Maxwell's original paper on Maxwell's equations to a modern treatment of the same material. Fun if you like history, but a bit of a slog if you're just trying to learn something.

Algorithmic game theory / auction theory / mechanism design

Why should you care? Some of the world's biggest tech companies run on ad revenue, and those ads are sold through auctions. This field explains how and why they work. Additionally, this material is useful any time you're trying to figure out how to design systems that allocate resources effectively.1

In particular, incentive compatible mechanism design (roughly, how to create systems that provide globally optimal outcomes when people behave in their own selfish best interest) should be required reading for anyone who designs internal incentive systems at companies. If you've ever worked at a large company that "gets" this and one that doesn't, you'll see that the company that doesn't get it has giant piles of money that are basically being lit on fire because the people who set up incentives created systems that are hugely wasteful. This field gives you the background to understand what sorts of mechanisms give you what sorts of outcomes; reading case studies gives you a very long (and entertaining) list of mistakes that can cost millions or even billions of dollars.
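
As a minimal illustration of incentive compatibility (my toy sketch, not an example from the books below): in a sealed-bid second-price auction, bidding your true value is a weakly dominant strategy, which is easy to spot-check numerically:

```python
import random

def utility(value, my_bid, other_bids):
    """Payoff in a sealed-bid second-price auction: the highest bidder
    wins and pays the second-highest bid; everyone else pays nothing."""
    highest_other = max(other_bids)
    if my_bid > highest_other:        # ignore exact ties for simplicity
        return value - highest_other  # winner pays the second-highest bid
    return 0.0

# Spot-check dominance: for random values and opponents, bidding your
# true value is never worse than any particular alternative bid.
random.seed(0)
for _ in range(100_000):
    value = random.uniform(0, 100)
    others = [random.uniform(0, 100) for _ in range(3)]
    alt_bid = random.uniform(0, 100)
    assert utility(value, value, others) >= utility(value, alt_bid, others) - 1e-9
print("truthful bidding was never worse in any sampled scenario")
```

In a first-price auction, by contrast, you want to shade your bid below your true value, which is exactly the sort of difference mechanism design is about.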

Krishna; Auction Theory

The last time I looked, this was the only game in town for a comprehensive, modern, introduction to auction theory. Covers the classic second price auction result in the first chapter, and then moves on to cover risk aversion, bidding rings, interdependent values, multiple auctions, asymmetrical information, and other real-world issues.

Relatively dry. Unlikely to be motivating unless you're already interested in the topic. Requires an understanding of basic probability and calculus.

Steiglitz; Snipers, Shills, and Sharks: eBay and Human Behavior

Seems designed as an entertaining introduction to auction theory for the layperson. Requires no mathematical background and relegates math to the small print. Covers maybe 1/10th of the material of Krishna, if that. Fun read.

Cramton, Shoham, and Steinberg; Combinatorial Auctions

Discusses things like how FCC spectrum auctions got to be the way they are and how “bugs” in mechanism design can leave hundreds of millions or billions of dollars on the table. This is one of those books where each chapter is by a different author. Despite that, it still manages to be coherent and I didn't mind reading it straight through. It's self-contained enough that you could probably read this without reading Krishna first, but I wouldn't recommend it.

Shoham and Leyton-Brown; Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations

The title is the worst thing about this book. Otherwise, it's a nice introduction to algorithmic game theory. The book covers basic game theory, auction theory, and other classic topics that CS folks might not already know, and then covers the intersection of CS with these topics. Assumes no particular background in the topic.

Nisan, Roughgarden, Tardos, and Vazirani; Algorithmic Game Theory

A survey of various results in algorithmic game theory. Requires a fair amount of background (consider reading Shoham and Leyton-Brown first). For example, chapter five is basically Devanur, Papadimitriou, Saberi, and Vazirani's JACM paper, Market Equilibrium via a Primal-Dual Algorithm for a Convex Program, with a bit more motivation and some related problems thrown in. The exposition is good and the result is interesting (if you're into that kind of thing), but it's not necessarily what you want if you want to read a book straight through and get an introduction to the field.

Misc

Beyer, Jones, Petoff, and Murphy; Site Reliability Engineering

A description of how Google handles operations. Has the typical Google tone, which is off-putting to a lot of folks with a “traditional” ops background, and assumes that many things can only be done with the SRE model when they can, in fact, be done without going full SRE.

For a much longer description, see this 22 page set of notes on Google's SRE book.

Fowler, Beck, Brant, Opdyke, and Roberts; Refactoring

At the time I read it, it was worth the price of admission for the section on code smells alone. But this book has been so successful that the ideas of refactoring and code smells have become mainstream.

Steve Yegge has a great pitch for this book:

When I read this book for the first time, in October 2003, I felt this horrid cold feeling, the way you might feel if you just realized you've been coming to work for 5 years with your pants down around your ankles. I asked around casually the next day: "Yeah, uh, you've read that, um, Refactoring book, of course, right? Ha, ha, I only ask because I read it a very long time ago, not just now, of course." Only 1 person of 20 I surveyed had read it. Thank goodness all of us had our pants down, not just me.

...

If you're a relatively experienced engineer, you'll recognize 80% or more of the techniques in the book as things you've already figured out and started doing out of habit. But it gives them all names and discusses their pros and cons objectively, which I found very useful. And it debunked two or three practices that I had cherished since my earliest days as a programmer. Don't comment your code? Local variables are the root of all evil? Is this guy a madman? Read it and decide for yourself!

DeMarco & Lister; Peopleware

This book seemed convincing when I read it in college. It even had all sorts of studies backing up what they said. No deadlines is better than having deadlines. Offices are better than cubicles. Basically all devs I talk to agree with this stuff.

But virtually every successful company is run the opposite way. Even Microsoft is remodeling buildings from individual offices to open plan layouts. Could it be that all of this stuff just doesn't matter that much? If it really is that important, how come companies that are true believers, like Fog Creek, aren't running roughshod over their competitors?

This book agrees with my biases and I'd love for this book to be right, but the meta evidence makes me want to re-read this with a critical eye and look up primary sources.

Drummond; Renegades of the Empire

This book explains how Microsoft's aggressive culture got to be the way it is today. The intro reads:

Microsoft didn't necessarily hire clones of Gates (although there were plenty on the corporate campus) so much as recruit those who shared some of Gates's more notable traits -- arrogance, aggressiveness, and high intelligence.

Gates is infamous for ridiculing someone's idea as “stupid”, or worse, “random”, just to see how he or she defends a position. This hostile managerial technique invariably spread through the chain of command and created a culture of conflict.

Microsoft nurtures a Darwinian order where resources are often plundered and hoarded for power, wealth, and prestige. A manager who leaves on vacation might return to find his turf raided by a rival and his project put under a different command or canceled altogether.

On interviewing at Microsoft:

“What do you like about Microsoft?” “Bill kicks ass”, St. John said. “I like kicking ass. I enjoy the feeling of killing competitors and dominating markets”.

He was unsure how he was doing and thought he had stumbled when he was then asked if he was a "people person". "No, I think most people are idiots", St. John replied.

These answers were exactly what Microsoft was looking for. They resulted in a strong offer and an aggressive courtship.

On developer evangelism at Microsoft:

At one time, Microsoft evangelists were also usually chartered with disrupting competitors by showing up at their conferences, securing positions on and then tangling standards committees, and trying to influence the media.

"We're the group at Microsoft whose job is to fuck Microsoft's competitors"

Read this book if you're considering a job at Microsoft. Although it's been a long time since the events described in this book, you can still see strains of this culture in Microsoft today.

Bilton; Hatching Twitter

An entertaining book about the backstabbing, mismanagement, and random firings that happened in Twitter's early days. When I say random, I mean that there were instances where critical engineers were allegedly fired so that the "decider" could show other important people that current management was still in charge.

I don't know folks who were at Twitter back then, but I know plenty of folks who were at the next generation of startups in their early days and there are a couple of companies where people had eerily similar experiences. Read this book if you're considering a job at a trendy startup.

Galenson; Old Masters and Young Geniuses

This book is about art and how productivity changes with age, but if its thesis is valid, it probably also applies to programming. Galenson applies statistics to determine the "greatness" of art and then uses that to draw conclusions about how the productivity of artists changes as they age. I don't have time to go over the data in detail, so I'll have to remain skeptical of this until I have more free time, but I think it's interesting reading even for a skeptic.

Math

Why should you care? From a pure ROI perspective, I doubt learning math is “worth it” for 99% of jobs out there. AFAICT, I use math more often than most programmers, and I don't use it all that often. But having the right math background sometimes comes in handy and I really enjoy learning math. YMMV.

Bertsekas; Introduction to Probability

Introductory undergrad text that tends towards intuitive explanations over epsilon-delta rigor. For anyone who cares to do more rigorous derivations, there are some exercises at the back of the book that go into more detail.

Has many exercises with available solutions, making this a good text for self-study.

Ross; A First Course in Probability

This is one of those books where they regularly crank out new editions to make students pay for new copies of the book (this is presently priced at a whopping $174 on Amazon)2. This was the standard text when I took probability at Wisconsin, and I literally cannot think of a single person who found it helpful. Avoid.

Brualdi; Introductory Combinatorics

Brualdi is a great lecturer, one of the best I had in undergrad, but this book was full of errors and not particularly clear. There have been two new editions since I used this book, but according to the Amazon reviews the book still has a lot of errors.

For an alternate introductory text, I've heard good things about Camina & Lewis's book, but I haven't read it myself. Also, Lovasz is a great book on combinatorics, but it's not exactly introductory.

Apostol; Calculus

Volume 1 covers what you'd expect in a calculus I + calculus II book. Volume 2 covers linear algebra and multivariable calculus, and it covers linear algebra first, which makes multivariable calculus a lot easier to understand.

It also makes a lot of sense from a programming standpoint, since a lot of the value I get out of calculus is its applications to approximations, etc., and that's a lot clearer when taught in this sequence.

This book is probably a rough intro if you don't have a professor or TA to help you along. The Springer SUMS series tends to be pretty good for self-study introductions to various areas, but I haven't read their intro calculus book, so I can't recommend it.

Stewart; Calculus

Another one of those books where they crank out new editions with trivial changes to make money. This was the standard text for non-honors calculus at Wisconsin, and the result of that was that I ended up teaching a lot of people to do complex integrals with the methods covered in Apostol instead, which are much more intuitive to many folks.

This book takes the approach that, for a type of problem, you should pattern match to one of many possible formulas and then apply the formula. Apostol is more about teaching you a few tricks and some intuition that you can apply to a wide variety of problems. I'm not sure why you'd buy this unless you were required to for some class.

Hardware basics

Why should you care? People often claim that, to be a good programmer, you have to understand every abstraction you use. That's nonsense. Modern computing is too complicated for any human to have a real full-stack understanding of what's going on. In fact, one reason modern computing can accomplish what it does is that it's possible to be productive without having a deep understanding of much of the stack that sits below the level you're operating at.

That being said, if you're curious about what sits below software, here are a few books that will get you started.

Nisan & Schocken; nand2tetris

If you only want to read one single thing, this should probably be it. It's a “101” level intro that goes down to gates and Boolean logic. As implied by the name, it takes you from NAND gates to a working tetris program.

Roth; Fundamentals of Logic Design

Much more detail on gates and logic design than you'll see in nand2tetris. The book is full of exercises and appears to be designed to work for self-study. Note that the link above is to the 5th edition. There are newer editions, but they don't seem to be much improved, have a lot of errors in the new material, and are much more expensive.

Weste, Harris, and Banerjee; CMOS VLSI Design

One level below Boolean gates, you get to VLSI, a historical acronym (very large scale integration) that doesn't really have any meaning today.

Broader and deeper than the alternatives, with clear exposition. Explores the design space (e.g., the section on adders doesn't just mention a few different types in an ad hoc way; it explores all the tradeoffs you can make). Also, has both problems and solutions, which makes it great for self-study.

Kang & Leblebici; CMOS Digital Integrated Circuits

This was the standard text at Wisconsin way back in the day. It was hard enough to follow that the TA basically re-explained pretty much everything necessary for the projects and the exams. I find that it's ok as a reference, but it wasn't a great book to learn from.

Compared to Kang & Leblebici, Weste et al. spend a lot more effort talking about tradeoffs in design (e.g., when creating a parallel prefix tree adder, what does it really mean to be at some particular point in the design space?).

Pierret; Semiconductor Device Fundamentals

One level below VLSI, you have how transistors actually work.

Really beautiful explanation of solid state devices. The text nails the fundamentals of what you need to know to really understand this stuff (e.g., band diagrams), and then uses those fundamentals along with clear explanations to give you a good mental model of how different types of junctions and devices work.

Streetman & Banerjee; Solid State Electronic Devices

Covers the same material as Pierret, but seems to substitute mathematical formulas for the intuitive understanding that Pierret goes for.

Ida; Engineering Electromagnetics

One level below transistors, you have electromagnetics.

Two to three times thicker than other intro texts because it has more worked examples and diagrams. Breaks things down into types of problems and subproblems, making things easy to follow. For self-study, it's a much gentler introduction than Griffiths or Purcell.

Shanley; Pentium Pro and Pentium II System Architecture

Unlike the other books in this section, this book is about practice instead of theory. It's a bit like Windows Internals, in that it goes into the details of a real, working, system. Topics include hardware bus protocols, how I/O actually works (e.g., APIC), etc.

The problem with a practical introduction is that there's been an exponential increase in complexity ever since the 8080. The further back you go, the easier it is to understand the most important moving parts in the system, and the more irrelevant the knowledge. This book seems like an ok compromise in that the bus and I/O protocols had to handle multiprocessors, and many of the elements that are in modern systems were in these systems, just in a simpler form.

Not covered

Of the books that I've liked, I'd say this captures at most 25% of the software books and 5% of the hardware books. On average, the books that have been left off the list are more specialized. This list is also missing many entire topic areas, like PL, practical books on how to learn languages, networking, etc.

The reasons for leaving off topic areas vary; I don't have any PL books listed because I don't read PL books. I don't have any networking books because, although I've read a couple, I don't know enough about the area to really say how useful the books are. The vast majority of hardware books aren't included because they cover material that you wouldn't care about unless you were a specialist (e.g., Skew-Tolerant Circuit Design or Ultrafast Optics). The same goes for areas like math and CS theory, where I left off a number of books that I think are great but have basically zero probability of being useful in my day-to-day programming life, e.g., Extremal Combinatorics. I also didn't include books I didn't read all or most of, unless I stopped because the book was atrocious. This means that I don't list classics I haven't finished like SICP and The Little Schemer, since those books seem fine and I just didn't finish them for one reason or another.

This list also doesn't include many books on history and culture, like Inside Intel or Masters of Doom. I'll probably add more at some point, but I've been trying an experiment where I try to write more like Julia Evans (stream of consciousness, fewer or no drafts). I'd have to go back and re-read the books I read 10+ years ago to write meaningful comments, which doesn't exactly fit with the experiment. On that note, since this list is from memory and I got rid of almost all of my books a couple years ago, I'm probably forgetting a lot of books that I meant to add.

_If you liked this, you might also like Thomas Ptacek's Application Security Reading List or this list of programming blogs, which is written in a similar style_

_Thanks to @tytrdev for comments/corrections/discussion._


  1. Also, if you play board games, auction theory explains why fixing game imbalance via an auction mechanism is non-trivial and often makes the game worse. [return]
  2. I talked to the author of one of these books. He griped that the used book market destroys revenue from textbooks after a couple years, and that authors don't get much in royalties, so you have to charge a lot of money and keep producing new editions every couple of years to make money. That griping goes double in cases where a new author picks up a classic book that someone else originally wrote, since the original author often has a much larger share of the royalties than the new author, despite doing no work on the later editions. [return]

Hiring and the market for lemons

2016-10-09 17:44:14

Joel Spolsky has a classic blog post on "Finding Great Developers" where he popularized the meme that great developers are impossible to find, a corollary of which is that if you can find someone, they're not great. Joel writes,

The great software developers, indeed, the best people in every field, are quite simply never on the market.

The average great software developer will apply for, total, maybe, four jobs in their entire career.

...

If you're lucky, if you're really lucky, they show up on the open job market once, when, say, their spouse decides to accept a medical internship in Anchorage and they actually send their resume out to what they think are the few places they'd like to work at in Anchorage.

But for the most part, great developers (and this is almost a tautology) are, uh, great, (ok, it is a tautology), and, usually, prospective employers recognize their greatness quickly, which means, basically, they get to work wherever they want, so they honestly don't send out a lot of resumes or apply for a lot of jobs.

Does this sound like the kind of person you want to hire? It should. The corollary of that rule--the rule that the great people are never on the market--is that the bad people--the seriously unqualified--are on the market quite a lot. They get fired all the time, because they can't do their job. Their companies fail--sometimes because any company that would hire them would probably also hire a lot of unqualified programmers, so it all adds up to failure--but sometimes because they actually are so unqualified that they ruined the company. Yep, it happens.

These morbidly unqualified people rarely get jobs, thankfully, but they do keep applying, and when they apply, they go to Monster.com and check off 300 or 1000 jobs at once trying to win the lottery.

Astute readers, I expect, will point out that I'm leaving out the largest group yet, the solid, competent people. They're on the market more than the great people, but less than the incompetent, and all in all they will show up in small numbers in your 1000 resume pile, but for the most part, almost every hiring manager in Palo Alto right now with 1000 resumes on their desk has the same exact set of 970 resumes from the same minority of 970 incompetent people that are applying for every job in Palo Alto, and probably will be for life, and only 30 resumes even worth considering, of which maybe, rarely, one is a great programmer. OK, maybe not even one.

Joel's claim is basically that "great" developers won't have that many jobs compared to "bad" developers because companies will try to keep "great" developers. Joel also posits that companies can recognize prospective "great" developers easily. But these two statements are hard to reconcile. If it's so easy to identify prospective "great" developers, why not try to recruit them? You could just as easily make the case that "great" developers are overrepresented in the market because they have better opportunities and it's the "bad" developers who will cling to their jobs. This kind of adverse selection is common in companies that are declining; I saw that in my intern cohort at IBM1, among other places.

Should "good" developers be overrepresented in the market or underrepresented? If we listen to the anecdotal griping about hiring, we might ask if the market for developers is a market for lemons. This idea goes back to Akerlof's Nobel prize winning 1970 paper, "The Market for 'Lemons': Quality Uncertainty and the Market Mechanism". Akerlof takes used car sales as an example, splitting the market into good used cars and bad used cars (bad cars are called "lemons"). If there's no way to distinguish between good cars and lemons, good cars and lemons will sell for the same price. Since buyers can't distinguish between good cars and bad cars, the price they're willing to pay is based on the quality of the average in the market. Since owners know if their car is a lemon or not, owners of non-lemons won't sell because the average price is driven down by the existence of lemons. This results in a feedback loop which causes lemons to be the only thing available.

This model is certainly different from Joel's model. Joel's model assumes that "great" developers are sticky -- that they stay at each job for a long time. This comes from two assumptions; first, that it's easy for prospective employers to identify who's "great", and second, that once someone is identified as "great", their current employer will do anything to keep them (as in the market for lemons). But the first assumption alone is enough to prevent the developer job market from being a market for lemons. If you can tell that a potential employee is great, you can simply go and offer them twice as much as they're currently making (something that I've seen actually happen). You need an information asymmetry to create a market for lemons, and Joel posits that there's no information asymmetry.

If we put aside Joel's argument and look at the job market, there's incomplete information, but both current and prospective employers have incomplete information, and whose information is better varies widely. It's actually quite common for prospective employers to have better information than current employers!

Just for example, there's someone I've worked with, let's call him Bob, who's saved two different projects by doing the grunt work necessary to keep the project from totally imploding. The projects were both declared successes, promotions went out, they did a big PR blitz that involved seeding articles in all the usual suspects: Wired, Fortune, and so on and so forth. That's worked out great for the people who are good at taking credit for things, but it hasn't worked out so well for Bob. In fact, someone else I've worked with recently mentioned to me that management keeps asking him why Bob takes so long to do simple tasks. The answer is that Bob's busy making sure the services he works on don't have global outages when they launch, but that's not the kind of thing you get credit for in Bob's org. The result of that is that Bob has a network that knows he's great, which makes it easy for him to get a job anywhere else at market rate. But his management chain has no idea, and based on what I've seen of offers today, they're paying him about half what he could make elsewhere. There's no shortage of cases where information transfer inside a company is so poor that external management has a better view of someone's productivity than internal management. I have one particular example in mind, but if I just think of the Bob archetype, off the top of my head, I know of four people who are currently in similar situations. It helps that I currently work at a company that's notorious for being dysfunctional in this exact way, but this happens everywhere. When I worked at a small company, we regularly hired great engineers from big companies that were too clueless to know what kind of talent they had.

Another problem with the idea that "great" developers are sticky is that this assumes that companies are capable of creating groups that developers want to work for on demand. This is usually not the case. Just for example, I once joined a team where the TL was pretty strongly against using version control or having tests. As a result of those (and other) practices, it took five devs one year to produce 10k lines of kinda-sorta working code for a straightforward problem. Additionally, it was a pressure cooker where people were expected to put in 80+ hour weeks, where the PM would shame people into putting in longer hours. Within a year, three of the seven people who were on the team when I joined had left; two of them went to different companies. The company didn't want to lose those two people, but it wasn't capable of creating an environment that would keep them.

Around when I joined that team, a friend of mine joined a really great team. They do work that materially impacts the world, they have room for freedom and creativity, a large component of their jobs involves learning new and interesting things, and so on and so forth. Whenever I heard about someone who was looking for work, I'd forward them that team. That team is now full for the foreseeable future because everyone whose network included that team forwarded people into that team. But if you look at the team that lost three out of seven people in a year, that team is hiring. A lot. The result of this dynamic is that, as a dev, if you join a random team, you're overwhelmingly likely to join a team that has a lot of churn. Additionally, if you know of a good team, it's likely to be full.

Joel's model implicitly assumes that, proportionally, there are many more dysfunctional developers than dysfunctional work environments.

At the last conference I attended, I asked most people I met two questions:

  1. Do you know of any companies that aren't highly dysfunctional?
  2. Do you know of any particular teams that are great and are hiring?

Not one single person told me that their company meets the criteria in (1). A few people suggested that, maybe, Dropbox is ok, or that, maybe, Jane Street is ok, but the answers were of the form "I know a few people there and I haven't heard any terrible horror stories yet, plus I sometimes hear good stories", not "that company is great and you should definitely work there". Most people said that they didn't know of any companies that weren't a total mess.

A few people had suggestions for (2), but the most common answer was something like "LOL no, if I knew that I'd go work there". The second most common answer was of the form "I know some people on the Google Brain team and it sounds great". There are a few teams that are well known for being great places to work, but they're so few and far between that it's basically impossible to get a job on one of those teams. A few people knew of actual teams that they'd strongly recommend who were hiring, but that was rare. Much rarer than finding a developer who I'd want to work with who would consider moving. If I flipped the question around and asked if they knew of any good developers who were looking for work, the answer was usually "yes"2.

Another problem with the idea that "great" developers are impossible to find because they join companies and then stick is that developers (and companies) aren't immutable. Because I've been lucky enough to work in environments that allow people to really flourish, I've seen a lot of people go from unremarkable to amazing. And because most companies invest pretty much nothing in helping people, you can do really well on this dimension without investing much effort.

On the flip side, I've seen entire teams of devs go on the market because their environment changed. Just for example, I used to know a lot of people who worked at company X under Marc Yun. It was the kind of place that has low attrition because people really enjoy working there. And then Marc left. Over the next two years, literally everyone I knew who worked there left. This one change both created a lemon in the searching-for-a-team job market and put a bunch of good developers on the market. This kind of thing happens all the time, even more now than in the past because of today's acquisition-heavy environment.

Is developer hiring a market for lemons? Well, it depends on what you mean by that. Both developers and hiring managers have incomplete information. It's not obvious if having a market for lemons in one direction makes the other direction better or worse. The fact that joining a new team is uncertain makes developers less likely to leave existing teams, which makes it harder to hire developers. But the fact that developers often join teams which they dislike makes it easier to hire developers. What's the net effect of that? I have no idea.

From where I'm standing, it seems really hard to find a good manager/team, and I don't know of any replicable strategy for doing so; I have a lot of sympathy for people who can't find a good fit because I get how hard that is. But I have seen replicable strategies for hiring, so I don't have nearly as much sympathy for hiring managers who complain that hiring "great" developers is impossible.

When a hiring manager complains about hiring, in every single case I've seen so far, the hiring manager has one of the following problems:

  1. They pay too little. The last time I went looking for work, I found a 6x difference in compensation between companies who might hire me in the same geographic region. Basically all of the companies thought that they were competitive, even when they were at the bottom end of the range. I don't know what it is, but companies always seem to think that they pay well, even when they're not even close to being in the right range. Almost everyone I talk to tells me that they pay as much as any reasonable company. Sure, there are some companies out there that pay a bit more, but they're overpaying! You can actually see this if you read Joel's writing -- back when he wrote the post I'm quoting above, he talked about how well Fog Creek paid. A couple years later, he complained that Google was overpaying for college kids with no experience, and more recently he's pretty much said that you don't want to work at companies that pay well.

  2. They pass on good or even "great" developers3. Earlier, I claimed that I knew lots of good developers who are looking for work. You might ask, if there are so many good developers looking for work, why's it so hard to find them? Joel claims that out of 1000 resumes, maybe 30 people will be "solid" and 970 will be "incompetent". It seems to me it's more like 400 will be solid and 20 will be really good. It's just that almost everyone uses the same filters, so everyone ends up fighting over the 30 people who they think are solid. When people do randomized trials on what actually causes resumes to get filtered out, it often turns out that traits that are tangentially related or unrelated to job performance make huge differences. For example, in this study of law firm recruiting, the authors found that a combination of being male and having "high-class" signifiers on the resume (sailing, polo, and classical music instead of track and field, pick-up soccer, and country music) with no other changes caused a 4x increase in interview invites.

    The first company I worked at, Centaur, had an onsite interview process that was less stringent than the phone screen at places like Google and Facebook. If you listen to people like Joel, you'd think that Centaur was full of bozos, but after over a decade in industry (including time at Google), Centaur had the best mean and median level of developer productivity of any place I've worked.

    Matasano famously solved their hiring problem by using a different set of filters and getting a different set of people. Despite the resounding success of their strategy, pretty much everyone insists on sticking with the standard strategy of picking people with brand name pedigrees and running basically the same interview process as everyone else, bidding up the price of folks who are trendy and ignoring everyone else.

    If I look at developers I know who are in high-demand today, a large fraction of them went through a multi-year period where they were underemployed and practically begging for interesting work. These people are very easy to hire if you can find them.

  3. They're trying to hire for some combination of rare skills. Right now, if you're trying to hire for someone with experience in deep learning and, well, anything else, you're going to have a bad time.

  4. They're much more dysfunctional than they realize. I know one hiring manager who complains about how hard it is to hire. What he doesn't realize is that literally everyone on his team is bitterly unhappy and a significant fraction of his team gives anti-referrals to friends and tells them to stay away.

    That's an extreme case, but it's quite common to see a VP or founder baffled by why hiring is so hard when employees consider the place to be mediocre or even bad.

Of these problems, (1), low pay, is both the most common and the simplest to fix.

In the past few years, Oracle and Alibaba have spun up new cloud computing groups in Seattle. This is a relatively competitive area, and both companies have reputations that work against them when hiring4. If you believe the complaints about how hard it is to hire, you wouldn't think one company, let alone two, could spin up entire cloud teams in Seattle. Both companies solved the problem by paying substantially more than their competitors were offering for people with similar experience. Alibaba became known for such generous offers that when I was negotiating my offer from Microsoft, MS told me that they'd match an offer from any company except Alibaba. I believe Oracle and Alibaba have hired hundreds of engineers over the past few years.

Most companies don't need to hire anywhere near hundreds of people; they can pay competitively without hiring so many developers that the entire market moves upwards, but they still refuse to do so, while complaining about how hard it is to hire.

(2), filtering out good potential employees, seems like the modern version of "no one ever got fired for hiring IBM". If you hire someone with a trendy background who's good at traditional coding interviews and they don't work out, who could blame you? And no one's going to notice all the people you missed out on. Like (1), this is something that almost everyone thinks they do well and they'll say things like "we'd have to lower our bar to hire more people, and no one wants that". But I've never worked at a place that doesn't filter out a lot of people who end up doing great work elsewhere. I've tried to get underrated programmers5 hired at places I've worked, and I've literally never succeeded in getting one hired. Once, someone I failed to get hired managed to get a job at Google after something like four years being underemployed (and is a star there). That guy then got me hired at Google. Not hiring that guy didn't only cost them my brilliant friend, it eventually cost them me!

BTW, this illustrates a problem with Joel's idea that "great" devs never apply for jobs. There's often a long time period where a "great" dev has an extremely hard time getting hired, even through their network who knows that they're great, because they don't look like what people think "great" developers look like. Additionally, Google, which has heavily studied which hiring channels give good results, has found that referrals and internal recommendations don't actually generate much signal. While people will refer "great" devs, they'll also refer terrible ones. The referral bonus scheme that most companies set up skews incentives in a way that makes referrals worse than you might expect. Because of this and other problems, many companies don't weight referrals particularly heavily, and "great" developers still go through the normal hiring process, just like everyone else.

(3), needing a weird combination of skills, can be solved by hiring people with half or a third of the expertise you need and training people. People don't seem to need much convincing on this one, and I see this happen all the time.

(4), dysfunction, seems hard to fix. If I knew how to do that, I'd be a manager.

As a dev, it seems to me that teams I know of that are actually good environments that pay well have no problems hiring, and that teams that have trouble hiring can pretty easily solve that problem. But I'm biased. I'm not a hiring manager. There's probably some hiring manager out there thinking: "every developer I know who complains that it's hard to find a good team has one of these four obvious problems; if only my problems were that easy to solve!"

Thanks to Leah Hanson, David Turner, Tim Abbott, Vaibhav Sagar, Victor Felder, Ezekiel Smithburg, Juliano Bortolozzo Solanho, Stephen Tu, Pierre-Yves Baccou, Jorge Montero, Ben Kuhn, and Lindsey Kuper for comments and corrections.

If you liked this post, you'd probably enjoy this other post on the bogosity of claims that there can't possibly be discrimination in tech hiring.


  1. The folks who stayed describe an environment that's mostly missing mid-level people they'd want to work with. There are lifers who've been there forever and will be there until retirement, and there are new grads who land there at random. But, compared to their competitors, there are relatively few people with 5-15 years of experience. The person I knew who lasted the longest stayed until the 8-year mark, but he started interviewing with an eye on leaving when he found out the other person on his team who was competent was interviewing; neither one wanted to be the only person on the team doing any work, so they raced to get out the door first. [return]
  2. This section kinda makes it sound like I'm looking for work. I'm not looking for work, although I may end up forced into it if my partner takes a job outside of Seattle. [return]
  3. Moishe Lettvin has a talk I really like, where he talks about a time when he was on a hiring committee and they rejected every candidate that came up, only to find that the "candidates" were actually anonymized versions of their own interviews!

    The bit about when he first started interviewing at Microsoft should sound familiar to MS folks. As is often the case, he got thrown into the interview with no warning and no preparation. He had no idea what to do and, as a result, wrote up interview feedback that wasn't great. "In classic Microsoft style", his manager forwarded the interview feedback to the entire team and said "don't do this". "In classic Microsoft style" is a quote from Moishe, but I've observed the same thing. I'd like to talk about how we have a tendency to do extremely blameful postmortems and how that warps incentives, but that probably deserves its own post.

    Well, I'll tell one story, in remembrance of someone who recently left my former team for Google. Shortly after that guy joined, he was in the office on a weekend (a common occurrence on his team). A manager from another team pinged him on chat and asked him to sign off on some code from the other team. The new guy, wanting to be helpful, signed off on the code. On Monday, the new guy talked to his mentor and his mentor suggested that he not help out other teams like that. Later, there was an outage related to the code. In classic Microsoft style, the manager from the other team successfully pushed the blame for the outage from his team to the new guy.

    Note that this guy isn't included in my 3/7 stat because he joined shortly after I did, and I'm not trying to cherry pick a window with the highest possible attrition.

    [return]
  4. For a while, Oracle claimed that the culture of the Seattle office is totally different from mainline-Oracle culture, but from what I've heard, they couldn't resist Oracle-ifying the Seattle group and that part of the pitch is no longer convincing. [return]
  5. This footnote is a response to Ben Kuhn, who asked me, what types of devs are underrated and how would you find them? I think this group is diverse enough that there's no one easy way to find them. There are people like "Bob", who do critical work that's simply not noticed. There are also people who are just terrible at interviewing, like Jeshua Smith. I believe he's only once gotten a performance review that wasn't excellent (that semester, his manager said he could only give out one top rating, and it wouldn't be fair to give it to only one of his two top performers, so he gave them both average ratings). In every place he's worked, he's been well known as someone who you can go to with hard problems or questions, and much higher ranking engineers often go to him for help. I tried to get him hired at two different companies I've worked at and he failed both interviews. He sucks at interviews. My understanding is that his interview performance almost kept him from getting his current job, but his references were so numerous and strong that his current company decided to take a chance on him anyway. But he only had those references because his old org has been disintegrating. His new company picked up a lot of people from his old company, so there were many people at the new company that knew him. He can't get the time of day almost anywhere else. Another person I've tried and failed to get hired is someone I'll call Ashley, who got rejected in the recruiter screening phase at Google for not being technical enough, despite my internal recommendation that she was one of the strongest programmers I knew. But she came from a "nontraditional" background that didn't fit the recruiter's idea of what a programmer looked like, so that was that. Nontraditional is a funny term because it seems like most programmers have a "nontraditional" background, but you know what I mean.

    There's enough variety here that there isn't one way to find all of these people. Having a filtering process that's more like Matasano's and less like Google, Microsoft, Facebook, almost any YC startup you can name, etc., is probably a good start.

    [return]

I could do that in a weekend!

2016-10-03 16:14:27

I can't think of a single large software company that doesn't regularly draw internet comments of the form “What do all the employees do? I could build their product myself.” Benjamin Pollack and Jeff Atwood called out people who do that with Stack Overflow. But Stack Overflow is relatively obviously lean, so the general response is something like “oh, sure maybe Stack Overflow is lean, but FooCorp must really be bloated”. And since most people have relatively little visibility into FooCorp, for any given value of FooCorp, that sounds like a plausible statement. After all, what product could possibly require hundreds, or even thousands, of engineers?

A few years ago, in the wake of the rapgenius SEO controversy, a number of folks called for someone to write a better Google. Alex Clemmer responded that maybe building a better Google is a non-trivial problem. Considering how much of Google's $500B market cap comes from search, and how much money has been spent by tens (hundreds?) of competitors in an attempt to capture some of that value, it seems plausible to me that search isn't a trivial problem. But in the comments on Alex's posts, multiple people respond and say that Lucene basically does the same thing Google does and that Lucene is poised to surpass Google's capabilities in the next few years. It's been long enough since then that we can look back and say that Lucene hasn't improved so much that Google is in danger from a startup that puts together a Lucene cluster. If anything, the cost of creating a viable competitor to Google search has gone up.

For making a viable Google competitor, I believe that ranking is a harder problem than indexing, but even if we just look at indexing, there are individual domains that contain on the order of one trillion pages we might want to index (like Twitter) and I'd guess that we can find on the order of a trillion domains. If you try to configure any off-the-shelf search index to hold an index of some number of trillions of items to handle a load of, say, 1/100th of Google's load, with a latency budget of, say, 100ms (most of the latency should be for ranking, not indexing), I think you'll find that this isn't trivial. And if you use Google to search Twitter, you can observe that, at least for select users or tweets, Google indexes Twitter quickly enough that it's basically real-time from the standpoint of users. Anyone who's tried to do real-time indexing with Lucene on a large corpus under high load will also find this to be non-trivial. You might say that this isn't totally fair since it's possible to find tweets that aren't indexed by major search engines, but if you want to make a call on what to index or not, well, that's also a problem that's non-trivial in the general case. And we're only talking about indexing here; indexing is one of the easier parts of building a search engine.
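
To get a sense of the scale involved, here's a minimal back-of-the-envelope sketch. The corpus size and latency budget come from the paragraph above; the per-server figures (documents per shard, queries per second per replica) are round numbers I'm assuming purely for illustration, not measurements of any real system.

```python
# Back-of-the-envelope sizing for a hypothetical web-scale index.
# All per-server figures are assumptions for illustration only.

DOCS = 1e12               # ~1 trillion documents (one large domain, per the text above)
QPS = 1e5                 # assume "1/100th of Google's load" is ~100k queries/sec
DOCS_PER_SHARD = 1e8      # assume one server holds ~100M docs and stays under 100ms
QPS_PER_REPLICA = 1e3     # assume one replica of a shard serves ~1k queries/sec

shards = DOCS / DOCS_PER_SHARD            # servers needed just to hold the index once
# Every query fans out to every shard, so each shard sees the full query load
# and needs enough replicas to absorb it.
replicas_per_shard = QPS / QPS_PER_REPLICA
servers = shards * replicas_per_shard

print(f"{shards:,.0f} shards x {replicas_per_shard:,.0f} replicas "
      f"= {servers:,.0f} servers, before ranking, crawling, or redundancy")
```

Even if these guesses are off by an order of magnitude or two in your favor, you're still looking at a serious fleet, not a weekend project.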

Businesses that actually care about turning a profit will spend a lot of time (hence, a lot of engineers) working on optimizing systems, even if an MVP for the system could have been built in a weekend. There's also a wide body of research that's found that decreasing latency has a significant effect on revenue over a pretty wide range of latencies for some businesses. Increasing performance also has the benefit of reducing costs. Businesses should keep adding engineers to work on optimization until the cost of adding an engineer equals the revenue gain plus the cost savings at the margin. This is often many more engineers than people realize.
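
As a toy illustration of that marginal argument (every number below is invented for the example, not taken from any real company), the rule is just: keep hiring as long as the next engineer's expected return exceeds what they cost.

```python
# Toy marginal-hiring calculation with invented numbers.
ANNUAL_REVENUE = 10_000_000_000   # assume a $10B/yr business
FULLY_LOADED_COST = 400_000       # assume an engineer costs $400k/yr all-in

def marginal_gain(nth_engineer: int) -> float:
    """Assume the first optimization engineer recovers 0.1% of revenue per year
    and returns diminish as 1/n after that -- a made-up curve for illustration."""
    return ANNUAL_REVENUE * 0.001 / nth_engineer

n = 0
while marginal_gain(n + 1) > FULLY_LOADED_COST:
    n += 1
print(f"Under these made-up assumptions, optimization pays for {n} engineers")
```

The exact curve is fictional, but the shape of the argument is the point: as long as the marginal gain stays above the marginal cost, the "extra" engineers aren't bloat.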

And that's just performance. Features also matter: when I talk to engineers working on basically any product at any company, they'll often tell me that there are seemingly trivial individual features that can add integer percentage points to revenue. Just as with performance, people underestimate how many engineers you can add to a product before engineers stop paying for themselves.

Additionally, features are often much more complex than outsiders realize. If we look at search, how do we make sure that different forms of dates and phone numbers give the same results? How about internationalization? Each language has unique quirks that have to be accounted for. In French, “l'foo” should often match “un foo” and vice versa, but American search engines from the 90s didn't actually handle that correctly. How about tokenizing Chinese queries, where words don't have spaces between them and sentences don't have unique tokenizations? How about Japanese, where queries can easily contain four different alphabets? How about handling Arabic, which is mostly read right-to-left, except for the bits that are read left-to-right? And that's not even the most complicated part of handling Arabic! It's fine to ignore this stuff for a weekend-project MVP, but ignoring it in a real business means ignoring the majority of the market! Some of these problems are handled reasonably well by open source projects, but many of them are still open research problems.
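
To make the date and phone-number point concrete, here's a minimal sketch of the kind of query normalization involved, so that different surface forms of the same thing hit the same index terms. The formats and canonical forms are assumptions chosen for illustration; a real engine handles far more variants, locales, and ambiguities (is 03/04 March 4th or April 3rd?).

```python
import re
from datetime import datetime

DATE_FORMATS = ["%m/%d/%Y", "%Y-%m-%d", "%B %d, %Y", "%d %B %Y"]

def normalize_date(token: str) -> str | None:
    # Map "3/16/2024", "2024-03-16", and "March 16, 2024" onto one canonical form.
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(token, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    return None

def normalize_phone(token: str) -> str | None:
    # Collapse "(555) 123-4567", "555.123.4567", etc. into one canonical token.
    digits = re.sub(r"\D", "", token)
    if len(digits) == 11 and digits.startswith("1"):   # strip a leading country code
        digits = digits[1:]
    return digits if len(digits) == 10 else None

assert normalize_date("3/16/2024") == normalize_date("March 16, 2024") == "2024-03-16"
assert normalize_phone("(555) 123-4567") == normalize_phone("555.123.4567") == "5551234567"
```

And this is the easy, English-and-North-America-only case; the tokenization problems in the paragraph above don't reduce to a handful of regexes and format strings.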

There's also security! If you don't “bloat” your company by hiring security people, you'll end up like Hotmail or Yahoo, where your product is better known for how often it's hacked than for any of its other features.

Everything we've looked at so far is a technical problem. Compared to organizational problems, technical problems are straightforward. Distributed systems are considered hard because real systems might drop something like 0.1% of messages, corrupt an even smaller percentage of messages, and see latencies in the microsecond to millisecond range. When I talk to higher-ups and compare what they think they're saying to what my coworkers think they're saying, I find that the rate of lost messages is well over 50%, every message gets corrupted, and latency can be months or years1. When people imagine how long it should take to build something, they're often imagining a team that works perfectly and spends 100% of its time coding. But that's impossible to scale up. The question isn't whether or not there will be inefficiencies, but how much inefficiency. A company that could eliminate organizational inefficiency would be a larger innovation than any tech startup, ever. But when doing the math on how many employees a company “should” have, people usually assume that the company is an efficient organization.

This post happens to use search as an example because I ran across some people who claimed that Lucene was going to surpass Google's capabilities any day now, but there's nothing about this post that's unique to search. If you talk to people in almost any field, you'll hear stories about how people wildly underestimate the complexity of the problems in the field. The point here isn't that it would be impossible for a small team to build something better than Google search. It's entirely plausible that someone will have an innovation as great as PageRank, and that a small team could turn that into a viable company. But once that company is past the VC-funded hyper-growth phase and wants to maximize its profits, it will end up with a multi-thousand-person platforms org, just like Google's, unless the company wants to leave hundreds of millions or billions of dollars a year on the table due to hardware and software inefficiency. And the company will want to handle languages like Thai, Arabic, Chinese, and Japanese, each of which is non-trivial. And the company will want to have relatively good security. And there are the hundreds of little features that users don't even realize are there, each of which provides a noticeable increase in revenue. It's "obvious" that companies should outsource their billing, except that when you talk to companies that handle their own billing, they can point to individual features that increase conversion by single- or double-digit percentages that they can't get from Stripe or Braintree. That fifty-person billing team is totally worth it, beyond a certain size. And then there's sales, which most engineers don't even think of2; the exact same line of reasoning that applies to optimization also applies to sales -- as long as the marginal benefit of adding another salesperson exceeds the cost, you should expect the company to keep adding salespeople, which can often result in a sales force that's larger than the engineering team. There's also research, which, almost by definition, involves a lot of bets that don't pan out!

It's not that all of those things are necessary to run a service at all; it's that almost every large service is leaving money on the table if they don't seriously address those things. This reminds me of a common fallacy we see in unreliable systems, where people build the happy path with the idea that the happy path is the “real” work, and that error handling can be tacked on later. For reliable systems, error handling is more work than the happy path. The same thing is true for large services -- all of this stuff that people don't think of as “real” work is more work than the core service3.

Correction

I often make minor tweaks and add new information without comment, but the original version of this post had an error, and removing the error was a large enough change that I believe it's worth pointing out. I had a back-of-the-envelope calculation on the cost of indexing the web with Lucene, but the numbers were based on benchmark results from some papers and comments from people who work on a commercial search engine. When I tried to reproduce the results from the papers, I found that it was trivial to get orders of magnitude better performance than reported in one paper, and when I tried to track down the underlying source for the comments by people who work on a commercial search engine, I found that there was no experimental evidence underlying the comments, so I removed the example.

I'm experimenting with writing blog posts stream-of-consciousness, without much editing. Both this post and my last post were written that way. Let me know what you think of these posts relative to my “normal” posts!

Thanks to Leah Hanson, Joel Wilder, Kay Rhodes, Heath Borders, Kris Shamloo, Justin Blank, and Ivar Refsdal for corrections.


  1. Recently, I was curious why an org that's notorious for producing unreliable services produces so many unreliable services. When I asked around about why, I found that upper management was afraid of sending out any sort of positive message about reliability because they were afraid that people would use that as an excuse to slip schedules. Upper management changed their message to include reliability about a year ago, but if you talk to individual contributors, they still believe that the message is that features are the #1 priority and slowing down on features to make things more reliable is bad for your career (and, based on who's getting promoted, the individual contributors appear to be right). Maybe in another year, the org will have really gotten the message through to the people who hand out promotions, and in another couple of years, enough software will have been written with reliability in mind that they'll actually have reliable services. Maybe. That's just the first-order effect. The second-order effect is that their policies have caused a lot of people who care about reliability to go to companies that care more about reliability and less about demo-ing shiny new features. They might be able to fix that in a decade. Maybe. That's made harder by the fact that the org is in a company that's well known for having PMs drive features above all else. If that reputation is possible to change, it will probably take multiple decades. [return]
  2. For a lot of products, the sales team is more important than the engineering team. If we build out something rivaling Google search, we'll probably also end up with the infrastructure required to sell a competitive cloud offering. Google actually tried to do that without having a serious enterprise sales force and the result was that AWS and Azure basically split the enterprise market between them. [return]
  3. This isn't to say that there isn't waste or that different companies don't have different levels of waste. I see waste everywhere I look, but it's usually not what people on the outside think of as waste. Whenever I read outsiders' descriptions of what's wasteful at the companies I've worked at, they're almost inevitably wrong. Friends of mine who work at other places also describe the same dynamic. [return]

Is dev compensation bimodal?

2016-09-27 14:33:26

Developer compensation has skyrocketed since the demise of the Google et al. wage-suppressing no-hire agreement, to the point where compensation rivals and maybe even exceeds compensation in traditionally remunerative fields like law, consulting, etc. In software, a "senior" dev salary at a high-paying tech company is $350k/yr, where "senior" can mean "someone three years out of school", and it's not uncommon for someone who's considered a high-performing engineer to make seven figures.

Those fields have sharply bimodal income distributions. Are programmers in for the same fate? Let's see what data we can find. First, let's look at data from the National Association for Law Placement, which shows when legal salaries became bimodal.

Lawyers in 1991

First-year lawyer salaries in 1991. $40k median, trailing off with the upper end just under $90k

Median salary is $40k, with the numbers slowly trickling off until about $90k. According to the BLS, $90k in 1991 is worth about $160k in 2016 dollars. That's a pretty generous starting salary.

Lawyers in 2000

First-year lawyer salaries in 2000. $50k median; bimodal with peaks at $40k and $125k

By 2000, the distribution had become bimodal. The lower peak sits at about the same level as the 1991 median in nominal (non-inflation-adjusted) terms, which would put it substantially lower in real (inflation-adjusted) terms, and there's an upper peak at around $125k, with almost everyone coming in under $130k. $130k in 2000 is $180k in 2016 dollars. Comparing peak to peak, the peak on the left has moved from roughly $30k in 1991 dollars to roughly $40k in 2000 dollars; both of those translate to roughly $55k in 2016 dollars. People in the right mode are doing better, while people in the left mode are doing about the same.

I won't belabor the point with more graphs, but if you look at more recent data, the middle area between the two modes has hollowed out, increasing the level of inequality within the field. As a profession, lawyers have gotten hit hard by automation, and in real terms, 95%-ile offers today aren't really better than they were in 2000. But 50%-ile and even 75%-ile offers are worse off due to the bimodal distribution.

Programmers in 2015

Enough about lawyers! What about programmers? Unfortunately, it's hard to get good data on this. Anecdotally, it sure seems to me like we're going down the same road, but almost all of the public data sources that are available, like H-1B data, have salary numbers and not total compensation numbers. Since compensation at the upper end is disproportionately bonus and stock, most data sets I can find don't capture what's going on.

One notable exception is the new grad compensation data recorded by Dan Zhang and Jesse Collins:

First-year programmer compensation in 2015. Compensation ranges from $50k to $250k

There's certainly a wide range here, and while it's technically bimodal, there isn't a huge gulf in the middle like you see in law and business. Note that this data is mostly bachelor's grads with a few master's grads. PhD numbers, which sometimes go much higher, aren't included.

Do you know of a better (larger) source of data? This is from about 100 data points, members of the "Hackathon Hackers" Facebook group, in 2015. Dan and Jesse also have data from 2014, but it would be nice to get data over a wider timeframe and just plain more data. Also, this data is pretty clearly biased towards the high end — if you look at national averages for programmers at all levels of experience, the average comes in much lower than the average for new grads in this data set. The data here match the numbers I hear when we compete for people, but the population of "people negotiating offers at Microsoft" also isn't representative.

If we had more representative data, it's possible that we'd see a lot more data points in the $40k to $60k range along with the data we have here, which would make the data look bimodal. It's also possible that we'd see a lot more points in the $40k to $60k range, many more in the $70k to $80k range, some more in the $90k+ range, etc., and we'd see a smooth drop-off instead of two distinct modes.
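
One way to make that question concrete, if a more representative sample ever turns up, is to fit one- and two-component mixture models and see which one the data actually supports. This is just a sketch of that idea using scikit-learn; the synthetic numbers below are a stand-in for the real compensation data we don't have.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in data: two made-up modes. Replace with real total-comp numbers.
rng = np.random.default_rng(0)
comp = np.concatenate([
    rng.normal(70_000, 15_000, 500),    # hypothetical "lower mode"
    rng.normal(150_000, 25_000, 500),   # hypothetical "upper mode"
]).reshape(-1, 1)

# Lower BIC wins; if the 2-component fit isn't clearly better,
# the data doesn't give much evidence of bimodality.
for k in (1, 2):
    gm = GaussianMixture(n_components=k, random_state=0).fit(comp)
    print(f"{k} component(s): BIC = {gm.bic(comp):,.0f}")
```

With the data we actually have, I wouldn't trust the answer either way, which is the point of the paragraph above.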

Stepping back from the meager data we have and looking at the circumstances, "should" programmer compensation be bimodal? Most other fields that have bimodal compensation have a very different compensation structure than we see in programming. For example, top law and consulting firms have an up-or-out structure, which is effectively a tournament; that distorts compensation and makes it more likely that compensation ends up being bimodal. Additionally, competitive firms pay the same rate to all first-year employees, which they determine by matching whoever appears to be paying the most. For example, this year, Cravath announced that it would pay first-year associates $180k, and many other firms followed suit. Like most high-end firms, Cravath has a salary schedule that's entirely based on experience:

  • 0 years: $180k
  • 1 year: $190k
  • 2 years: $210k
  • 3 years: $235k
  • 4 years: $260k
  • 5 years: $280k
  • 6 years: $300k
  • 7 years: $315k

In software, compensation tends to be on a case-by-case basis, which makes it much less likely that we'll see a sharp peak the way we do in law. If I had to guess, I'd say that while the dispersion in programmer compensation is increasing, it's not bimodal, but I don't really have the right data set to conclusively say anything. Please point me to any data you have that's better.

Appendix A: please don't send me these

  • H-1B: mostly salary only.
  • Stack Overflow survey: salary only. Also, data is skewed by the heavy web focus of the survey — I stopped doing the survey when none of their job descriptions matched anyone in my entire building, and I know other people who stopped for the same reason.
  • Glassdoor: weirdly inconsistent about whether or not it includes stock compensation. Numbers for some companies seem to, but numbers for other companies don't.
  • O'Reilly survey: salary focused.
  • BLS: doesn't make fine-grained distribution available.
  • IRS: they must have the data, but they're not sharing.
  • IDG: only has averages.
  • internal company data: too narrow.
  • compensation survey companies like PayScale: when I've talked to people from these companies, they acknowledge that they have very poor visibility into large company compensation, but that's what drives the upper end of the market (outside of finance).
  • #talkpay on twitter: numbers skew low1.

Appendix B: why are programmers well paid?

Since we have both programmer and lawyer compensation handy, let's examine that. Programming pays so well that it seems a bit absurd. If you look at other careers with similar compensation, there are multiple factors that act as barriers or disincentives to entry.

If you look at law, you have to win the prestige lottery and get into a top school, which will cost hundreds of thousands of dollars (while it's possible to get a full scholarship, a relatively small fraction of students at top schools are on full scholarships). Then you have to win the grades lottery and get good enough grades to get into a top firm. And then you have to continue winning tournaments to avoid getting kicked out, which requires sacrificing any semblance of a personal life. Consulting, investment banking, etc., are similar. Compensation appears to be proportional to the level of sacrifice (e.g., investment bankers are paid better but work even longer hours than lawyers, private equity is somewhere between investment banking and law in hours and compensation, etc.).

Medicine seems to be a bit better from the sacrifice standpoint because there's a cartel which limits entry into the field, but the combination of medical school and residency is still incredibly brutal compared to most jobs at places like Facebook and Google.

Programming also doesn't have a licensing body limiting the number of programmers, nor is there the same prestige filter where you have to go to a top school to get a well paying job. Sure, there are a lot of startups who basically only hire from MIT, Stanford, CMU, and a few other prestigious schools, and I see job ads like the following whenever I look at startups (the following is from a company that was advertising on Slate Star Codex for quite a long time):

Our team of 14 includes 6 MIT alumni, 3 ex-Googlers, 1 Wharton MBA, 1 MIT Master in CS, 1 CMU CS alum, and 1 "20 under 20" Thiel fellow. Candidates often remark we're the strongest team they've ever seen.

We’re not for everyone. We’re an enterprise SaaS company your mom will probably never hear of. We work really hard 6 days a week because we believe in the future of mobile and we want to win.

Prestige-obsessed places exist. But, in programming, measuring people by markers of prestige seems to be a Silicon Valley startup thing and not a top-paying-company thing. Big companies, which pay a lot better than startups, don't filter people out by prestige nearly as often. Not only do you not need the right degree from the right school, you also don't need to have the right kind of degree, or any degree at all. Although it's getting rarer to not have a degree, I still meet new hires with no experience and either no degree or a degree in an unrelated field (like sociology or philosophy).

How is it possible that programmers are paid so well without these other barriers to entry that similarly remunerative fields have? One possibility is that we have a shortage of programmers. If that's the case, you'd expect more programmers to enter the field, bringing down compensation. CS enrollments have been at record levels recently, so this may already be happening. Another possibility is that programming is uniquely hard in some way, but that seems implausible to me. Programming doesn't seem inherently harder than electrical engineering or chemical engineering, and it certainly hasn't gotten much harder over the past decade, but during that timeframe, programming has gone from having similar compensation to most engineering fields to paying much better. The last time I was negotiating with an EE company about offers, they remarked to me that their VPs don't make as much as I do, and I work at a software company that pays relatively poorly compared to its peers. There's no reason to believe that we won't see a flow of people from engineering fields into programming until compensation is balanced.

Another possibility is that U.S. immigration laws act as a protectionist barrier to prop up programmer compensation. It seems impossible for this to last (why shouldn't there be really valuable non-U.S. companies?), but it does appear to be somewhat true for now. When I was at Google, one thing that was remarkable to me was that they'd pay you approximately the same thing in Washington or Colorado as they do in Silicon Valley, but they'd pay you much less in London. Whenever one of these discussions comes up, people always bring up the "fact" that SV salaries aren't really as good as they sound because the cost of living is so high, but companies will not only match SV offers in Seattle, they'll match them in places like Pittsburgh. My best guess for why this happens is that someone in the Midwest can credibly threaten to move to SV and take a job at any company there, whereas someone in London can't2. While we seem unlikely to loosen current immigration restrictions, our immigration restrictions have caused and continue to cause people who would otherwise have founded companies in the U.S. to found companies elsewhere. Given that the U.S. doesn't have a monopoly on people who found startups and that we do our best to keep people who want to found startups here out, it seems inevitable that there will eventually be Facebooks and Googles founded outside of the U.S. who compete for programmers the same way companies compete inside the U.S.

Another theory that I've heard a lot lately is that programmers at large companies get paid a lot because of the phenomenon described in Kremer's O-ring model. This model assumes that productivity is multiplicative. If your co-workers are better, you're more productive and produce more value. If that's the case, you'd expect a kind of assortative matching where you end up with high-skill firms that pay better and low-skill firms that pay worse. This model has a kind of intuitive appeal to it, but it can't explain why programming compensation has higher dispersion than (for example) electrical engineering compensation. With the prevalence of open source, it's much easier to utilize the work of productive people outside your firm than in most fields. This model should be less true of programming than of most engineering fields, but the dispersion in compensation is higher.
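
For intuition, here's a tiny simulation of the multiplicative-productivity assumption behind the O-ring story. The production function and all of the skill numbers are simplified assumptions for illustration, not Kremer's full model; the point is just that, under multiplicative productivity, the same worker is worth more at a firm where everyone else is also skilled, which is what's supposed to drive high-skill firms to pay more.

```python
from math import prod

def output(skills, value_per_task=1_000_000):
    # Simplified O-ring-style production: n workers each handle one task,
    # and a mistake by any of them degrades the whole chain, so output
    # is multiplicative in skill.
    return value_per_task * len(skills) * prod(skills)

high_skill_firm = [0.95] * 9
low_skill_firm = [0.75] * 9
candidate = 0.95

for name, team in [("high-skill firm", high_skill_firm),
                   ("low-skill firm", low_skill_firm)]:
    gain = output(team + [candidate]) - output(team)
    print(f"{name}: the same hire adds ${gain:,.0f} of output")
```

Under these made-up numbers, the identical candidate is worth roughly eight times as much to the high-skill firm, which is the assortative-matching intuition; the objection above is that open source should make real programming look less like this than classical engineering does, not more.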

A related theory that can't be correct for similar reasons is that high-paid software engineers are extra elite, the best of the best, and are simply paid more because they're productive. If you look at how many programmers the BLS says exist in the U.S. (on the order of a few million) and how many engineers high-paying tech companies employ in the U.S. (on the order of a couple or a few hundred thousand), high-paying software companies literally can't consist of the top 1%. Even if their filters were perfect (as opposed to the complete joke that they're widely regarded to be), they couldn't be better than 90%-ile. Realistically, it's more likely that the median programmer at a high-paying tech company is a bit above 50%-ile.

The most common theory I've heard is that "software is eating the world". The theory goes: of course programmers get paid a lot and will continue to get paid a lot because software is important and only becoming more important. Despite being the most commonly stated theory I've heard, this seems nonsensical if you compare it to other fields. You could've said this about microprocessor design in the late 90s as well as fiber optics. Those fields are both more important today than they were in the 90s: not only is there more demand for processing power and bandwidth than ever before, demand for software actually depends on them. And yet, the optics engineering job market still hasn't recovered from the dot-com crash, and the microprocessor design engineer market, after recovering, still pays experienced PhDs less than a CS new grad at Facebook makes.

Furthermore, any argument for high programmer pay that relies on some inherent property of market conditions, the economy at large, the impact of programming, etc., seems like it cannot be correct if you look at what's actually driven up programmer pay. FB declined to participate in the Google/Apple wage-fixing agreement that became basically industry wide, which meant that FB was outpaying other major tech companies. When the wage-fixing agreement was lifted, other companies "had to" come close to matching FB compensation to avoid losing people both to FB and to each other. When they did that, FB kept raising the bar on compensation and compensation kept getting better. [2022 update] This can most clearly be seen with changes to benefits and pay structure, where FB would make a change, Google would follow suit immediately, and other companies would pick up the change later, as when FB removed vesting cliffs and Google did the same within weeks and the change trickled out across the industry. There are companies that were paying programmers as well as or better than FB, like Netflix and a variety of finance companies, but major tech companies tended not to match offers from those places because they were too small to hire away enough programmers to be concerning. FB, though, is large and hires enough to be a concern to Google, which matches FB; combined, they're large enough to be a concern to other major tech companies.

Because the mechanism for compensation increases has been arbitrary (FB might not have existed, or Zuckerberg, who has total control of FB, could have decided on a different compensation policy), it's quite arbitrary that programmer pay is as good as it is.

In conclusion, high programmer pay seems like a mystery to me, and I would love to hear a compelling theory for why programming "should" pay more than other similar fields, or why it should pay as much as fields that have much higher barriers to entry.

Update

Eric Roberts has observed that it takes a long time for CS enrollments to recover after a downturn, leading to a large deficit in the number of people with CS degrees vs. demand.

More than a one decade lag between downturn and recovery in enrollment

The 2001 bubble bursting caused a severe drop in CS enrollment. CS enrollment didn't hit its previous peak again until 2014, and if you fit the graph and extrapolate against the peaks, it took another year or two for enrollments to hit the historical trend. If we didn't have any data, it wouldn't be surprising to find that there's a five-year delay. Of the people who graduate in four years (as opposed to five or more), most aren't going to change their major after mid or late sophomore year, so that's already two to three years of delay right there. And after a downturn, it takes some time to recover, so we'd expect at least another two to three years. Roberts makes a case that the additional latency came from a number of other factors, including the fear that, even though things looked ok, jobs would be outsourced soon, and a slow response by colleges.

Dan Wang has noted that, according to the SO survey, 3/4 of developers have a BS degree (or higher). If it's statistically "hard" to get a high-paying job without a CS degree and there's a decade-plus hangover from the 2001 downturn, that could explain why programmer compensation is so high. Of course, most of us know people in the industry without a degree, but it seems to be harder to find an entry-level position without a credential.

It's not clear what this means for the future. Even if the lack of candidates with the appropriate credential is a major driver of programmer compensation, it's unclear what the record CS enrollments over the past few years mean for future compensation. It's possible that record enrollments mean that we should expect compensation to come back down to the levels we see in other fields that require similar skills, like electrical engineering. It's also possible that enrollment continues to lag behind demand by a decade and that record enrollments are just keeping pace with demand from a decade ago, in which case we might expect elevated compensation to persist (as long as other factors, like hiring outside of the U.S., don't influence things too much). Since there's so much latency, another possibility is that enrollment has overshot or will overshoot demand and we should expect programmer compensation to decline. And it's not even clear that the Roberts paper makes sense as an explanation for high current comp because Roberts also found a huge capacity crunch in the 80s and, while some programmers were paid very well back then, the fraction of programmers who were paid "very well" seems to have been much smaller than it is today. Google alone employs 30k engineers. If 20k of those are programmers in the U.S., and we accept the estimate that there are 3 million programmers in the U.S., then Google alone employs roughly 0.7% of programmers in the U.S. If you add in the other large companies that are known to pay competitively (Amazon, Facebook, etc.), that's a significant fraction of all programmers in the U.S., which I believe is quite different from the situation in the 80s.

The most common response I've gotten to this post is that we should expect programmers to be well-paid because software is everywhere and there will be at least as much software in the future. This exact same line of reasoning could apply to electrical engineering, which is more fundamental than software, in that software requires hardware, and yet electrical engineering comp isn't in the same league as programmer comp. Highly paid programmers couldn't get their work done without microprocessors, and there are more processors sold than ever before, but the comp packages for a "senior" person at places like Intel and Qualcomm aren't even within a factor of two of those at Google or Facebook. You could also make a similar argument for people who work on water and sewage systems, but those folks don't see compensation that's in the same range as programmers' either. Any argument of the form "the price for X is high because X is important" implicitly assumes that there's some force constraining the supply of X. The claim that "X is important" or "we need a lot of X" is missing half the story. Another problem with claims like "X is important" or "X is hard" is that these statements don't seem any less true of industries that pay much less. If your explanation of why programmers are well paid is just as true of any "classical" engineering discipline, you need some explanation of why those other fields shouldn't be as well paid.

The second most common comment that I hear is that, of course programmers are well paid, software companies are worth so much, which makes it inevitable. But there's nothing inevitable about workers actually being well compensated because a company is profitable. Someone who made this argument sent me a link to this list of the most profitable companies per employee. The list has some software companies that pay quite well, like Alphabet (Google) and Facebook, but we also see hardware companies like Qualcomm, Cisco, TSMC (and arguably SoftBank now that they've acquired ARM) that don't even pay as well as software companies that don't turn a profit or that barely make money and have no path to being wildly profitable in the future. Moreover, the compensation at the software companies that are listed isn't very strongly related to their profit per employee.

To take a specific example that I'm familiar with because I grew up in Madison, the execs at Epic Systems have built a company that's generated so much wealth that its founder has an estimated net worth of $3.6 billion, which is much more than all but the most successful founders in tech. But line engineers at Epic are paid significantly less than engineers at tech companies that compete with SV for talent, even tech companies that have never made any money. What is it about some software companies that make a similar amount of money that prevents them from funneling virtually all of the wealth they generate up to the top? The typical answer to this is cost of living, but as we've seen, that makes even less sense than usual in this case since Google has an office in the same city as Epic, and Google pays well over double what Epic does for a typical dev. If there were some kind of simple cost of living adjustment, you'd expect Google to pay less in Madison than in Toronto or London, but it seems to be the other way around. This isn't unique to Madison — just for example, you can find a number of successful software companies in Austin that pay roughly half what Amazon and Facebook pay in the same city, where upper management does very well for themselves and line engineers make a fine living, but nowhere near as much as they'd make if they moved to a company like Amazon or Facebook.

The thing all of these theories have in common is that they apply to other fields as well, so they cannot be, as stated, the reason programmers are better paid than people in those other fields. Someone could argue that programming has a unique combination of many of these, or that one of these reasons should be expected to apply much more strongly to programming than to any other field, but I haven't seen anyone make that case. Instead, people just make obviously bogus statements like "programming is really hard" (which is only valid as a reason, in this discussion, if programming is literally the hardest field in existence and much harder than other engineering fields).


  1. People often worry that comp surveys will skew high because people want to brag, but the reality seems to be that numbers skew low because people feel embarrassed about sounding like they're bragging. I have a theory that you can see this reflected in the prices of other goods. For example, if you look at house prices, they're generally predictable based on location, square footage, amenities, and so on. But there's a significant penalty for having the largest house on the block, for what (I suspect) is the same reason people with the highest compensation disproportionately don't participate in #talkpay: people don't want to admit that they have the highest pay, have the biggest house, or drive the fanciest car. Well, some people do, but on average, bragging about that stuff is seen as gauche. [return]
  2. There's a funny move some companies will do where they station the new employee in Canada for a year before importing them into the U.S., which gets them into a visa process that's less competitive. But this is enough of a hassle that most employees balk at the idea. [return]

How I learned to program

2016-09-12 16:41:26

Tavish Armstrong has a great document where he describes how and when he learned the programming skills he has. I like this idea because I've found that the paths that people take to get into programming are much more varied than stereotypes give credit for, and I think it's useful to see that there are many possible paths into programming.

Personally, I spent a decade working as an electrical engineer before taking a programming job. When I talk to people about this, they often want to take away a smooth narrative of my history. Maybe it's that my math background gives me tools I can apply to a lot of problems, maybe it's that my hardware background gives me a good understanding of performance and testing, or maybe it's that the combination makes me a great fit for hardware/software co-design problems. People like a good narrative. One narrative people seem to like is that I'm a good problem solver, and that problem solving ability is generalizable. But reality is messy. Electrical engineering seemed like the most natural thing in the world, and I picked it up without trying very hard. Programming was unnatural for me, and didn't make any sense at all for years. If you believe in the common "you either have it or you don't" narrative about programmers, I definitely don't have it. And yet, I now make a living programming, and people seem to be pretty happy with the work I do.

How'd that happen? Well, if we go back to the beginning, before becoming a hardware engineer, I spent a fair amount of time doing failed kid-projects (e.g., writing a tic-tac-toe game and AI) and not really "getting" programming. I do sometimes get a lot of value out of my math or hardware skills, but I suspect I could teach someone the actually applicable math and hardware skills I have in less than a year. Spending five years in a school and a decade in industry to pick up those skills was a circuitous route to getting where I am. Amazingly, I've found that my path has been more direct than that of most of my co-workers, giving the lie to the narrative that most programmers are talented whiz kids who took to programming early.

And while I only use a small fraction of the technical skills I've learned on any given day, I find that I have a meta-skill set that I use all the time. There's nothing profound about the meta-skill set, but because I often work in new (to me) problem domains, I find my meta-skill set to be more valuable than my actual skills. I don't think that you can communicate the importance of meta-skills (like communication) by writing a blog post any more than you can explain what a monad is by saying that it's like a burrito. That being said, I'm going to tell this story anyway.

Ineffective fumbling (1980s - 1996)

Many of my friends and I tried and failed multiple times to learn how to program. We tried BASIC, and could write some simple loops, use conditionals, and print to the screen, but never figured out how to do anything fun or useful.

We were exposed to some kind of lego-related programming, uhhh, thing in school, but none of us had any idea how to do anything beyond what was in the instructions. While it was fun, it was no more educational than a video game and had a similar impact.

One of us got a game programming book. We read it, tried to do a few things, and made no progress.

High school (1996 - 2000)

Our ineffective fumbling continued through high school. Due to an interest in gaming, I got interested in benchmarking, which eventually led to learning about CPUs and CPU microarchitecture. This was in the early days of Google, before Google Scholar, and before most CS/EE papers could be found online for free, so this was mostly material from enthusiast sites. Luckily, the internet was relatively young, as were the users on the sites I frequented. Much of the material on hardware was targeted at (and even written by) people like me, which made it accessible. Unfortunately, a lot of the material on programming was written by and targeted at professional programmers, things like Paul Hsieh's optimization guide. There were some beginner-friendly guides to programming out there, but my friends and I didn't stumble across them.

We had programming classes in high school: an introductory class that covered Visual Basic and an AP class that taught C++. Both classes were taught by someone who didn't really know how to program or how to teach programming. My class had a couple of kids who already knew how to program and would make good money doing programming competitions on topcoder when it opened, but they failed to test out of the intro class because that test included things like a screenshot of the VB6 IDE, where you got a point for correctly identifying what each button did. The class taught about as much as you'd expect from a class where the pre-test involved identifying UI elements from an IDE.

The AP class the year after was similarly effective. About halfway through the class, a couple of students organized an independent study group which worked through an alternate textbook because the class was clearly not preparing us for the AP exam. I passed the AP exam because it was one of those multiple choice tests that's possible to pass without knowing the material.

Although I didn't learn much, I wouldn't have graduated high school if not for AP classes. I failed enough individual classes that I almost didn't have enough credits to graduate. I got those necessary credits for two reasons. First, a lot of the teachers had a deal where, if you scored well on the AP exam, they would give you a passing grade in the class (usually an A, but sometimes a B). Second, even that wouldn't have been enough if my chemistry teacher hadn't also changed my grade to a passing grade when he found out I did well on the AP chemistry test1.

Other than not failing out of high school, I'm not sure I got much out of my AP classes. My AP CS class actually had a net negative effect on my learning to program because the AP test let me opt out of the first two intro CS classes in college (an introduction to programming and a data structures course). In retrospect, I should have taken the intro classes, but I didn't, which left me with huge holes in my knowledge that I didn't really fill in for nearly a decade.

College (2000 - 2003)

Because I'd nearly failed out of high school, there was no reasonable way I could have gotten into a "good" college. Luckily, I grew up in Wisconsin, a state with a "good" school that used a formula to determine who would automatically get admitted: the GPA cutoff depended on standardized test scores, and anyone with standardized test scores above a certain mark was admitted regardless of GPA. During orientation, I talked to someone who did admissions and found out that my year was the last year they used the formula.

I majored in computer engineering and math for reasons that seem quite bad in retrospect. I had no idea what I really wanted to study. I settled on either computer engineering or engineering mechanics because both of those sounded "hard".

I made a number of attempts to come up with better criteria for choosing a major. The most serious was when I spent a week talking to professors in an attempt to find out what day-to-day life in different fields was like. That approach had two key flaws. First, most professors don't know what it's like to work in industry; now that I work in industry and talk to folks in academia, I see that most academics who haven't done stints in industry have a lot of misconceptions about what it's like. Second, even if I managed to get accurate descriptions of different fields, it turns out that there's a wide body of research that indicates that humans are basically hopeless at predicting which activities they'll enjoy. Ultimately, I decided by coin flip.

Math

I wasn't planning on majoring in math, but my freshman intro calculus course was so much fun that I ended up adding a math major. That only happened because a high-school friend of mine passed me the application form for the honors calculus sequence because he thought I might be interested in it (he'd already taken the entire calculus sequence as well as linear algebra). The professor for the class covered the material at an unusually fast pace: he finished what was supposed to be a year-long calculus textbook part-way through the semester and then lectured on his research for the rest of the semester. The class was theorem-proof oriented and didn't involve any of that yucky memorization that I'd previously associated with math. That was the first time I'd found school engaging in my entire life and it made me really look forward to going to math classes. I later found out that non-honors calculus involved a lot of memorization when the engineering school required me to go back and take calculus II, which I'd skipped because I'd already covered the material in the intro calculus course.

If I hadn't had a friend drop the application for honors calculus in my lap, I probably wouldn't have majored in math and it's possible I never would have found any classes that seemed worth attending. Even as it was, all of the most engaging undergrad professors I had were math professors2 and I mostly skipped my other classes. I don't know how much of that was because my math classes were much smaller, and therefore much more customized to the people in the class (computer engineering was very trendy at the time, and classes were overflowing), and how much was because these professors were really great teachers.

Although I occasionally get some use out of the math that I learned, most of the value was in becoming confident that I can learn and work through the math I need to solve any particular problem.

Engineering

In my engineering classes, I learned how to debug and how computers work down to the transistor level. I spent a fair amount of time skipping classes and reading about topics of interest in the library, which included things like computer arithmetic and circuit design. I still have fond memories of Koren's Computer Arithmetic Algorithms and Chandrakasan et al.'s Design of High-Performance Microprocessor Circuits. I also started reading papers; I spent a lot of time in libraries reading physics and engineering papers that mostly didn't make sense to me. The notable exception was systems papers, which I found to be easy reading. I distinctly remember reading the Dynamo paper (this was HP's paper on JITs, not the more recent Amazon work of the same name), but I can't recall any other papers I read back then.

Internships

I had two internships, one at Micron where I "worked on" flash memory, and another at IBM where I worked on the POWER6. The Micron internship was a textbook example of a bad internship. When I showed up, my manager was surprised that he was getting an intern and had nothing for me to do. After a while (perhaps a day), he found an assignment for me: press buttons on a phone. He'd managed to find a phone that used Micron flash chips; he handed it to me, told me to test it, and walked off.

After poking at the phone for an hour or two and not being able to find any obvious bugs, I walked around and found people who had tasks I could do. Most of them were only slightly less manual than "testing" a phone by mashing buttons, but I did one not-totally-uninteresting task, which was to verify that a flash chip's controller behaved correctly. Unlike my other tasks, this was amenable to automation and I was able to write a Perl script to do the testing for me.

I chose Perl because someone had a Perl book on their desk that I could borrow, which seemed like as good a reason as any at the time. I called up a friend of mine to tell him about this great "new" language and we implemented Age of Renaissance, a board game we'd played in high school. We didn't finish, but Perl was easy enough to use that we felt like we could write a program that actually did something interesting.

Besides learning Perl, I learned that I could ask people for books and read them, and I spent most of the rest of my internship half keeping an eye on a manual task while reading the books people had lying around. Most of the books had to do with either analog circuit design or flash memory, so that's what I learned. None of the specifics have really been useful to me in my career, but I learned two meta-items that were useful.

First, no one's going to stop you from spending time reading at work or spending time learning (on most teams). Micron did its best to keep interns from learning by having a default policy of blocking interns from having internet access (managers could override the policy, but mine didn't), but no one will go out of their way to prevent an intern from reading books when their other task is to randomly push buttons on a phone.

Second, I learned that there are a lot of engineering problems we can solve without anyone knowing why. One of the books I read was a survey of then-current research on flash memory. At the time, flash memory relied on some behaviors that were well characterized but not really understood. There were theories about how the underlying physical mechanisms might work, but determining which theory was correct was still an open question.

The next year, I had a much more educational internship at IBM. I was attached to a logic design team on the POWER6, and since they didn't really know what to do with me, they had me do verification on the logic they were writing. They had a relatively new tool called SixthSense, which you can think of as a souped-up QuickCheck. The obvious skill I learned was how to write tests using a fancy testing framework, but the meta-thing I learned, which has been even more useful, is that writing a test-case generator and a checker is often much more productive than the manual test-case writing that passes for automated testing in most places.
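
The generator-and-checker pattern is worth making concrete. This isn't SixthSense or anything like IBM's actual tooling, just a minimal sketch of the idea in Python: generate random inputs, run the thing under test, and check it against a slow-but-obviously-correct reference model.

```python
import random

def clamp_fast(x: int, lo: int, hi: int) -> int:
    """The 'unit under test': some hand-optimized routine we want to check."""
    return min(max(x, lo), hi)

def clamp_reference(x: int, lo: int, hi: int) -> int:
    """Slow but obviously correct reference model."""
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def generate_case(rng: random.Random) -> tuple[int, int, int]:
    """Random test-case generator, biased toward boundary values."""
    lo = rng.randint(-1000, 1000)
    hi = lo + rng.randint(0, 1000)
    x = rng.choice([lo - 1, lo, hi, hi + 1, rng.randint(-2000, 2000)])
    return x, lo, hi

rng = random.Random(0)
for _ in range(100_000):
    x, lo, hi = generate_case(rng)
    assert clamp_fast(x, lo, hi) == clamp_reference(x, lo, hi), (x, lo, hi)
print("100,000 generated cases passed")
```

A hundred thousand generated cases run in seconds and hit corner cases nobody would bother to write by hand, which is where the productivity difference over manually written test cases comes from.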

The other thing I encountered for the first time at IBM was version control (CVS, unfortunately). Looking back, I find it a bit surprising that not only did I never use version control in any of my classes, but I'd never met any other students who were using version control. My IBM internship was between undergrad and grad school, so I managed to get a B.S. degree without ever using or seeing anyone use version control.

Computer Science

I took a couple of CS classes. The first was algorithms, which was poorly taught and so heavily curved as a result that I got an A despite not learning anything at all. The course involved no programming and while I could have done some implementation in my free time, I was much more interested in engineering and didn't try to apply any of the material.

The second course was databases. There were a couple of programming projects, but they were all projects where you got some scaffolding and only had to implement a few key methods to make things work, so it was possible to do ok without having any idea how to program. I got involved in a competition to see who could attend the fewest possible classes, didn't learn anything, and scraped by with a B.

Grad school (2003 - 2005)

After undergrad, I decided to go to grad school for a couple of silly reasons. One was a combination of "why not?" and the argument that most of my professors gave, which was that you'll never go if you don't go immediately after undergrad because it's really hard to go back to school later. But the reason that people don't go back later is that they have more information (they know what both school and work are like), and they almost always choose work! The other major reason was that I thought I'd get a more interesting job with a master's degree. That's not obviously wrong, but it appears to be untrue in general for people going into electrical engineering and programming.

I don't know that I learned anything that I use today, either in the direct sense or in a meta sense. I had some great professors3 and I made some good friends, but I think that this wasn't a good use of time because of two bad decisions I made at the age of 19 or 20. Rather than attend a school that had a lot of people working in an area I was interested in, I went with a school that gave me a fellowship but only had one person working in an area I was really interested in. That person left just before I started.

I ended up studying optics, and while learning a new field was a lot of fun, the experience was of no particular value to me, and I could have had fun studying something I had more of an interest in.

While I was officially studying optics, I still spent a lot of time learning unrelated things. At one point, I decided I should learn Lisp or Haskell, probably because of something Paul Graham wrote. I couldn't find a Lisp textbook in the library, but I found a Haskell textbook. After I worked through the exercises, I had no idea how to accomplish anything practical. But I did learn about list comprehensions and got in the habit of using higher-order functions.

Based on internet comments and advice, I had the idea that learning more languages would teach me how to be a good programmer, so I worked through introductory books on Python and Ruby. As far as I can tell, this taught me basically nothing useful and I would have been much better off learning about a specific area (like algorithms or networking) than learning lots of languages.

First real job (2005 - 2013)

Towards the end of grad school, I mostly looked for, and found, electrical/computer engineering jobs. The one notable exception was Google, which called me up in order to fly me out to Mountain View for an interview. I told them that they probably had the wrong person because they hadn't even done a phone screen, so they offered to do a phone interview instead. I took the phone interview expecting to fail because I didn't have any CS background, and I failed as expected. In retrospect, I should have asked to interview for a hardware position, but at the time I didn't know they had hardware positions, even though they'd been putting together their own servers and designing some of their own hardware for years.

Anyway, I ended up at a little chip company called Centaur. I was hesitant about taking the job because the interview was the easiest interview I had at any company4, which made me wonder if they had a low hiring bar, and therefore relatively weak engineers. It turns out that, on average, that's the best group of people I've ever worked with. I didn't realize it at the time, but this would later teach me that companies that claim to have brilliant engineers because they have super hard interviews are full of it, and that the interview difficulty one-upmanship a lot of companies promote is more of a prestige play than anything else.

But I'm getting ahead of myself -- my first role was something they call "regression debug", which included debugging test failures for both newly generated tests and regression tests. The main goal of this job was to teach new employees the ins and outs of the x86 architecture. At the time, Centaur's testing was very heavily based on chip-level testing done by injecting real instructions, interrupts, etc., onto the bus, so debugging test failures taught new employees everything there is to know about x86.

The Intel x86 manual is thousands of pages long and it isn't sufficient to implement a compatible x86 chip. When Centaur made its first x86 chip, they followed the Intel manual in perfect detail, which leaves all instances of undefined behavior up to individual implementers. When they got their first chip back and tried it, they found that some compilers produced code that relied on behavior that's technically undefined on x86 but happened to always be the same on Intel chips. While that's technically a compiler bug, you can't ship a chip that isn't compatible with actually existing software, and ever since then, Centaur has implemented x86 chips by making sure that the chips match the exact behavior of Intel chips, down to matching officially undefined behavior5.

For years afterwards, I had encyclopedic knowledge of x86 and could set bits in control registers and MSRs from memory. I didn't have a use for any of that knowledge at any future job, but the meta-skill of not being afraid of low-level hardware comes in handy pretty often, especially when I run into compiler or chip bugs. People look at you like you're a crackpot if you say you've found a hardware bug, but because we were so careful about characterizing the exact behavior of Intel chips, we would regularly find bugs and then have discussions about whether we should match the bug or match the spec (the Intel manual).

The other thing I took away from the regression debug experience was a lifelong love of automation. Debugging often involves a large number of mechanical steps. After I learned enough about x86 that debugging became boring, I started automating debugging. At that point, I knew how to write simple scripts but didn't really know how to program, so I wasn't able to totally automate the process. However, I was able to automate enough that, for 99% of failures, I just had to glance at a quick summary to figure out what the bug was, rather than spend what might be hours debugging. That turned what was previously a full-time job into something that took maybe 30-60 minutes a day (excluding days when I'd hit a bug that involved some obscure corner of x86 I wasn't already familiar with, or some bug that my script couldn't give a useful summary of).

At that point, I did two things that I'd previously learned in internships. First, I started reading at work. I began with online commentary about programming, but there wasn't much of that, so I asked if I could expense books and read them at work. This seemed perfectly normal because a lot of other people did the same thing, and there were at least two people who averaged more than one technical book per week, including one person who averaged a technical book every 2 or 3 days.

I settled in at a pace of somewhere between a book a week and a book a month. I read a lot of engineering books that imparted some knowledge that I no longer use, now that I spend most of my time writing software; some "big idea" software engineering books like Design Patterns and Refactoring, which I didn't really appreciate because I was just writing scripts; and a ton of books on different programming languages, which doesn't seem to have had any impact on me.

The only book I read back then that changed how I write software in a way that's obvious to me was The Design of Everyday Things. The core idea of the book is that while people beat themselves up for failing to use hard-to-understand interfaces, we should blame designers for designing poor interfaces, not users for failing to use them.

If you ever run into a door that you incorrectly try to pull instead of push (or vice versa) and have some spare time, try watching how other people use the door. Whenever I do this, I'll see something like half the people who try the door use it incorrectly. That's a design flaw!

The Design of Everyday Things has made me a lot more receptive to API and UX feedback, and a lot less tolerant of programmers who say things like "it's fine -- everyone knows that the arguments to foo and bar just have to be given in the opposite order" or "Duh! Everyone knows that you just need to click on the menu X, select Y, navigate to tab Z, open AA, go to tab AB, and then slide the setting to AC."

I don't think all of that reading was a waste of time, exactly, but I would have been better off picking a few sub-fields in CS or EE and learning about them, rather than reading the sorts of books O'Reilly and Manning produce.

It's not that these books aren't useful, it's that almost all of them are written to make sense without any particular background beyond what any random programmer might have, and you can only get so much out of reading your 50th book targeted at random programmers. IMO, most non-academic conferences have the same problem. As a speaker, you want to give a talk that works for everyone in the audience, but a side effect of that is that many talks have relatively little educational value to experienced programmers who have been to a few conferences.

I think I got positive things out of all that reading as well, but I don't know yet how to figure out what those things are.

As a result of my reading, I also did two things that were, in retrospect, quite harmful.

One was that I really got into functional programming and used a functional style everywhere I could. Immutability, higher-order X for any possible value of X, etc. The result was code that I could write and modify quickly that was incomprehensible to anyone but a couple of coworkers who were also into functional programming.

The second big negative was that I became convinced that Perl was causing us a lot of problems. We had Perl scripts that were hard to understand and modify. They'd often be thousands of lines of code with only one or two functions and no tests, using every obscure Perl feature you could think of. Static! Magic sigils! Implicit everything! You name it, we used it. For me, the last straw was when I inserted a new function between two functions that didn't explicitly pass any arguments or return values -- and broke the script, because one of those functions was returning a value into an implicit variable that the next function then read. By putting another function between the two closely coupled functions, I broke the script.

After that, I convinced a bunch of people to use Ruby and started using it myself. The problem was that I only managed to convince half of my team to do this. The other half kept using Perl, which resulted in language fragmentation. Worse yet, another group also got fed up with Perl, but started using Python, leaving the company with code in Perl, Python, and Ruby.

Centaur has an explicit policy of not telling people how to do anything, which precludes having team-wide or company-wide standards. Given the environment, using a "better" language seemed like a natural thing to do, but I didn't recognize the cost of fragmentation until, later in my career, I saw a company that uses standardization to good effect.

Anyway, while I was causing horrific fragmentation, I also automated away most of my regression debug job. I got bored of spending 80% of my time at work reading and I started poking around for other things to do, which is something I continued for my entire time at Centaur. I like learning new things, so I did almost everything you can do related to chip design. The only things I didn't do were circuit design (the TL of circuit design didn't want a non-specialist interfering in his area) and a few roles where I was told "Dan, you can do that if you really want to, but we pay you too much to have you do it full-time."

If I hadn't interviewed regularly (about once a year, even though I was happy with my job), I probably would've wondered if I was stunting my career by doing so many different things, because the big chip companies produce specialists pretty much exclusively. But in interviews I found that my experience was valued because it was something they couldn't get in-house. The irony is that every single role I was offered would have turned me into a specialist. Big chip companies talk about wanting their employees to move around and try different things, but when you dig into what that means, it's that they like to have people work one very narrow role for two or three years before moving on to their next very narrow role.

For a while, I wondered if I was doomed to either eventually move to a big company and pick up a hyper-specialized role, or stay at Centaur for my entire career (not a bad fate -- Centaur has, by far, the lowest attrition rate of any place I've worked because people like it so much). But I later found that software companies building hardware accelerators actually have generalist roles for hardware engineers, and that software companies have generalist roles for programmers, although that might be a moot point since most software folks would probably consider me an extremely niche specialist.

Regardless of whether spending a lot of time in different hardware-related roles makes you think of me as a generalist or a specialist, I picked up a lot of skills which came in handy when I worked on hardware accelerators, but that don't really generalize to the pure software project I'm working on today. A lot of the meta-skills I learned transfer over pretty well, though.

If I had to pick the three most useful meta-skills I learned back then, I'd say they were debugging, bug tracking, and figuring out how to approach hard problems.

Debugging is a funny skill to claim to have because everyone thinks they know how to debug. For me, I wouldn't even say that I learned how to debug at Centaur, but that I learned how to be persistent. Non-deterministic hardware bugs are so much worse than non-deterministic software bugs that I always believe I can track down software bugs. In the absolute worst case, when there's a bug that isn't caught in logs and can't be caught in a debugger, I can always add tracing information until the bug becomes obvious. The same thing's true in hardware, but "recompiling" to add tracing information takes 3 months per "recompile"; compared to that experience, tracking down a software bug that takes three months to figure out feels downright pleasant.

Bug tracking is another meta-skill that everyone thinks they have, but when I look at most projects, I find that they literally don't know what bugs they have and they lose bugs all the time due to a failure to triage bugs effectively. I didn't even know that I'd developed this skill until after I left Centaur and saw teams that don't know how to track bugs. At Centaur, depending on the phase of the project, we'd have between zero and a thousand open bugs. The people I worked with most closely kept a mental model of what bugs were open; this seemed totally normal at the time, and the fact that a bunch of people did this made it easy for people to be on the same page about the state of the project and which areas were ahead of schedule and which were behind.

Outside of Centaur, I find that I'm lucky to even find one person who's tracking what the major outstanding bugs are. Until I've been on the team for a while, people are often uncomfortable with the idea of taking a major problem and putting it into a bug instead of fixing it immediately because they're so used to bugs getting forgotten that they don't trust bugs. But that's what bug tracking is for! I view this as analogous to teams whose test coverage is so low and staging system is so flaky that they don't trust themselves to make changes because they don't have confidence that issues will be caught before hitting production. It's a huge drag on productivity, but people don't really see it until they've seen the alternative.

Perhaps the most important meta-skill I picked up was learning how to solve large problems. When I joined Centaur, I saw people solving problems I didn't even know how to approach. There were folks like Glenn Henry, a fellow from IBM back when IBM was at the forefront of computing, and Terry Parks, who Glenn called the best engineer he knew at IBM. It wasn't that they were 10x engineers; they didn't just work faster. In fact, I can probably type 10x as quickly as Glenn (a hunt and peck typist) and could solve trivial problems that are limited by typing speed more quickly than him. But Glenn, Terry, and some of the other wizards knew how to approach problems that I couldn't even get started on.

I can't cite any particular a-ha moment. It was just eight years of work. When I went looking for problems to solve, Glenn would often hand me a problem that was slightly harder than I thought possible for me. I'd tell him that I didn't think I could solve the problem, he'd tell me to try anyway, and maybe 80% of the time I'd solve the problem. We repeated that for maybe five or six years before I stopped telling Glenn that I didn't think I could solve the problem. Even though I don't know when it happened, I know that I eventually started thinking of myself as someone who could solve any open problem that we had.

Grad school, again (2008 - 2010)

At some point during my tenure at Centaur, I switched to being part-time and did a stint taking classes and doing a bit of research at the local university. For reasons which I can't recall, I split my time between software engineering and CS theory.

I read a lot of software engineering papers and came to the conclusion that we know very little about what makes teams (or even individuals) productive, and that the field is unlikely to have actionable answers in the near future. I also got my name on a couple of papers that I don't think made meaningful contributions to the state of human knowledge.

On the CS theory side of things, I took some graduate level theory classes. That was genuinely educational and I really "got" algorithms for the first time in my life, as well as complexity theory, etc. I could have gotten my name on a paper that I didn't think made a meaningful contribution to the state of human knowledge, but my would-be co-author felt the same way and we didn't write it up.

I originally tried grad school again because I was considering getting a PhD, but I didn't find the work I was doing to be any more "interesting" than the work I had at Centaur, and after seeing the job outcomes of people in the program, I decided there was less than 1% chance that a PhD would provide any real value to me and went back to Centaur full time.

RC (Spring 2013)

After eight years at Centaur, I wanted to do something besides microprocessors. I had enough friends at other hardware companies to know that I'd be downgrading in basically every dimension except name recognition if I switched to another hardware company, so I started applying to software jobs.

While I was applying to jobs, I heard about RC. It sounded great, maybe even too great: when I showed my friends what people were saying about it, they thought the comments were fake. It was a great experience, and I can see why so many people raved about it, to the point where real comments sound impossibly positive. It was transformative for a lot of people; I heard a lot of exclamations like "I learned more in 3 months here than in N years of school" or "I was totally burnt out and this was the first time I've been productive in a year". It wasn't transformative for me, but it was as fun a 3 month period as I've ever had, and I even learned a thing or two.

From a learning standpoint, the one major thing I got out of RC was feedback from Marek, whom I worked with for about two months. While the freedom and lack of oversight at Centaur was great for letting me develop my ability to work independently, I basically didn't get any feedback on my work6 since they didn't do code review while I was there, and I never really got any actionable feedback in performance reviews.

Marek is really great at giving feedback while pair programming, and working with him broke me of a number of bad habits as well as teaching me some new approaches for solving problems. At a meta level, RC is relatively more focused on pair programming than most places and it got me to pair program for the first time. I hadn't realized how effective pair programming with someone is in terms of learning how they operate and what makes them effective. Since then, I've asked a number of super productive programmers to pair program and I've gotten something out of it every time.

Second real job (2013 - 2014)

I was in the right place at the right time to land on a project that was just transitioning from Andy Phelps' pet 20% time project into what would later be called the Google TPU.

As far as I can tell, it was pure luck that I was the second engineer on the project as opposed to the fifth or the tenth. I got to see what it looks like to take a project from its conception and turn it into something real. There was a sense in which I got that at Centaur, but every project I worked on was either part of a CPU, or a tool whose goal was to make CPU development better. This was the first time I worked on a non-trivial project from its inception, where I wasn't just working on part of the project but the whole thing.

That would have been educational regardless of the methodology used, but it was a particularly great learning experience because of how the design was done. We started with a lengthy discussion on what core algorithm we were going to use. After we figured out an algorithm that would give us acceptable performance, we wrote design docs for every major module before getting serious about implementation.

Many people consider writing design docs to be a waste of time nowadays, but going through this process, which took months, had a couple big advantages. The first is that working through a design collaboratively teaches everyone on the team everyone else's tricks. It's a lot like the kind of skill transfer you get with pair programming, but applied to design. This was great for me, because as someone with only a decade of experience, I was one of the least experienced people in the room.

The second is that the iteration speed is much faster in the design phase, where throwing away a design just means erasing a whiteboard. Once you start coding, iterating on the design can mean throwing away code; for infrastructure projects, that can easily be person-years or even tens of person-years of work. Since working on the TPU project, I've seen a couple of teams on projects of similar scope insist on getting "working" code as soon as possible. In every single case, that resulted in massive delays as huge chunks of code had to be rewritten, and in a few cases the project was fundamentally flawed in a way that required the team to start over from scratch.

I get that on product-y projects, where you can't tell how much traction you're going to get from something, you might want to get an MVP out the door and iterate, but for pure infrastructure, it's often possible to predict how useful something will be in the design phase.

The other big thing I got out of the job was a better understanding of what's possible when a company makes a real effort to make engineers productive. Something I'd seen repeatedly at Centaur was that someone would come in, take a look around, find the tooling to be a huge productivity sink, and then make a bunch of improvements. They'd then feel satisfied that they'd improved things a lot and then move on to other problems. Then the next new hire would come in, have the same reaction, and do the same thing. The result was tools that improved a lot while I was there, but not to the point where someone coming in would be satisfied with them. Google was the only place I'd worked where a lot of the tools seem like magic compared to what exists in the outside world7. Sure, people complain that a lot of the tooling is falling over, that there isn't enough documentation, and that a lot of it is out of date. All true. But the situation is much better than it's been at any other company I've worked at. That doesn't seem to actually be a competitive advantage for Google's business, but it makes the development experience really pleasant.

Third real job (2015 - 2017)

This was a surprising experience. I think I'm too close to it to really know what I got out of the experience, so fully filling in this section is a TODO.

One thing that was really interesting is that there are a lot of things I used to think of as "table stakes" for getting things done that, it turns out, one can do without. An example is version control. I was and still am strongly in favor of using version control, but the project I worked on with a TL who was strongly against version control was still basically successful. There was a lot of overhead until we started using version control, but dealing with the fallout of not having version control and of people not really syncing changes only cost me a day or two a week of manually merging changes into my private repo to get the build to consistently work. That's obviously far from ideal but, across the entire team, not enough of a cost to make the difference between success and failure.

RC (2017 - present)

I was pretty burnt out after my last job, so I went back to RC to do fun programming-related stuff and recharge. I haven't written up most of what I've worked on (e.g., an analysis of 80k games on Terra Mystica, MTA (NYC) subway data analysis, etc.). I've written up a few things, like latency analysis of computers, terminals, keyboards, and websites, though.

One thing my time at RC has got me thinking about is why it's so hard to get paid well to write. There appears to be a lot of demand for "good" writing, but companies don't seem very willing to create roles for people who could program but want to write. Steve Klabnik has had a tremendous impact on Rust through his writing, probably more impact than the median programmer on most projects, but my impression is that he's taking a significant pay cut over what he could make as a programmer in order to do this really useful and important thing.

I've tried pitching this kind of role at a few places and the response so far has mostly been a combination of:

  • We value writing! I don't think it makes sense to write full-time or even half-time, but you can join my team, where we support writing, and write as a 20%-time project or in your spare time!
  • Uhhh, we could work something out, but why would anyone who can program want to write?

Neither of these responses makes me think that writing would actually be as valued as programming on those teams, even if writing is more valued there than at most places. There are some "developer evangelist" roles that involve writing, but when I read engineering blogs written by people with that title, most of the writing appears to be thinly disguised press releases (there are obviously exceptions to this, but even in the cases where blogs have interesting engineering output, the interesting output is often interleaved with pseudo press releases). In addition to being boring, that kind of thing seems pretty ineffective. At one company I worked for, I ran the traffic numbers for their developer evangelist blogs vs. my own blog, and there were a lot of months where my blog got more traffic than all of their hosted evangelist blogs combined. I don't think it's surprising to find that programmers would rather read explanations/analysis/history than PR, but it seems difficult to convince the right people of this, so I'll probably go back to a programming job after this. We'll see.

BTW, this isn't to say that I don't enjoy programming or don't think that it's important. It's just that writing seems undervalued in a way that makes it relatively easy to have outsized impact through writing. But the same forces that make it easy to have outsized impact also make it difficult to get paid well!

What about the bad stuff?

When I think about my career, it seems to me that it's been one lucky event after the next. I've been unlucky a few times, but I don't really know what to take away from the times I've been unlucky.

For example, I'd consider my upbringing to be mildly abusive. I remember having nights where I couldn't sleep because I'd have nightmares about my father every time I fell asleep. Being awake during the day wasn't a great experience, either. That's obviously not good and in retrospect it seems pretty directly related to the academic problems I had until I moved out, but I don't know that I could give useful advice to a younger version of myself. Don't be born into an abusive family? That's something people would already do if they had any control over the matter.

Or to pick a more recent example, I once joined a team that scored a 1 on the Joel Test. The Joel Test is now considered to be obsolete because it awards points for things like "Do you have testers?" and "Do you fix bugs before writing new code?", which aren't considered best practices by most devs today. Of the items that aren't controversial, many seem so obvious that they're not worth asking about, things like:

  • Do you use source control?
  • Can you make a build in one step?
  • Do you make (at least) daily builds?
  • Do you have a bug database?

For anyone who cares about this kind of thing, it's clearly not a great idea to join a team that does, at most, 1 item off of Joel's checklist (and the 1 wasn't any of the above). Getting first-hand experience on a team that scored a 1 didn't give me any new information that would make me reconsider my opinion.

You might say that I should have asked about those things. It's true! I should have, and I probably will in the future. However, when I was hired, the TL who was against version control and other forms of automation hadn't been hired yet, so I wouldn't have found out about this if I'd asked. Furthermore, even if he'd already been hired, I'm still not sure I would have found out about it -- this is the only time I've joined a team and then found that most of the factual statements made during the recruiting process were untrue. I made sure to ask specific, concrete questions about the state of the project, processes, experiments that had been run, etc., but it turned out the answers were outright falsehoods. When I was on that team, every day featured a running joke between team members about how false the recruiting pitch was!

I could try to prevent similar problems in the future by asking for concrete evidence of factual claims (e.g., if someone claims the attrition rate is X, I could ask for access to the HR database to verify), but considering that I have a finite amount of time and the relatively low probability of being told outright falsehoods, I think I'm going to continue to prioritize finding out other information when I'm considering a job and just accept that there's a tiny probability I'll end up in a similar situation in the future.

When I look at the bad career-related stuff I've experienced, almost all of it falls into one of two categories: something obviously bad that was basically unavoidable, or something obviously bad that I don't know how to reasonably avoid, given limited resources. I don't see much to learn from that. That's not to say that I haven't made and learned from mistakes. I've made a lot of mistakes and do a lot of things differently as a result of mistakes! But my worst experiences have come out of things that I don't know how to prevent in any reasonable way.

This also seems to be true for most people I know. For example, something I've seen a lot is that a friend of mine will end up with a manager whose view is that managers are people who dole out rewards and punishments (as opposed to someone who believes that managers should make the team as effective as possible, or someone who believes that managers should help people grow). When you have a manager like that, a common failure mode is that you're given work that's a bad fit, and then maybe you don't do a great job because the work is a bad fit. If you ask for something that's a better fit, that's refused (why should you be rewarded with doing something you want when you're not doing good work, instead you should be punished by having to do more of this thing you don't like), which causes a spiral that ends in the person leaving or getting fired. In the most recent case I saw, the firing was a surprise to both the person getting fired and their closest co-workers: my friend had managed to find a role that was a good fit despite the best efforts of management; when management decided to fire my friend, they didn't bother to consult the co-workers on the new project, who thought that my friend was doing great and had been doing great for months!

I hear a lot of stories like that, and I'm happy to listen because I like stories, but I don't know that there's anything actionable here. Avoid managers who prefer doling out punishments to helping their employees? Obvious but not actionable.

Conclusion

The most common sort of career advice I see is "you should do what I did because I'm successful". It's usually phrased differently, but that's the gist of it. That basically never works. When I compare notes with friends and acquaintances, it's pretty clear that my career has been unusual in a number of ways, but it's not really clear why.

Just for example, I've almost always had a supportive manager who's willing to not only let me learn whatever I want on my own, but who's willing to expend substantial time and effort to help me improve as an engineer. Most folks I've talked to have never had that. Why the difference? I have no idea.

One story might be: the two times I had unsupportive managers, I quickly found other positions, whereas a lot of friends of mine will stay in roles that are a bad fit for years. Maybe I could spin it to make it sound like the moral of the story is that you should leave roles sooner than you think, but both of the bad situations I ended up in, I only ended up in because I left a role sooner than I should have, so the advice can't be "prefer to leave roles sooner than you think". Maybe the moral of the story should be "leave bad roles more quickly and stay in good roles longer", but that's so obvious that it's not even worth stating. This is arguably non-obvious because people do, in fact, stay in roles where they're miserable, but when I think of people who do so, they fall into one of two categories. Either they're stuck for extrinsic reasons (e.g., they need to wait out the visa clock) or they know that they should leave but can't bring themselves to do so. There's not much to do about the former case, and in the latter case, knowing that they should leave isn't the problem. Every strategy that I can think of is either incorrect in the general case, or so obvious there's no reason to talk about it.

Another story might be: I've learned a lot of meta-skills that are valuable, so you should learn these skills. But you probably shouldn't. The particular set of meta-skills I've picked have been great for me because they're skills I could easily pick up in places I worked (often because I had a great mentor) and because they're things I really strongly believe in doing. Your circumstances and core beliefs are probably different from mine and you have to figure out for yourself what it makes sense to learn.

Yet another story might be: while a lot of opportunities come from serendipity, I've had a lot of opportunities because I spend a lot of time generating possible opportunities. When I passed around the draft of this post to some friends, basically everyone told me that I emphasized luck too much in my narrative and that all of my lucky breaks came from a combination of hard work and trying to create opportunities. While there's a sense in which that's true, many of my opportunities also came out of making outright bad decisions.

For example, I ended up at Centaur because I turned down the chance to work at IBM for a terrible reason! At the end of my internship, my manager made an attempt to convince me to stay on as a full-time employee, but I declined because I was going to grad school. But I was only going to grad school because I wanted to get a microprocessor logic design position, something I thought I couldn't get with just a bachelor's degree. But I could have gotten that position if I hadn't turned my manager down! I'd just forgotten the reason that I'd decided to go to grad school and incorrectly used the cached decision as a reason to turn down the job. By sheer luck, that happened to work out well and I got better opportunities than anyone I know from my intern cohort who decided to take a job at IBM. Have I "mostly" been lucky or prepared? Hard to say; maybe even impossible.

Careers don't have the logging infrastructure you'd need to determine the impact of individual decisions. Careers in programming, anyway. Many sports now track play-by-play data in a way that makes it possible to try to determine how much of success in any particular game or any particular season was luck and how much was skill.

Take baseball, which is one of the better understood sports. If we look at the statistical understanding we have of performance today, it's clear that almost no one had a good idea about what factors made players successful 20 years ago. One thing I find particularly interesting is that we now have much better understanding of which factors are fundamental and which factors come down to luck, and it's not at all what almost anyone would have thought 20 years ago. We can now look at a pitcher and say something like "they've gotten unlucky this season, but their foo, bar, and baz rates are all great so it appears to be bad luck on balls in play as opposed to any sort of decline in skill", and we can also make statements like "they've done well this season but their fundamental stats haven't moved so it's likely that their future performance will be no better than their past performance before this season". We couldn't have made a statement like that 20 years ago. And this is a sport that's had play-by-play video available going back what seems like forever, where play-by-play stats have been kept for a century, etc.

In this sport where everything is measured, it wasn't until relatively recently that we could disambiguate between fluctuations in performance due to luck and fluctuations due to changes in skill. And then there's programming, where it's generally believed to be impossible to measure people's performance and the state of the art in grading people's performance is that you ask five people for their comments on someone and then aggregate the comments. If we're only just now able to make comments on what's attributable to luck and what's attributable to skill in a sport where every last detail of someone's work is available, how could we possibly be anywhere close to making claims about what comes down to luck vs. other factors in something as nebulous as a programming career?

In conclusion, life is messy and I don't have any advice.

Appendix A: meta-skills I'd like to learn

Documentation

I once worked with Jared Davis, a documentation wizard whose documentation was so good that I'd go to him to understand how a module worked before I talked to the owner of the module. As far as I could tell, he wrote documentation on things he was trying to understand to make life easier for himself, but his documentation was so good that it was a force multiplier for the entire company.

Later, at Google, I noticed a curiously strong correlation between the quality of initial design docs and the success of projects. Since then, I've tried to write solid design docs and documentation for my projects, but I still have a ways to go.

Fixing totally broken situations

So far, I've only landed on teams where things are much better than average and on teams where things are much worse than average. You might think that, because there's so much low hanging fruit on teams that are much worse than average, it should be easier to improve things on teams that are terrible, but it's just the opposite. The places that have a lot of problems have problems because something makes it hard to fix the problems.

When I joined the team that scored a 1 on the Joel Test, it took months of campaigning just to get everyone to use version control.

I've never seen an environment go from "bad" to "good" and I'd be curious to know what that looks like and how it happens. Yossi Kreinin's thesis is that only management can fix broken situations. That might be true, but I'm not quite ready to believe it just yet, even though I don't have any evidence to the contrary.

Appendix B: other "how I became a programmer" stories

Kragen. Describes 27 years of learning to program. Heavy emphasis on conceptual phases of development (e.g., understanding how to use provided functions vs. understanding that you can write arbitrary functions)

Julia Evans. Started programming on a TI-83 in 2004. Dabbled in programming until college (2006-2011) and has been working as a professional programmer ever since. Some emphasis on the "journey" and how long it takes to improve.

Philip Guo. A non-traditional story of learning to program, which might be surprising if you know that Philip's career path was MIT -> Stanford -> Google.

Tavish Armstrong. 4th grade through college. Emphasis on particular technologies (e.g., LaTeX or Python).

Caitie McCaffrey. Started programming in AP computer science. Emphasis on how interests led to a career in programming.

Matt DeBoard. Spent 12 weeks learning Django with the help of a mentor. Emphasis on the fact that it's possible to become a programmer without programming background.

Kristina Chodorow. Started in college. Emphasis on alternatives (math, grad school).

Michael Bernstein. Story of learning Haskell over the course of years. Emphasis on how long it took to become even minimally proficient.

Thanks to Leah Hanson, Lindsey Kuper, Kelley Eskridge, Jeshua Smith, Tejas Sapre, Joe Wilder, Adrien Lamarque, Maggie Zhou, Lisa Neigut, Steve McCarthy, Darius Bacon, Kaylyn Gibilterra, Sarah Ransohoff, @HamsterRaging, and "biktian" for comments/criticism/discussion.


  1. If you happen to have contact information for Mr. Swanson, I'd love to be able to send a note saying thanks. [return]
  2. Wayne Dickey, Richard Brualdi, Andreas Seeger, and a visiting professor whose name escapes me. [return]
  3. I strongly recommend Andy Weiner for any class, as well as the guy who taught mathematical physics when I sat in on it, but I don't remember who that was or if that's even the exact name of the class. [return]
  4. with the exception of one government lab, which gave me an offer on the strength of a non-technical on-campus interview. I believe that was literally the first interview I did when I was looking for work, but they didn't get back to me until well after interview season was over and I'd already accepted an offer. I wonder if that's because they went down the list of candidates in some order and only got to me after N people turned them down or if they just had a six month latency on offers. [return]
  5. Because Intel sees no reason to keep its competitors informed about what it's doing, this results in a substantial latency when matching new features. They usually announce enough information that you can implement the basic functionality, but behavior on edge cases may vary. We once had a bug (noticed and fixed well before we shipped, but still problematic) where we bought an engineering sample off of ebay and implemented some new features based on the engineering sample. This resulted in an MWAIT bug that caused Windows to hang; Intel had changed the behavior of MWAIT between shipping the engineering sample and shipping the final version.

    I recently saw a post that claims that you can get great performance per dollar by buying some engineering samples off of ebay. Don't do this. Engineering samples regularly have bugs. Sometimes those bugs are actual bugs, and sometimes it's just that Intel changed their minds. Either way, you really don't want to run production systems off of engineering samples.

    [return]
  6. I occasionally got feedback by taking a problem I'd solved to someone and asking them if they had any better ideas, but that's much less in depth than the kind of feedback I'm talking about here. [return]
  7. To pick one arbitrary concrete example, look at version control at Microsoft from someone who worked on Windows Vista:

    In small programming projects, there's a central repository of code. Builds are produced, generally daily, from this central repository. Programmers add their changes to this central repository as they go, so the daily build is a pretty good snapshot of the current state of the product.

    In Windows, this model breaks down simply because there are far too many developers to access one central repository. So Windows has a tree of repositories: developers check in to the nodes, and periodically the changes in the nodes are integrated up one level in the hierarchy. At a different periodicity, changes are integrated down the tree from the root to the nodes. In Windows, the node I was working on was 4 levels removed from the root. The periodicity of integration decayed exponentially and unpredictably as you approached the root so it ended up that it took between 1 and 3 months for my code to get to the root node, and some multiple of that for it to reach the other nodes. It should be noted too that the only common ancestor that my team, the shell team, and the kernel team shared was the root.

    Google and Microsoft both maintained their own forks of perforce because that was the most scalable source control system available at the time. Google would go on to build piper, a distributed version control system (in the distributed systems sense, not in the git sense) that solved the scaling problem while having a dev experience that wasn't nearly as painful. But that option wasn't really on the table at Microsoft. In the comments to the post quoted above, a then-manager at Microsoft commented that the possible options were:

    1. federate out the source tree, and pay the forward and reverse integration taxes (primarily delay in finding build breaks), or...
    2. remove a large number of the unnecessary dependencies between the various parts of Windows, especially the circular dependencies.
    3. Both 1&2

    #1 was the winning solution in large part because it could be executed by a small team over a defined period of time. #2 would have required herding all the Windows developers (and PMs, managers, UI designers...), and is potentially an unbounded problem.

    Someone else commented, to me, that they were on an offshoot team that got the one-way latency down from months to weeks. That's certainly an improvement, but why didn't anyone build a system like piper? I asked that question of people who were at Microsoft at the time, and I got answers like "when we started using perforce, it was so much faster than what we'd previously had that it didn't occur to people that we could do much better" and "perforce was so much faster than xcopy that it seemed like magic".

    This general phenomenon, where people don't attempt to make a major improvement because the current system is already such a huge improvement over the previous system, is something I'd seen before and even something I'd done before. This example happens to use Microsoft and Google, but please don't read too much into that. There are systems where things are flipped around and the system at Google is curiously unwieldy compared to the same system at Microsoft.

    [return]

Notes on concurrency bugs

2016-08-05 11:32:26

Do concurrency bugs matter? From the literature, we know that most reported bugs in distributed systems have really simple causes and can be caught by trivial tests, even when we only look at bugs that cause really bad failures, like loss of a cluster or data corruption. The filesystem literature echoes this result -- a simple checker that looks for totally unimplemented error handling can find hundreds of serious data corruption bugs. Most bugs are simple, at least if you measure by bug count. But if you measure by debugging time, the story is a bit different.

Just from personal experience, I've spent more time debugging complex non-deterministic failures than all other types of bugs combined. In fact, I've spent more time debugging some individual non-deterministic bugs (weeks or months) than on all other bug types combined. Non-deterministic bugs are rare, but they can be extremely hard to debug and they're a productivity killer. Bad non-deterministic bugs take so long to debug that relatively large investments in tools and prevention can be worth it1.

Let's see what the academic literature has to say on non-deterministic bugs. There's a lot of literature out there, so let's narrow things down by looking at one relatively well studied area: concurrency bugs. We'll start with the literature on single-machine concurrency bugs and then look at distributed concurrency bugs.

Fonseca et al. DSN '10

They studied MySQL concurrency bugs from 2003 to 2009 and found the following:

More non-deadlock bugs (63%) than deadlock bugs (40%)

Note that these numbers sum to more than 100% because some bugs are tagged with multiple causes. This is roughly in line with the Lu et al. ASPLOS '08 paper (which we'll look at later), which found that 30% of the bugs they examined were deadlock bugs.

15% of examined failures were semantic

The paper defines a semantic failure as one "where the application provides the user with a result that violates the intended semantics of the application". The authors also find that "the vast majority of semantic bugs (92%) generated subtle violations of application semantics". By their nature, these failures are likely to be undercounted -- it's pretty hard to miss a deadlock, but it's easy to miss subtle data corruption.

15% of examined failures were latent

The paper defines latent failures as bugs that "do not become immediately visible to users". Unsurprisingly, the paper finds that latent failures are closely related to semantic failures; 92% of latent failures are semantic and vice versa. The 92% number makes this finding sound more precise than it really is -- it's just that 11 out of the 12 semantic failures are latent and vice versa. That could have easily been 11 out of 11 (100%) or 10 out of 12 (83%).

That's interesting, but it's hard to tell from that if the results generalize to projects that aren't databases, or even projects that aren't MySQL.

Lu et al. ASPLOS '08

They looked at concurrency bugs in MySQL, Firefox, OpenOffice, and Apache. Some of their findings are:

97% of examined non-deadlock bugs were atomicity-violation or order-violation bugs

Of the 74 non-deadlock bugs studied, 51 were atomicity bugs, 24 were ordering bugs, and 2 were categorized as "other".

An example of an atomicity violation is this bug from MySQL:

Thread 1:

if (thd->proc_info)
  fputs(thd->proc_info, ...)

Thread 2:

thd->proc_info = NULL;

For anyone who isn't used to C or C++, thd is a pointer, and -> is the operator to access a field through a pointer. The first line in thread 1 checks if the field is null. The second line calls fputs, which reads the field and writes its contents out. The intent is to call fputs if and only if proc_info isn't NULL, but there's nothing preventing another thread from setting proc_info to NULL "between" the first and second lines of thread 1.
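
To make the usual fix for this pattern concrete, here's a minimal C++ sketch (the simplified THD struct and the proc_info_mutex field are hypothetical, and this isn't MySQL's actual fix): the check and the use have to happen under the same lock that the writer holds when it clears the field.

// Hedged sketch, not MySQL's actual fix: a simplified THD with a hypothetical
// mutex guarding proc_info. The reader and the writer take the same lock.
#include <cstdio>
#include <mutex>

struct THD {
  std::mutex proc_info_mutex;
  const char* proc_info = "copying to tmp table";
};

void report(THD* thd) {
  std::lock_guard<std::mutex> guard(thd->proc_info_mutex);
  if (thd->proc_info)                    // check...
    std::fputs(thd->proc_info, stderr);  // ...and use under the same lock
}

void clear(THD* thd) {
  std::lock_guard<std::mutex> guard(thd->proc_info_mutex);
  thd->proc_info = nullptr;
}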

Like most bugs, this bug is obvious in retrospect, but if we look at the original bug report, we can see that it wasn't obvious at the time:

Description: I've just noticed with the latest bk tree than MySQL regularly crashes in InnoDB code ... How to repeat: I've still no clues on why this crash occurs.

As is common with large codebases, fixing the bug once it was diagnosed was more complicated than it first seemed. This bug was partially fixed in 2004, resurfaced again and was fixed in 2008. A fix for another bug caused a regression in 2009, which was also fixed in 2009. That fix introduced a deadlock that was found in 2011.

An example ordering bug is the following bug from Firefox:

Thread 1:

mThread=PR_CreateThread(mMain, ...);

Thread 2:

void mMain(...) {
  mState = mThread->State;
}

Thread 1 launches Thread 2 with PR_CreateThread. Thread 2 assumes that, because the line that launched it assigned to mThread, mThread is valid. But Thread 2 can start executing before Thread 1 has assigned to mThread! The authors note that they call this an ordering bug and not an atomicity bug, even though the bug could have been prevented if the line in thread 1 were atomic, because their "bug pattern categorization is based on root cause, regardless of possible fix strategies".
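
A minimal sketch of one way to avoid this class of ordering bug, using C++11 threads rather than NSPR and entirely hypothetical names: don't let the child read a variable the parent assigns after the launch; make the child wait until the parent signals that initialization is done (or, simpler still, pass the child everything it needs as arguments at creation time).

#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex mu;
std::condition_variable cv;
std::thread worker;
bool worker_handle_ready = false;  // guarded by mu

void worker_main() {
  // Don't touch `worker` until the parent says the assignment is complete.
  std::unique_lock<std::mutex> lock(mu);
  cv.wait(lock, [] { return worker_handle_ready; });
  // ... safe to inspect `worker` (the analogue of mThread) from here on ...
}

int main() {
  {
    std::lock_guard<std::mutex> lock(mu);
    worker = std::thread(worker_main);
    worker_handle_ready = true;
  }
  cv.notify_one();
  worker.join();
}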

An example of an "other" bug, one of only two studied, is this bug in MySQL:

Threads 1...n:

rw_lock(&lock);

Watchdog thread:

if (lock_wait_time[i] > fatal_timeout)
  assert(0);

This can cause a spurious crash when there's more than the expected amount of work. Note that the study doesn't look at performance bugs, so a bug where lock contention causes things to slow to a crawl but a watchdog doesn't kill the program wouldn't be considered.

An aside that's probably a topic for another post is that hardware often has deadlock or livelock detection built in, and when a lock condition is detected, the hardware will often try to push things into a state where normal execution can continue. After detecting and breaking the deadlock/livelock, it will typically log an error in a way that will be noticed if the part is in the lab, but that external customers won't see. For some reason, that strategy seems rare in the software world, although it seems like it should be easier in software than in hardware.

Deadlock occurs if and only if the following four conditions are true:

  1. Mutual exclusion: at least one resource must be held in a non-shareable mode. Only one process can use the resource at any given instant of time.
  2. Hold and wait or resource holding: a process is currently holding at least one resource and requesting additional resources which are being held by other processes.
  3. No preemption: a resource can be released only voluntarily by the process holding it.
  4. Circular wait: a process must be waiting for a resource which is being held by another process, which in turn is waiting for the first process to release the resource.

There's nothing about these conditions that is unique to either hardware or software, and it's easier to build mechanisms that can back off and replay to relax (2) in software than in hardware.
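
For instance, C++ ships a small version of this idea: std::lock (and std::scoped_lock in C++17) acquires multiple mutexes using a deadlock-avoidance algorithm, typically try-and-back-off, so callers don't have to agree on a global lock order. A minimal sketch (the Account type is made up for illustration):

#include <mutex>

struct Account {
  std::mutex mu;
  long balance = 0;
};

// std::scoped_lock acquires both mutexes with a deadlock-avoidance algorithm
// (as if by std::lock), so concurrent transfer(a, b) and transfer(b, a) calls
// can't deadlock even though they name the locks in opposite orders.
void transfer(Account& from, Account& to, long amount) {
  std::scoped_lock lock(from.mu, to.mu);
  from.balance -= amount;
  to.balance += amount;
}

Anyway, back to the study findings.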

96% of examined concurrency bugs could be reproduced by fixing the relative order of 2 specific threads

This sounds like great news for testing. Testing only orderings between thread pairs is much more tractable than testing all orderings between all threads. Similarly, 92% of examined bugs could be reproduced by fixing the order of four (or fewer) memory accesses. However, there's a kind of sampling bias here -- only bugs that could be reproduced could be analyzed for a root cause, and bugs that only require ordering between two threads or only a few memory accesses are easier to reproduce.

97% of examined deadlock bugs were caused by two threads waiting for at most two resources

Moreover, 22% of examined deadlock bugs were caused by a thread acquiring a resource held by the thread itself. The authors state that pairwise testing of acquisition and release sequences should be able to catch most deadlock bugs, and that pairwise testing of thread orderings should be able to catch most non-deadlock bugs. The claim seems plausibly true when read as written; the implication seems to be that virtually all bugs can be caught through some kind of pairwise testing, but I'm a bit skeptical of that due to the sample bias of the bugs studied.

I've seen bugs with many moving parts take months to track down. The worst bug I've seen consumed nearly a person-year's worth of time. Bugs like that mostly don't make it into studies like this because it's rare that a job allows someone the time to chase bugs that elusive. How many bugs like that are out there is still an open question.

Caveats

Note that all of the programs studied were written in C or C++, and that this study predates C++11. Moving to C++11 and using atomics and scoped locks would probably change the numbers substantially, not to mention moving to an entirely different concurrency model. There's some academic work on how different concurrency models affect bug rates, but it's not really clear how that work generalizes to codebases as large and mature as the ones studied, and by their nature, large and mature codebases are hard to do randomized trials on when the trial involves changing the fundamental primitives used. The authors note that 39% of examined bugs could have been prevented by using transactional memory, but it's not clear how many other bugs might have been introduced if transactional memory were used.

Tools

There are other papers on characterizing single-machine concurrency bugs, but in the interest of space, I'm going to skip those. There are also papers on distributed concurrency bugs, but before we get to that, let's look at some of the tooling for finding single-machine concurrency bugs that's in the literature. I find the papers to be pretty interesting, especially the model checking work, but realistically, I'm probably not going to build a tool from scratch if something is available, so let's look at what's out there.

HapSet

Uses run-time coverage to generate interleavings that haven't been covered yet. This is out of NEC labs; googling NEC labs HapSet returns the paper and some patent listings, but no obvious download for the tool.

CHESS

Generates unique interleavings of threads for each run. They claim that, by not tracking state, the checker is much simpler than it would otherwise be, and that they're able to avoid many of the disadvantages of tracking state via a detail that can't properly be described in this tiny little paragraph; read the paper if you're interested! Supports C# and C++. The page claims that it requires Visual Studio 2010 and that it's only been tested with 32-bit code. I haven't tried to run this on a modern *nix compiler, but IME requiring Visual Studio 2010 means that it would be a moderate effort to get it running on a modern version of Visual Studio, and a substantial effort to get it running on a modern version of gcc or clang. A quick Google search indicates that this might be patent encumbered2.

Maple

Uses coverage to generate interleavings that haven't been covered yet. Instruments pthreads. The source is up on GitHub. It's possible this tool is still usable, and I'll probably give it a shot at some point, but it depends on at least one old, apparently unmaintained tool (PIN, a binary instrumentation tool from Intel). Googling (Binging?) for either Maple or PIN gives a number of results where people can't even get the tool to compile, let alone use the tool.

PACER

Samples using the FastTrack algorithm in order to keep overhead low enough "to consider in production software". Ironically, this was implemented on top of the Jikes RVM, which is unlikely to be used in actual production software. The only reference I could find for an actually downloadable tool is a completely different pacer.

ConLock / MagicLock / MagicFuzzer

There's a series of tools that are from one group which claims to get good results using various techniques, but AFAICT the source isn't available for any of the tools. There's a page that claims there's a version of MagicFuzzer available, but it's a link to a binary that doesn't specify what platform the binary is for and the link 404s.

OMEN / WOLF

I couldn't find a page for these tools (other than their papers), let alone a download link.

SherLock / AtomChase / Racageddon

Another series of tools that aren't obviously available.

Tools you can actually easily use

Valgrind / DRD / Helgrind

Instruments pthreads and is easy to use -- just run valgrind with the appropriate tool option (--tool=drd or --tool=helgrind) on the binary. May require a couple of tweaks if using C++11 threading.

clang thread sanitizer (TSan)

Can find data races. Flags when happens-before is violated. Works with pthreads and C++11 threads. Easy to use (just pass -fsanitize=thread to clang).
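
As a minimal example of the kind of thing it flags (the file and variable names are made up): two threads bumping a plain int with no synchronization is a data race, and a TSan-instrumented build will report it, with stacks for both conflicting accesses.

// race.cc -- build and run with: clang++ -fsanitize=thread -g race.cc && ./a.out
#include <thread>

int counter = 0;  // shared and unsynchronized: a data race

int main() {
  std::thread t1([] { for (int i = 0; i < 100000; i++) counter++; });
  std::thread t2([] { for (int i = 0; i < 100000; i++) counter++; });
  t1.join();
  t2.join();
  // TSan prints a data race report naming `counter` and both writing threads.
}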

A side effect of being so easy to use and actually available is that TSan has had a very large impact in the real world:

One interesting incident occurred in the open source Chrome browser. Up to 15% of known crashes were attributed to just one bug [5], which proved difficult to understand - the Chrome engineers spent over 6 months tracking this bug without success. On the other hand, the TSAN V1 team found the reason for this bug in a 30 minute run, without even knowing about these crashes. The crashes were caused by data races on a couple of reference counters. Once this reason was found, a relatively trivial fix was quickly made and patched in, and subsequently the bug was closed.

clang -Wthread-safety

Static analysis that uses annotations on shared state to flag accesses that aren't correctly guarded.
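
For a flavor of what the annotations look like, here's a compile-time sketch using the raw Clang attributes (real code usually wraps them in macros; the annotated Mutex wrapper is modeled on the one in Clang's documentation and is declaration-only here, and the class names are made up). Checking it with clang++ -Wthread-safety -fsyntax-only warns on the unguarded access.

// The wrapper only needs annotations for the analysis; Lock()/Unlock() are
// left as declarations, as in Clang's thread safety documentation.
class __attribute__((capability("mutex"))) Mutex {
 public:
  void Lock() __attribute__((acquire_capability()));
  void Unlock() __attribute__((release_capability()));
};

class Counter {
 public:
  void Increment() {
    mu_.Lock();
    value_++;  // OK: mu_ is held here.
    mu_.Unlock();
  }
  void IncrementUnguarded() {
    value_++;  // warning: writing variable 'value_' requires holding mutex 'mu_'
  }

 private:
  Mutex mu_;
  int value_ __attribute__((guarded_by(mu_)));
};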

FindBugs

General static analysis for Java with many features. Has @GuardedBy annotations, similar to -Wthread-safety.

CheckerFramework

Java framework for writing checkers. Has many different checkers. For concurrency in particular, uses @GuardedBy, like FindBugs.

rr

Deterministic replay for debugging. Easy to get and use, and appears to be actively maintained. Adds support for time-travel debugging in gdb.

DrDebug/PinPlay

General toolkit that can give you deterministic replay for debugging. Also gives you "dynamic slicing", which is watchpoint-like: it can tell you what statements affected a variable, as well as what statements are affected by a variable. Currently Linux only; claims Windows and Android support coming soon.

Other tools

This isn't an exhaustive list -- there's a ton of literature on this, and this is an area where, frankly, I'm pretty unlikely to have the time to implement a tool myself, so there's not much value for me in reading more papers to find out about techniques that I'd have to implement myself3. However, I'd be interested in hearing about other tools that are usable.

One thing I find interesting about this is that almost all of the papers for the academic tools claim to do something novel that lets them find bugs not found by other tools. They then run their tool on some codebase and show that the tool is capable of finding new bugs. But since almost no one goes and runs the older tools on any codebase, you'd never know if one of the newer tools only found a subset of the bugs that one of the older tools could catch.

Furthermore, you see cycles (livelock?) in how papers claim to be novel. Paper I will claim that it does X. Paper II will claim that it's novel because it doesn't need to do X, unlike Paper I. Then Paper III will claim that it's novel because, unlike Paper II, it does X.

Distributed systems

Now that we've looked at some of the literature on single-machine concurrency bugs, what about distributed concurrency bugs?

Leesatapornwongsa et al. ASPLOS 2016

They looked at 104 bugs in Cassandra, MapReduce, HBase, and Zookeeper. Let's look at some example bugs, which will clarify the terminology used in the study and make it easier to understand the main findings.

Message-message race

This diagram is just for reference, so that we have a high-level idea of how different parts fit together in MapReduce:

Block diagram of MapReduce

In MapReduce bug #3274, a resource manager sends a task-init message to a node manager. Shortly afterwards, an application master sends a task-kill preemption to the same node manager. The intent is for the task-kill message to kill the task that was started with the task-init message, but the task-kill can win the race and arrive before the task-init. This example happens to be a case where two messages from different nodes are racing to get to a single node.

Another example is MapReduce bug #5358, where an application master sends a kill message to a node manager running a speculative task because another copy of the task has finished. However, before the kill message is received by the node manager, the node manager's copy of the task completes, causing it to send a complete message to the application master, which throws an exception because it receives a complete message for a task that has already completed.

Message-compute race

One example is MapReduce bug #4157, where the application master unregisters with the resource manager. The application master then cleans up, but that clean-up races against the resource manager sending kill messages to the application's containers via node managers, causing the application master to get killed. Note that this is classified as a race and not an atomicity bug, which we'll get to shortly.

Compute-compute races can happen, but they're outside the scope of this study since this study only looks at distributed concurrency bugs.

Atomicity violation

For the purposes of this study, atomicity bugs are defined as "whenever a message comes in the middle of a set of events, which is a local computation or global communication, but not when the message comes either before or after the events". According to this definition, the message-compute race we looked at above isn't an atomicity bug because it would still be a bug if the message came in before the "computation" started. This definition also means that hardware failures that occur inside a block that must be atomic are not considered atomicity bugs.

I can see why you'd want to define those bugs as separate types of bugs, but I find this to be a bit counterintuitive, since I consider all of these to be different kinds of atomicity bugs because they're different bugs that are caused by breaking up something that needs to be atomic.

In any case, by the definition of this study, MapReduce bug #5009 is an atomicity bug. A node manager is in the process of committing data to HDFS. The resource manager kills the task, which doesn't cause the commit state to change. Any time the node tries to rerun the commit task, the task is killed by the application manager because a commit is believed to already be in process.

Fault timing

A fault is defined to be a "component failure", such as a crash, timeout, or unexpected latency. At one point, the paper refers to "hardware faults such as machine crashes", which seems to indicate that some faults that could be considered software faults are defined as hardware faults for the purposes of this study.

Anyway, for the purposes of this study, an example of a fault-timing issue is MapReduce bug #3858. A node manager crashes while committing results. When the task is re-run, later attempts to commit all fail.

Reboot timing

In this study, reboots are classified separately from other faults. MapReduce bug #3186 illustrates a reboot bug.

A resource manager sends a job to an application master. If the resource manager is rebooted before the application master sends a commit message back to the resource manager, the resource manager loses its state and throws an exception because it's getting an unexpected complete message.

Some of their main findings are:

47% of examined bugs led to latent failures

That's a pretty large difference when compared to the DSN '10 paper that found that 15% of examined multithreading bugs were latent failures. It's plausible that this is a real difference and not just something due to a confounding variable, but it's hard to tell from the data.

This is a large difference from what studies on "local" concurrency bugs found. I wonder how much of that is just because people mostly don't even bother filing and fixing bugs on hardware faults in non-distributed software.

64% of examined bugs were triggered by a single message's timing

44% were ordering violations, and 20% were atomicity violations. Furthermore, > 90% of bugs involved three messages (or fewer).

32% of examined bugs were due to fault or reboot timing. Note that, for the purposes of the study, a hardware fault or a reboot that breaks up a block that needed to be atomic isn't considered an atomicity bug -- here, atomicity bugs are bugs where a message arrives in the middle of a computation that needs to be atomic.

70% of bugs had simple fixes

30% were fixed by ignoring the badly timed message and 40% were fixed by delaying or ignoring the message.

Bug causes?

After reviewing the bugs, the authors propose common fallacies that lead to bugs:

  1. One hop is faster than two hops
  2. Zero hops are faster than one hop
  3. Atomic blocks can't be broken

On (3), the authors note that it's not just hardware faults or reboots that break up atomic blocks -- systems can send kill or pre-emption messages that break up an atomic block. A fallacy that I've commonly seen in post-mortems, but that isn't listed here, goes something like "bad nodes are obviously bad". A classic example of this is when a system starts "handling" queries by dropping them quickly, causing a load balancer to shift more traffic to the bad node because it appears to be handling traffic so quickly.

One of my favorite bugs in this class from an actual system was in a ring-based storage system where nodes could do health checks on their neighbors and declare that their neighbors should be dropped if the health check fails. One node went bad, dropped all of its storage, and started reporting its neighbors as bad nodes. Its neighbors noticed that the bad node was bad, but because the bad node had dropped all of its storage, it was super fast and was able to report its good neighbors before the good neighbors could report the bad node. After ejecting its immediate neighbors, the bad node got new neighbors and raced the new neighbors, winning again for the same reason. This was repeated until the entire cluster died.

Tools

Mace

A set of language extensions (on C++) that helps you build distributed systems. Mace has a model checker that can check all possible event orderings of messages, interleaved with crashes, reboots, and timeouts. The Mace model checker is actually available, but AFAICT it requires using the Mace framework, and most distributed systems aren't written in Mace.

Modist

Another model checker that checks different orderings. Runs only one interleaving of independent actions (partial order reduction) to avoid checking redundant states. Also interleaves timeouts. Unlike Mace, doesn't inject reboots. Doesn't appear to be available.

Demeter

Like Modist, in that it's a model checker that injects the same types of faults. Uses a different technique to reduce the state space, which I don't know how to summarize succinctly. See paper for details. Doesn't appear to be available. Googling for Demeter returns some software used to model X-ray absorption?

SAMC

Another model checker. Can inject multiple crashes and reboots. Uses some understanding of the system to avoid redundant re-orderings (e.g., if a series of messages is invariant to when a reboot is injected, the system tries to avoid injecting the reboot between each message). Doesn't appear to be available.

Jepsen

As was the case for non-distributed concurrency bugs, there's a vast literature on academic tools, most of which appear to be grad-student code that hasn't been made available.

And of course there's Jepsen, which doesn't have any attached academic papers, but has probably had more real-world impact than any of the other tools because it's actually available and maintained. There's also chaos monkey, but if I'm understanding it correctly, unlike the other tools listed, it doesn't attempt to create reproducible failures.

Conclusion

Is this where you're supposed to have a conclusion? I don't have a conclusion. We've looked at some literature and found out some information about bugs that's interesting, but not necessarily actionable. We've read about tools that are interesting, but not actually available. And then there are some tools based on old techniques that are available and useful.

For example, the idea inside clang's TSan, using "happens-before" to find data races, goes back ages. There's a 2003 paper that discusses "combining two previously known race detection techniques -- lockset-based detection and happens-before-based detection -- to obtain fewer false positives than lockset-based detection alone". That's actually what TSan v1 did, but with TSan v2 they realized the tool would be more impactful if they only used happens-before because that avoids false positives, which means that people will actually use the tool. That's not something that's likely to turn into a paper that gets cited zillions of times, though. For anyone who's looked at how afl works, this story should sound familiar. AFL is eminently practical and has had a very large impact in the real world, mostly by eschewing fancy techniques from the recent literature.

If you must have a conclusion, maybe the conclusion is that individuals like Kyle Kingsbury or Michal Zalewski have had an outsized impact on industry, and that you too can probably pick an underserved area in testing and have a curiously large impact on an entire industry.

Unrelated miscellania

Rose Ames asked me to tell more "big company" stories, so here's a set of stories that explains why I haven't put a blog post up for a while. The proximal cause is that my VP has been getting negative comments about my writing. But the reasons for that are a bit of a long story. Part of it is the usual thing, where the comments I receive personally skew very heavily positive, but the comments my manager gets run the other way because it's weird to email someone's manager because you like their writing, but you might send an email if their writing really strikes a nerve.

That explains why someone in my management chain was getting emailed about my writing, but it doesn't explain why the emails went to my VP. That's because I switched teams a few months ago, and the org that I was going to switch into overhired and didn't have any headcount. I've heard conflicting numbers about how much they overhired, from 10 or 20 people to 10% or 20% (the org is quite large, and 10% would be much more than 20), as well as conflicting stories about why it happened (honest mistake vs. some group realizing that there was a hiring crunch coming and hiring as much as possible to take all of the reqs from the rest of the org). Anyway, for some reason, the org I would have worked in hired more than it was allowed to by at least one person and instituted a hiring freeze. Since my new manager couldn't hire me into that org, he transferred into an org that had spare headcount and hired me into the new org. The new org happens to be a sales org, which means that I technically work in sales now; this has some impact on my day-to-day life since there are some resources and tech talks that are only accessible by people in product groups, but that's another story. Anyway, for reasons that I don't fully understand, I got hired into the org before my new manager, and during the months it took for the org chart to get updated I was shown as being parked under my VP, which meant that anyone who wanted to fire off an email to my manager would look me up in the directory and accidentally email my VP instead.

It didn't seem like any individual email was a big deal, but since I don't have much interaction with my VP and I don't want to only be known as that guy who writes stuff which generates pushback from inside the company, I paused blogging for a while. I don't exactly want to be known that way to my manager either, but I interact with my manager frequently enough that at least I won't only be known for that.

I also wonder if these emails to my manager/VP are more likely at my current employer than at previous employers. I've never had this happen (that I know of) at another employer, but the total number of times it's happened here is low enough that it might just be coincidence.

Then again, I was just reading the archives of a really insightful internal blog and ran across a note that mentioned that the series of blog posts was being published internally because the author got static from Sinofsky about publishing posts that contradicted the party line, which eventually resulted in the author agreeing to email Sinofsky comments related to anything under Sinofsky's purview instead of publishing the comments publicly. But now that Sinofsky has moved on, the author wanted to share emails that would have otherwise been posts internally.

That kind of thing doesn't seem to be a freak occurrence around here. At the same time I saw that thing about Sinofsky, I ran across a discussion on whether or not a PM was within their rights to tell someone to take down a negative review from the app store. Apparently, a PM found out that someone had written a negative rating on the PM's product in some app store and emailed the rater, telling them that they had to take the review down. It's not clear how the PM found out that the rater worked for us (do they search the internal directory for every negative rating they find?), but they somehow found out and then issued their demand. Most people thought that the PM was out of line, but there were a non-zero number of people (in addition to the PM) who thought that employees should not say anything that could be construed as negative about the company in public.

I feel like I see more of this kind of thing now than I have at other companies, but the company's really too big to tell if anyone's personal experience generalizes. Anyway, I'll probably start blogging again now that the org chart shows that I report to my actual manager, and maybe my manager will get some emails about that. Or maybe not.

Thanks to Leah Hanson, David Turner, Justin Mason, Joe Wilder, Matt Dziubinski, Alex Blewitt, Bruno Kim Medeiros Cesar, Luke Gilliam, Ben Karas, Julia Evans, Michael Ernst, and Stephen Tu for comments/corrections.


  1. If you're going to debug bugs. I know some folks at startups who give up on bugs that look like they'll take more than a few hours to debug because their todo list is long enough that they can't afford the time. That might be the right decision given the tradeoffs they have, but it's not the right decision for everyone. [return]
  2. Funny thing about US patent law: you owe treble damages for willfully infringing on a patent. A direct effect of this is that two out of three of my full-time employers have very strongly recommended that I don't read patents, so I avoid reading patents that aren't obviously frivolous. And by frivolous, I don't mean patents for obvious things that any programmer might independently discover, because patents like that are often upheld as valid. I mean patents for things like how to swing on a swing. [return]
  3. I get the incentives that lead to this, and I don't begrudge researchers for pursuing career success by responding to those incentives, but as a lowly practitioner, it sure would be nice if the incentives were different. [return]

Some programming blogs to consider reading

2016-04-18 15:06:34

This is one of those “N technical things every programmer must read” lists, except that “programmer” is way too broad a term and the styles of writing people find helpful for them are too different for any such list to contain a non-zero number of items (if you want the entire list to be helpful to everyone). So here's a list of some things you might want to read, and why you might (or might not) want to read them.

Aleksey Shipilev

If you want to understand how the JVM really works, this is one of the best resources on the internet.

Bruce Dawson

Performance explorations of a Windows programmer. Often implicitly has nice demonstrations of tooling that has no publicly available peer on Linux.

Chip Huyen

A mix of summaries of ML conferences, data analyses (e.g., on interview data posted to glassdoor or compensation data posted to levels.fyi), and general commentary on the industry.

One of the rare blogs that has data-driven position pieces about the industry.

Chris Fenton

Computer related projects, by which I mean things like reconstructing the Cray-1A and building mechanical computers. Rarely updated, presumably due to the amount of work that goes into the creations, but almost always interesting.

The blog posts tend to be high-level, more like pitch decks than design docs, but there's often source code available if you want more detail.

Cindy Sridharan

More active on Twitter than on her blog, but has posts that review papers as well as some on "big" topics, like distributed tracing and testing in production.

Dan McKinley

A lot of great material on how engineering companies should be run. He has a lot of ideas that sound like common sense, e.g., choose boring technology, until you realize that it's actually uncommon to find opinions that are so sensible.

Mostly distilled wisdom (as opposed to, say, detailed explanations of code).

Eli Bendersky

I think of this as “the C++ blog”, but it's much wider ranging than that. It's too wide ranging for me to sum up, but if I had to commit to a description I might say that it's a collection of deep dives into various topics, often (but not always) relatively low-level, along with short blurbs about books, often (but not always) technical.

The book reviews tend to be easy reading, but the programming blog posts are often a mix of code and exposition that really demands your attention; usually not a light read.

Erik Sink

I think Erik has been the most consistently insightful writer about tech culture over the past 20 years. If you look at people who were blogging back when he started blogging, much of Steve Yegge's writing holds up as well as Erik's, but Steve hasn't continued writing consistently.

If you look at popular writers from that era, I think they generally tend to not really hold up very well.

Fabian Giesen

Covers a wide variety of technical topics. Emphasis on computer architecture, compression, graphics, and signal processing, but you'll find many other topics as well.

Posts tend towards being technically intense and not light reading and they usually explain concepts or ideas (as opposed to taking sides and writing opinion pieces).

Fabien Sanglard

In-depth technical dives on game-related topics, such as this readthrough of the Doom source code, this history of Nvidia GPU architecture, or this read of a business card raytracer.

Fabrice Bellard

Not exactly a blog, but every time a new project appears on the front page, it's amazing. Some examples are QEMU, FFMPEG, a 4G LTE base station that runs on a PC, a JavaScript PC emulator that can boot Linux, etc.

Fred Akalin

Explanations of CS-related math topics (with a few that aren't directly CS related).

Gary Bernhardt

Another “not exactly a blog”, but it's more informative than most blogs, not to mention more entertaining. This is the best “blog” on the pervasive brokenness of modern software that I know of.

Jaana Dogan

rakyll.org has posts on Go, some of which are quite in depth (e.g., this set of notes on the Go generics proposal), and Jaana's medium blog has some posts on Go as well as posts on various topics in distributed systems.

Also, Jaana's Twitter has what I think of as "intellectually honest critiques of the industry", which I think is unusual for critiques of the industry on Twitter. It's more typical to see people scoring points at the expense of nuance or even being vaguely in the vicinity of correctness, which is why I think it's worth calling out these honest critiques.

Jamie Brandon

I'm so happy that I managed to convince Jamie that, given his preferences, it would make sense to take a crack at blogging full-time to support himself. From when Jamie started taking donations until today, this blog has been an absolute powerhouse, with posts like this series on problems with SQL, this series on streaming systems, great work on technical projects like dida and imp, etc.

It remains to be seen whether or not Jamie will be able to convince me to try blogging as a full-time job.

Janet Davis

This is the story of how a professor moved from Grinnell to Whitman and started a CS program from scratch. The archives are great reading if you're interested in how organizations form or CS education.

Jeff Preshing

Mostly technical content relating to C++ and Python, but also includes topics that are generally useful for programmers, such as read-modify-write operations, fixed-point math, and memory models.

Jessica Kerr

Jessica is probably better known for her talks than her blog? Her talks are great! My favorite is probably this talk, which explains different concurrency models in an easy to understand way, but the blog also has a lot of material I like.

As is the case with her talks, the diagrams often take a concept and clarify it, making something that wasn't obvious seem very obvious in retrospect.

John Regehr

I think of this as the “C is harder than you think, even if you think C is really hard” blog, although the blog actually covers a lot more than that. Some commonly covered topics are fuzzing, compiler optimization, and testing in general.

Posts tend to be conceptual. When there are code examples, they're often pretty easy to read, but there are also examples of bizarro behavior that won't be easy to skim unless you're someone who knows the C standard by heart.

Juho Snellman

A lot of posts about networking, generally written so that they make sense even with minimal networking background. I wish more people with this kind of knowledge (in depth knowledge of systems, not just networking knowledge in particular) would write up explanations for a general audience. Also has interesting non-networking content, like this post on Finnish elections.

Julia Evans

AFAICT, the theme is “things Julia has learned recently”, which can be anything from Huffman coding to how to be happy when working in a remote job. When the posts are on a topic I don't already know, I learn something new. When they're on a topic I know, they remind me that the topic is exciting and contains a lot of wonder and mystery.

Many posts have more questions than answers, and are more of a live-blogged exploration of a topic than an explanation of the topic.

Karla Burnett

A mix of security-related topics and explanations of practical programming knowledge. This article on phishing, which includes a set of fun case studies on how effective phishing can be, even after people take anti-phishing training, is an example of a security post, as is this post on printing out text via tracert. This post on writing an SSH client and this post on some coreutils puzzles are examples of practical programming explanations.

Although the blog is security oriented, posts are written for a general audience and don't assume specific expertise in security.

Kate Murphy

Mostly small, self-contained explorations, like: what's up with this Python integer behavior, how do you make git blow up with a simple repo, or how do you generate hash collisions in Lua?

Kavya Joshi

I generally prefer technical explanations in text over video, but her exposition is so clear that I'm putting these talks in this list of blogs. Some examples include an explanation of the go race detector, simple math that's handy for performance modeling, and time.

Kyle Kingsbury

90% of Kyle's posts are explanations of distributed systems testing, which expose bugs in real systems that most of us rely on. The other 10% are musings on programming that are as rigorous as Kyle's posts on distributed systems. Possibly the most educational programming blog of all time.

For those of us without a distributed systems background, understanding posts often requires a bit of Googling, despite the extensive explanations in the posts. Most new posts are now at jepsen.io

Laura Lindzey

Very infrequently updated (on the order of once a year) with explanations of things Laura has been working on, from Origami PCB to Ice-Penetrating Radar.

Laurie Tratt

This blog has been going since 2004 and it's changed over the years. Recently, it's had some of the best posts on benchmarking around:

  • VM performance, part 1
    • Thoroughly refutes the idea that you can run a language VM for some warmup period and then take some numbers when they become stable
  • VM performance, part 2
  • Why not use minimum times when benchmarking
    • "Everyone" who's serious about performance knows this and it's generally considered too obvious to write up, but this is still a widely used technique in benchmarking even though it's only appropriate in limited circumstances

The blog isn't purely technical; this blog post on advice is also stellar. If those posts don't sound interesting to you, it's worth checking out the archives to see if some of the topics Laurie used to write about more frequently are to your taste.

Marc Brooker

A mix of theory and wisdom from a distributed systems engineer on EBS at Amazon. The theory posts tend to be relatively short and easy to swallow; not at all intimidating, as theory sometimes is.

Marek Majkowski

This used to be a blog about random experiments Marek was doing, like this post on bitsliced SipHash. Since Marek joined Cloudflare, this has turned into a list of things Marek has learned while working in Cloudflare's networking stack, like this story about debugging slow downloads.

Posts tend to be relatively short, but with enough technical specifics that they're not light reads.

Nicole Express

Explorations on old systems, often gaming related. Some examples are this post on collision detection in Alf for the Sega Master System, this post on getting decent quality output from composite video, and this post on the Neo Geo CDZ.

Nikita Prokopov

Nikita has two blogs, both on related topics. The main blog has long-form articles, often about how modern software is terrible. Then there's grumpy.website, which gives examples of software being terrible.

Nitsan Wakart

More than you ever wanted to know about writing fast code for the JVM, from how GC affects data structures to the subtleties of volatile reads.

Posts tend to involve lots of Java code, but the takeaways are often language agnostic.

Oona Raisanen

Adventures in signal processing. Everything from deblurring barcodes to figuring out what those signals from helicopters mean. If I'd known that signals and systems could be this interesting, I would have paid more attention in class.

Paul Khuong

Some content on Lisp, and some on low-level optimizations, with a trend towards low-level optimizations.

Posts are usually relatively long and self-contained explanations of technical ideas with very little fluff.

Rachel Kroll

Years of debugging stories from a long-time SRE, along with stories about big company nonsense. Many of the stories come from Lyft, Facebook, and Google. They're anonymized, but if you know about the companies, you can tell which ones are which.

The degree of anonymization often means that the stories won't really make sense unless you're familiar with the operation of systems similar to the ones in the stories.

Sophie Haskins

A blog about restoring old "pizza box" computers, with posts that generally describe the work that goes into getting these machines working again.

An example is the HP 712 ("low cost" PA-RISC workstations that went for roughly $5k to $15k in 1994 dollars, which ended up doomed due to the Intel workstation onslaught that started with the Pentium Pro in 1995), where the restoration process is described here in part 1 and then here in part 2.

Vyacheslav Egorov

In-depth explanations on how V8 works and how various constructs get optimized by a compiler dev on the V8 team. If I'd known compilers were this interesting, I would have taken a compilers class back when I was in college.

Often takes topics that are considered hard and explains them in a way that makes them seem easy. Lots of diagrams, where appropriate, and detailed exposition on all the tricky bits.

whitequark

Her main site links to a variety of interesting tools she's made or worked on, many of which are FPGA or open hardware related, but some of which are completely different. Whitequark's lab notebook has a really wide variety of different results, from things like undocumented hardware quirks, to fairly serious home chemistry experiments, to various tidbits about programming and hardware development (usually low level, but not always).

She's also fairly active on twitter, with some commentary on hardware/firmware/low-level programming combined with a set of diverse topics that's too broad to easily summarize.

Yossi Kreinin

Mostly dormant since the author started doing art, but the archives have a lot of great content about hardware, low-level software, and general programming-related topics that aren't strictly programming.

90% of the time, when I get the desire to write a post about a common misconception software folks have about hardware, Yossi has already written the post and taken a lot of flak for it so I don't have to :-).

I also really like Yossi's career advice, like this response to Patrick McKenzie and this post on how managers get what they want and not what they ask for.

He's active on Twitter, where he posts extremely cynical and snarky takes on management and the industry.

This blog?

Common themes include:

The end

This list also doesn't include blogs that mostly aren't about programming, so it doesn't include, for example, Ben Kuhn's excellent blog.

Anyway, that's all for now, but this list is pretty much off the top of my head, so I'll add more as more blogs come to mind. I'll also keep this list updated with what I'm reading as I find new blogs. Please please please suggest other blogs I might like, and don't assume that I already know about a blog because it's popular. Just for example, I had no idea who either Jeff Atwood or Zed Shaw were until a few years ago, and they were probably two of the most well known programming bloggers in existence. Even with centralized link aggregators like HN and reddit, blog discovery has become haphazard and random with the decline of blogrolls and blogging as a dialogue, as opposed to the current practice of blogging as a monologue. Also, please don't assume that I don't want to read something just because it's different from the kind of blog I normally read. I'd love to read more from UX or front-end folks; I just don't know where to find that kind of thing!

Last update: 2021-07

Archive

Here are some blogs I've put into an archive section because they rarely or never update.

Alex Clemmer

This post on why making a competitor to Google search is harder than it sounds is a post in classic Alex Clemmer style. The post looks at a position that's commonly believed (web search isn't all that hard and someone should come up with a better Google) and explains why that's not an obviously correct position. That's also a common theme of his comments elsewhere, such as these comments on stack ranking at MS, implementing POSIX on Windows, the size of the Windows codebase, Bond, and Bing.

He's sort of a modern mini-MSFT, in that it's incisive commentary on MS and MS related ventures.

Allison Kaptur

Explorations of various areas, often Python related, such as this series on the Python interpreter and this series on the CPython peephole optimizer. Also, thoughts on broader topics like debugging and learning.

Often detailed, with inline code that's meant to be read and understood (with the help of exposition that's generally quite clear).

David Dalrymple

A mix of things from writing a 64-bit kernel from scratch shortly after learning assembly to a high-level overview of computer systems. Rarely updated, with few posts, but each post has a lot to think about.

EPITA Systems Lab

Low-level. A good example of a relatively high-level post from this blog is this post on the low fragmentation heap in Windows. Posts like how to hack a pinball machine and how to design a 386 compatible dev board are typical.

Posts are often quite detailed, with schematic/circuit diagrams. This is relatively heavy reading and I try to have pen and paper handy when I'm reading this blog.

Greg Wilson

Write-ups of papers that (should) have an impact on how people write software, like this paper on what causes failures in distributed systems or this paper on what makes people feel productive. Not updated much, but Greg still blogs on his personal site.

The posts tend to be extended abstracts that tease you into reading the paper, rather than detailed explanations of the methodology and results.

Gustavo Duarte

Explanations of how Linux works, as well as other low-level topics. This particular blog seems to be on hiatus, but "0xAX" seems to have picked up the slack with the linux-insides project.

If you've read Love's book on Linux, Duarte's explanations are similar, but tend to be more about the idea and less about the implementation. They're also heavier on providing diagrams and context. "0xAX" is a lot more focused on walking through the code than either Love or Duarte.

Huon Wilson

Explanations of various Rust-y things, from back when Huon was working on Rust. Not updated much anymore, but the content is still great for someone who's interested in technical tidbits related to Rust.

Kamal Marhubi

Technical explorations of various topics, with a systems-y bent. Kubernetes. Git push. Syscalls in Rust. Also, some musings on programming in general.

The technical explorations often get into enough nitty gritty detail that this is something you probably want to sit down to read, as opposed to skim on your phone.

Mary Rose Cook

Lengthy and very-detailed explanations of technical topics, mixed in with a wide variety of other posts.

The selection of topics is eclectic, and explained at a level of detail such that you'll come away with a solid understanding of the topic. The explanations are usually fine grained enough that it's hard to miss what's going on, even if you're a beginner programmer.

Rebecca Frankel

As far as I know, Rebecca doesn't have a programming blog, but if you look at her apparently off-the-cuff comments on other people's posts as a blog, it's one of the best written programming blogs out there. She used to be prolific on Piaw's Buzz (and probably elsewhere, although I don't know where), and you occasionally see comments elsewhere, like on this Steve Yegge blog post about brilliant engineers1. I wish I could write like that.

Russell Smith

Homemade electronics projects from vim on a mechanical typewriter to building an electrobalance to proof spirits.

Posts tend to have a fair bit of detail, down to diagrams explaining parts of circuits, but the posts aren't as detailed as specs. But there are usually links to resources that will teach you enough to reproduce the project, if you want.

RWT

I find the archives to be fun reading for insight into what people were thinking about microprocessors and computer architecture over the past two decades. It can be a bit depressing to see that the same benchmarking controversies we had 15 years ago are being repeated today, sometimes with the same players. If anything, I'd say that the average benchmark you see passed around today is worse than what you would have seen 15 years ago, even though the industry as a whole has learned a lot about benchmarking since then.

walpurgisriot

The author of walpurgisriot seems to have abandoned the github account and moved on to another user name (and a squatter appears to have picked up her old account name), but this used to be a semi-frequently updated blog with a combination of short explorations on programming and thoughts on the industry. On pure quality of prose, this is one of the best tech blogs I've ever read; the technical content and thoughts on the industry are great as well.

This post was inspired by the two posts Julia Evans has on blogs she reads and by the Chicago undergraduate mathematics bibliography, which I've found to be the most useful set of book reviews I've ever encountered.

Thanks to Bartłomiej Filipek and Sean Barrett, Michel Schniz, Neil Henning, and Lindsey Kuper for comments/discussion/corrections.


  1. Quote follows below, since I can see from my analytics data that relatively few people click any individual link, and people seem especially unlikely to click a link to read a comment on a blog, even if the comment is great:

    The key here is "principally," and that I am describing motivation, not self-evaluation. The question is, what's driving you? What gets you working? If its just trying to show that you're good, then you won't be. It has to be something else too, or it won't get you through the concentrated decade of training it takes to get to that level.

    Look at the history of the person we're all presuming Steve Yegge is talking about. He graduated (with honors) in 1990 and started at Google in 1999. So he worked a long time before he got to the level of Google's star. When I was at Google I hung out on Sunday afternoons with a similar superstar. Nobody else was reliably there on Sunday; but he always was, so I could count on having someone to talk to. On some Sundays he came to work even when he had unquestionably legitimate reasons for not feeling well, but he still came to work. Why didn't he go home like any normal person would? It wasn't that he was trying to prove himself; he'd done that long ago. What was driving him?

    The only way I can describe it is one word: fury. What was he doing every Sunday? He was reviewing various APIs that were being proposed as standards by more junior programmers, and he was always finding things wrong with them. What he would talk about, or rather, rage about, on these Sunday afternoons was always about some idiocy or another that someone was trying make standard, and what was wrong with it, how it had to be fixed up, etc, etc. He was always in a high dudgeon over it all.

    What made him come to work when he was feeling sick and dizzy and nobody, not even Larry and Sergey with their legendary impatience, not even them, I mean nobody would have thought less of him if he had just gone home & gone to sleep? He seemed to be driven, not by ambition, but by fear that if he stopped paying attention, something idiotically wrong (in his eyes) might get past him, and become the standard, and that was just unbearable, the thought made him so incoherently angry at the sheer wrongness of it, that he had to stay awake and prevent it from happening no matter how legitimately bad he was feeling at the time.

    It made me think of Paul Graham's comment: "What do I mean by good people? One of the best tricks I learned during our startup was a rule for deciding who to hire. Could you describe the person as an animal?... I mean someone who takes their work a little too seriously; someone who does what they do so well that they pass right through professional and cross over into obsessive.

    What it means specifically depends on the job: a salesperson who just won't take no for an answer; a hacker who will stay up till 4:00 AM rather than go to bed leaving code with a bug in it; a PR person who will cold-call New York Times reporters on their cell phones; a graphic designer who feels physical pain when something is two millimeters out of place."

    I think a corollary of this characterization is that if you really want to be "an animal," what you have cultivate in yourself is partly ambition, but it is partly also self-knowledge. As Paul Graham says, there are different kinds of animals. The obsessive graphic designer might be unconcerned about an API that is less than it could be, while the programming superstar might pass by, or create, a terrible graphic design without the slightest twinge of misgiving.

    Therefore, key question is: are you working on the thing you care about most? If its wrong, is it unbearable to you? Nothing but deep seated fury will propel you to the level of a superstar. Getting there hurts too much; mere desire to be good is not enough. If its not in you, its not in you. You have to be propelled by elemental wrath. Nothing less will do.

    Or it might be in you, but just not in this domain. You have to find what you care about, and not just what you care about, but what you care about violently: you can't fake it.

    (Also, if you do have it in you, you still have to choose your boss carefully. No matter how good you are, it may not be trivial to find someone you can work for. There's more to say here; but I'll have to leave it for another comment.)

    Another clarification of my assertion "if you're wondering if you're good, then you're not" should perhaps be said "if you need reassurance from someone else that you're good, then you're not." One characteristic of these "animals" is that they are such obsessive perfectionists that their own internal standards so far outstrip anything that anyone else could hold them to, that no ordinary person (i.e. ordinary boss) can evaluate them. As Steve Yegge said, they don't go for interviews. They do evaluate each other -- at Google the superstars all reviewed each other's code, reportedly brutally -- but I don't think they cared about the judgments of anyone who wasn't in their circle or at their level.

    I agree with Steve Yegge's assertion that there are an enormously important (small) group of people who are just on another level, and ordinary smart hardworking people just aren't the same. Here's another way to explain why there should be a quantum jump -- perhaps I've been using this discussion to build up this idea: its the difference between people who are still trying to do well on a test administered by someone else, and the people who have found in themselves the ability to grade their own test, more carefully, with more obsessive perfectionism, than anyone else could possibly impose on them.

    School, for all it teaches, may have one bad lasting effect on people: it gives them the idea that good people get A's on tests, and better ones get A+'s on tests, and the very best get A++'s. Then you get the idea that you go out into the real world, and your boss is kind of super-professor, who takes over the grading of the test. Joel Spolsky is accepting that role, being boss as super-professor, grading his employees tests for them, telling them whether they are good.

    But the problem is that in the real world, the very most valuable, most effective people aren't the ones who are trying to get A+++'s on the test you give them. The very best people are the ones who can make up their own test with harder problems on it than you could ever think of, and you'd have to have studied for the same ten years they have to be able even to know how to grade their answers.

    That's a problem, incidentally, with the idea of a meritocracy. School gives you an idea of a ladder of merit that reaches to the top. But it can't reach all the way to the top, because someone has to measure the rungs. At the top you're not just being judged on how high you are on the ladder. You're also being judged on your ability to "grade your own test"; that is to say, your trustworthiness. People start asking whether you will enforce your own standards even if no one is imposing them on you. They have to! because at the top people get given jobs with the kind of responsibility where no one can possibly correct you if you screw up. I'm giving you an image of someone who is working himself sick, literally, trying grade everyone else's work. In the end there is only so much he can do, and he does want to go home and go to bed sometimes. That means he wants people under him who are not merely good, but can be trusted not to need to be graded. Somebody has to watch the watchers, and in the end, the watchers have to watch themselves.

    [return]

Google SRE book

2016-04-11 16:00:58

The book starts with a story about a time Margaret Hamilton brought her young daughter with her to NASA, back in the days of the Apollo program. During a simulation mission, her daughter caused the mission to crash by pressing some keys that caused a prelaunch program to run during the simulated mission. Hamilton submitted a change request to add error checking code to prevent the error from happening again, but the request was rejected because the error case should never happen.

On the next mission, Apollo 8, that exact error condition occurred and a potentially fatal problem that could have been prevented with a trivial check took NASA’s engineers 9 hours to resolve.

This sounds familiar -- I’ve lost track of the number of dev post-mortems that have the same basic structure.

This is an experiment in note-taking for me in two ways. First, I normally take pen and paper notes and then scan them in for posterity. Second, I normally don’t post my notes online, but I’ve been inspired to try this by Jamie Brandon’s notes on books he’s read. My handwritten notes are a series of bullet points, which may not translate well into markdown. One issue is that my markdown renderer doesn’t handle more than one level of nesting, so things will get artificially flattened. There are probably more issues. Let’s find out what they are! In case it's not obvious, asides from me are in italics.

Chapter 1: Introduction

Everything in this chapter is covered in much more detail later.

Two approaches to hiring people to manage system stability:

Traditional approach: sysadmins

  • Assemble existing components and deploy to produce a service
  • Respond to events and updates as they occur
  • Grow team to absorb increased work as service grows
  • Pros
    • Easy to implement because it’s standard
    • Large talent pool to hire from
    • Lots of available software
  • Cons
    • Manual intervention for change management and event handling causes size of team to scale with load on system
    • Ops is fundamentally at odds with dev, which can cause pathological resistance to changes, which causes a similarly pathological response from devs, who reclassify “launches” as “incremental updates”, “flag flips”, etc.

Google’s approach: SREs

  • Have software engineers do operations
  • Candidates should be able to pass or nearly pass normal dev hiring bar, and may have some additional skills that are rare among devs (e.g., L1 - L3 networking or UNIX system internals).
  • Career progress comparable to dev career track
  • Results
    • SREs would be bored by doing tasks by hand
    • Have the skillset necessary to automate tasks
    • Do the same work as an operations team, but with automation instead of manual labor
  • To avoid manual labor trap that causes team size to scale with service load, Google places a 50% cap on the amount of “ops” work for SREs
    • Upper bound. Actual amount of ops work is expected to be much lower
  • Pros
    • Cheaper to scale
    • Circumvents devs/ops split
  • Cons
    • Hard to hire for
    • May be unorthodox in ways that require management support (e.g., product team may push back against decision to stop releases for the quarter because the error budget is depleted)

I don’t really understand how this is an example of circumventing the dev/ops split. I can see how it’s true in one sense, but the example of stopping all releases because an error budget got hit doesn’t seem fundamentally different from the “sysadmin” example where teams push back against launches. It seems that SREs have more political capital to spend and that, in the specific examples given, the SREs might be more reasonable, but there’s no reason to think that sysadmins can’t be reasonable.

Tenets of SRE

  • SRE team responsible for latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning

Ensuring a durable focus on engineering

  • 50% ops cap means that extra ops work is redirected to product teams on overflow
  • Provides feedback mechanism to product teams as well as keeps load down
  • Target max 2 events per 8-12 hour on-call shift
  • Postmortems for all serious incidents, even if they didn’t trigger a page
  • Blameless postmortems

2 events per shift is the max, but what’s the average? How many on-call events are expected to get sent from the SRE team to the dev team per week?

How do you get from a blameful postmortem culture to a blameless postmortem culture? Now that everyone knows that you should have blameless postmortems, everyone will claim to do them. Sort of like having good testing and deployment practices. I’ve been lucky to be on an on call rotation that’s never gotten paged, but when I talk to folks who joined recently and are on call, they have not so great stories of finger pointing, trash talk, and blame shifting. The fact that everyone knows you’re supposed to be blameless seems to make it harder to call out blamefulness, not easier.

Move fast without breaking SLO

  • Error budget. 100% is the wrong reliability target for basically everything
  • Going from 5 9s to 100% reliability isn't noticeable to most users and requires tremendous effort (see the arithmetic after this list)
  • Set a goal that acknowledges the trade-off and leaves an error budget
  • Error budget can be spent on anything: launching features, etc.
  • Error budget allows for discussion about how phased rollouts and 1% experiments can maintain tolerable levels of errors
  • Goal of SRE team isn’t “zero outages” -- SRE and product devs are incentive aligned to spend the error budget to get maximum feature velocity

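To put rough numbers on "going from 5 9s to 100%" (my arithmetic, not the book's): the downtime an availability target allows is just (1 - target) * period, so per year

```
99.9%   -> 0.001   * 525,600 min ≈ 8.8 hours of downtime allowed per year
99.99%  -> 0.0001  * 525,600 min ≈ 53 minutes per year
99.999% -> 0.00001 * 525,600 min ≈ 5.3 minutes per year
```

Each additional 9 takes a lot of engineering effort while being nearly invisible to users, which is the argument for spending that margin on launches instead.
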
It’s not explicitly stated, but for teams that need to “move fast”, consistently coming in way under the error budget could be taken as a sign that the team is spending too much effort on reliability.

I like this idea a lot, but when I discussed this with Jessica Kerr, she pushed back on this idea because maybe you’re just under your error budget because you got lucky and a single really bad event can wipe out your error budget for the next decade. Followup question: how can you be confident enough in your risk model that you can purposefully consume error budget to move faster without worrying that a downstream (in time) bad event will put you overbudget? Nat Welch (a former Google SRE) responded to this by saying that you can build confidence through simulated disasters and other testing.

Monitoring

  • Monitoring should never require a human to interpret any part of the alerting domain
  • Three valid kinds of monitoring output
    • Alerts: human needs to take action immediately
    • Tickets: human needs to take action eventually
    • Logging: no action needed
    • Note that, for example, graphs are a type of log

Emergency Response

  • Reliability is a function of MTTF (mean-time-to-failure) and MTTR (mean-time-to-recovery); see the formula after this list
  • For evaluating responses, we care about MTTR
  • Humans add latency
  • Systems that don’t require humans to respond will have higher availability due to lower MTTR
  • Having a “playbook” produces 3x lower MTTR
    • Having hero generalists who can respond to everything works, but having playbooks works better

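The notes don't spell out the formula; the standard steady-state definition (my addition, not a quote from the book) is

```
availability = MTTF / (MTTF + MTTR)
```

so, for example, with an MTTF of 30 days, cutting MTTR from 60 minutes to 20 minutes takes you from roughly 99.86% to 99.95% available, which is the sort of improvement the 3x playbook claim implies.
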
I personally agree, but boy do we like our on call heroes. I wonder how we can foster a culture of documentation.

Change management

  • 70% of outages due to changes in a live system. Mitigation:
    • Implement progressive rollouts
    • Monitoring
    • Rollback
  • Remove humans from the loop, avoid standard human problems on repetitive tasks

Demand forecasting and capacity planning

  • Straightforward, but a surprising number of teams/services don’t do it

Provisioning

  • Adding capacity riskier than load shifting, since it often involves spinning up new instances/locations, making significant changes to existing systems (config files, load balancers, etc.)
  • Expensive enough that it should be done only when necessary; must be done quickly
    • If you don’t know what you actually need and overprovision that costs money

Efficiency and performance

  • Load slows down systems
  • SREs provision to meet capacity target with a specific response time goal
  • Efficiency == money

Chapter 2: The production environment at Google, from the viewpoint of an SRE

No notes on this chapter because I’m already pretty familiar with it. TODO: maybe go back and read this chapter in more detail.

Chapter 3: Embracing risk

  • Ex: if a user is on a smartphone with 99% reliability, they can’t tell the difference between 99.99% and 99.999% reliability

Managing risk

  • Reliability isn’t linear in cost. It can easily cost 100x more to get one additional increment of reliability
    • Cost associated with redundant equipment
    • Cost of building out features for reliability as opposed to “normal” features
    • Goal: make systems reliable enough, but not too reliable!

Measuring service risk

  • Standard practice: identify metric to represent property of system to optimize
  • Possible metric = uptime / (uptime + downtime)
    • Problematic for a globally distributed service. What does uptime really mean?
  • Aggregate availability = successful requests / total requests
    • Obv, not all requests are equal, but aggregate availability is an ok first order approximation
  • Usually set quarterly targets

Risk tolerance of services

  • Usually not objectively obvious
  • SREs work with product owners to translate business objectives into explicit objectives

Identifying risk tolerance of consumer services

TODO: maybe read this in detail on second pass

Identifying risk tolerance of infrastructure services

Target availability

  • Running ex: Bigtable
    • Some consumer services serve data directly from Bigtable -- need low latency and high reliability
    • Some teams use bigtable as a backing store for offline analysis -- care more about throughput than reliability
  • Too expensive to meet all needs generically
    • Ex: Bigtable instance
    • Low-latency Bigtable user wants low queue depth
    • Throughput oriented Bigtable user wants moderate to high queue depth
    • Success and failure are diametrically opposed in these two cases!
Cost
  • Partition infra and offer different levels of service
  • In addition to obv. benefits, allows service to externalize the cost of providing different levels of service (e.g., expect latency oriented service to be more expensive than throughput oriented service)

Motivation for error budgets

No notes on this because I already believe all of this. Maybe go back and re-read this if involved in debate about this.

Chapter 4: Service level objectives

Note: skipping notes on terminology section.

  • Ex: Chubby planned outages
    • Google found that Chubby was consistently over its SLO, and that global Chubby outages would cause unusually bad outages at Google
    • Chubby was so reliable that teams were incorrectly assuming that it would never be down and failing to design systems that account for failures in Chubby
    • Solution: take Chubby down globally when it’s too far above its SLO for a quarter to “show” teams that Chubby can go down

What do you and your users care about?

  • Too many indicators: hard to pay attention
  • Too few indicators: might ignore important behavior
  • Different classes of services should have different indicators
    • User-facing: availability, latency, throughput
    • Storage: latency, availability, durability
    • Big data: throughput, end-to-end latency
  • All systems care about correctness

Collecting indicators

  • Can often do naturally from server, but client-side metrics sometimes needed.

Aggregation

  • Use distributions and not averages
  • User studies show that people usually prefer slower average with better tail latency
  • Standardize on common defs, e.g., average over 1 minute, average over tasks in cluster, etc.
    • Can have exceptions, but having reasonable defaults makes things easier

Choosing targets

  • Don’t pick target based on current performance
    • Current performance may require heroic effort
  • Keep it simple
  • Avoid absolutes
    • Unreasonable to talk about “infinite” scale or “always” available
  • Minimize number of SLOs
  • Perfection can wait
    • Can always redefine SLOs over time
  • SLOs set expectations
    • Keep a safety margin (internal SLOs can be defined more loosely than external SLOs)
  • Don’t overachieve
    • See Chubby example, above
    • Another example is making sure that the system isn’t too fast under light loads

Chapter 5: Eliminating toil

Carla Geisser: "If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow."

  • Def: Toil
    • Not just “work I don’t want to do”
    • Manual
    • Repetitive
    • Automatable
    • Tactical
    • No enduring value
    • O(n) with service growth
  • In surveys, find 33% toil on average
    • Numbers can be as low as 0% and as high as 80%
    • Toil > 50% is a sign that the manager should spread toil load more evenly
  • Is toil always bad?
    • Predictable and repetitive tasks can be calming
    • Can produce a sense of accomplishment, can be low-risk / low-stress activities

Section on why toil is bad. Skipping notetaking for that section.

Chapter 6: Monitoring distributed systems

  • Why monitor?
    • Analyze long-term trends
    • Compare over time or do experiments
    • Alerting
    • Building dashboards
    • Debugging

As Alex Clemmer is wont to say, our problem isn’t that we move too slowly, it’s that we build the wrong thing. I wonder how we could get from where we are today to having enough instrumentation to be able to make informed decisions when building new systems.

Setting reasonable expectations

  • Monitoring is non-trivial
  • 10-12 person SRE team typically has 1-2 people building and maintaining monitoring
  • Number has decreased over time due to improvements in tooling/libs/centralized monitoring infra
  • General trend towards simpler/faster monitoring systems, with better tools for post hoc analysis
  • Avoid “magic” systems
  • Limited success with complex dependency hierarchies (e.g., “if DB slow, alert for DB, otherwise alert for website”).
    • Used mostly (only?) for very stable parts of system
  • Rules that generate alerts for humans should be simple to understand and represent a clear failure

Avoiding magic includes avoiding ML?

  • Lots of white-box monitoring
  • Some black-box monitoring for critical stuff
  • Four golden signals
    • Latency
    • Traffic
    • Errors
    • Saturation

Interesting examples from Bigtable and Gmail from chapter not transcribed. A lot of information on the importance of keeping alerts simple also not transcribed.

The long run

  • There’s often a tension between long-run and short-run availability
  • Can sometimes fix unreliable systems through heroic effort, but that’s a burnout risk and also a failure risk
  • Taking a controlled hit in short-term reliability is usually the better trade

Chapter 7: Evolution of automation at Google

  • “Automation is a force multiplier, not a panacea”
  • Value of automation
    • Consistency
    • Extensibility
    • MTTR
    • Faster non-repair actions
    • Time savings

Multiple interesting case studies and explanations skipped in notes.

Chapter 8: Release engineering

  • This is a specific job function at Google

Release engineer role

  • Release engineers work with SWEs and SREs to define how software is released
    • Allows dev teams to focus on dev work
  • Define best practices
    • Compiler flags, formats for build ID tags, etc.
  • Releases automated
  • Models vary between teams
    • Could be “push on green” and deploy every build
    • Could be hourly builds and deploys
    • etc.
  • Hermetic builds
    • Building same rev number should always give identical results
    • Self-contained -- this includes versioning everything down to the compiler used
    • Can cherry-pick fixes against an old rev to fix production software
  • Virtually all changes require code review
  • Branching
    • All code in main branch
    • Releases are branched off
    • Fixes can go from master to branch
    • Branches never merged back
  • Testing
    • CI
    • Release process creates an audit trail that runs tests and shows that tests passed
  • Config management
  • Many possible schemes (all involve storing config in source control and having strict config review)
  • Use mainline for config -- config maintained at head and applied immediately
    • Originally used for Borg (and pre-Borg systems)
    • Binary releases and config changes decoupled!
  • Include config files and binaries in same package
    • Simple
    • Tightly couples binary and config -- ok for projects with few config files or where few configs change
  • Package config into “configuration packages”
    • Same hermetic principle as for code
  • Release engineering shouldn’t be an afterthought!
    • Budget resources at beginning of dev cycle

Chapter 9: Simplicity

  • Stability vs. agility
    • Can make things stable by freezing -- need to balance the two
    • Reliable systems can increase agility
    • Reliable rollouts make it easier to link changes to bugs
  • Virtue of boring!
  • Essential vs. accidental complexity
    • SREs should push back when accidental complexity is introduced
  • Code is a liability
    • Remove dead code or other bloat
  • Minimal APIs
    • Smaller APIs easier to test, more reliable
  • Modularity
    • API versioning
    • Same as code, where you’d avoid misc/util classes
  • Releases
    • Small releases easier to measure
    • Can’t tell what happened if we released 100 changes together

Chapter 10: Alerting from time-series data

Borgmon

  • Similar-ish to Prometheus
  • Common data format for logging
  • Data used for both dashboards and alerts
  • Formalized a legacy data format, “varz”, which allowed metrics to be viewed via HTTP
  • Adding a metric only requires a single declaration in code
    • low user-cost to add new metric
  • Borgmon fetches /varz from each target periodically
    • Also includes synthetic data like health check results, whether the name was resolved, etc.
  • Time series arena
    • Data stored in-memory, with checkpointing to disk
    • Fixed sized allocation
    • GC expires oldest entries when full
    • conceptually a 2-d array with time on one axis and items on the other axis
    • 24 bytes for a data point -> 1M unique time series for 12 hours at 1-minute intervals = 17 GB (arithmetic checked in the sketch after this list)
  • Borgmon rules
    • Algebraic expressions
    • Compute time-series from other time-series
    • Rules evaluated in parallel on a threadpool
  • Counters vs. gauges
    • Def: counters are non-decreasing
    • Def: gauges can take any value
    • Counters preferred to gauges because gauges can lose information depending on sampling interval
  • Alerting
    • Borgmon rules can trigger alerts
    • Have minimum duration to prevent “flapping”
    • Usually set to two duration cycles so that missed collections don’t trigger an alert
  • Scaling
    • Borgmon can take time-series data from other Borgmon (uses binary streaming protocol instead of the text-based varz protocol)
    • Can have multiple tiers of filters
  • Prober
    • Black-box monitoring that monitors what the user sees
    • Can be queried via varz or can send alerts directly to Alertmanager
  • Configuration
    • Separation between definition of rules and targets being monitored
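
The arena-sizing arithmetic mentioned above, spelled out (these are the book's example numbers, not a general rule):

    # Memory needed for the in-memory time-series arena, per the example above:
    # 1M series, 12 hours at 1-minute resolution, 24 bytes per data point.
    bytes_per_point = 24
    series = 1_000_000
    points_per_series = 12 * 60                  # 720 samples per series
    total_bytes = bytes_per_point * series * points_per_series
    print(f"{total_bytes / 1e9:.1f} GB")         # ~17.3 GB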

Chapter 11: Being on-call

  • Typical response time
    • 5 min for user-facing or other time-critical tasks
    • 30 min for less time-sensitive stuff
  • Response times linked to SLOs
    • Ex: 99.99% for a quarter is 13 minutes of downtime; clearly can’t have response time above 13 minutes (arithmetic sketched after this list)
    • Services with looser SLOs can have response times in the 10s of minutes (or more?)
  • Primary vs secondary on-call
    • Work distribution varies by team
    • In some, secondary can be backup for primary
    • In others, secondary handles non-urgent / non-paging events, primary handles pages
  • Balanced on-call
    • Def: quantity: percent of time on-call
    • Def: quality: number of incidents that occur while on call
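
A quick check of the downtime-budget arithmetic referenced above, treating a quarter as roughly 90 days:

    # Downtime budget implied by an availability target over a ~90-day quarter.
    minutes_per_quarter = 90 * 24 * 60           # 129,600 minutes
    for slo in (0.999, 0.9999):
        budget = (1 - slo) * minutes_per_quarter
        print(f"{slo:.2%} -> {budget:.0f} minutes of downtime per quarter")
    # 99.90% -> 130 minutes; 99.99% -> 13 minutes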

This is great. We should do this. People sometimes get really rough on-call rotations a few times in a row and, considering the infrequency of on-call rotations, there’s no reason to expect that this should randomly balance out over the course of a year or two.

  • Balance in quantity
    • >= 50% of SRE time goes into engineering
    • Of remainder, no more than 25% spent on-call
  • Prefer multi-site teams
    • Night shifts are bad for health, multi-site teams allow elimination of night shifts
  • Balance in quality
    • On average, dealing with an incident (incl root-cause analysis, remediation, writing postmortem, fixing bug, etc.) takes 6 hours.
    • => shouldn’t have more than 2 incidents in a 12-hour on-call shift
    • To stay within upper bound, want very flat distribution of pages, with median value of 0
  • Compensation -- extra pay for being on-call (time-off or cash)

Chapter 12: Effective troubleshooting

No notes for this chapter.

Chapter 13: Emergency response

  • Test-induced emergency
  • Ex: want to flush out hidden dependencies on a distributed MySQL database
    • Plan: block access to 1/100 of DBs
    • Response: dependent services report that they’re unable to access key systems
    • SRE response: SRE aborts exercise, tries to roll back permissions change
    • Rollback attempt fails
    • Attempt to restore access to replicas works
    • Normal operation restored in 1 hour
    • What went well: dependent teams escalated issues immediately, were able to restore access
    • What we learned: had an insufficient understanding of the system and its interaction with other systems, failed to follow incident response that would have informed customers of outage, hadn’t tested rollback procedures in test env
  • Change-induced emergency
    • Changes can cause failures!
  • Ex: config change to abuse prevention infra pushed on Friday triggered crash-loop bug
    • Almost all externally facing systems depend on this, become unavailable
    • Many internal systems also have dependency and become unavailable
    • Alerts start firing within seconds
    • Within 5 minutes of config push, engineer who pushed change rolled back change and services started recovering
    • What went well: monitoring fired immediately, incident management worked well, out-of-band communications systems kept people up to date even though many systems were down, luck (engineer who pushed change was following real-time comms channels, which isn’t part of the release procedure)
    • What we learned: push to canary didn’t trigger same issue because it didn’t hit a specific config keyword combination; push was considered low-risk and went through less stringent canary process, alerting was too noisy during outage
  • Process-induced emergency

No notes on process-induced example.

Chapter 14: Managing incidents

This is an area where we seem to actually be pretty good. No notes on this chapter.

Chapter 15: Postmortem culture: learning from failure

I'm in strong agreement with most of this chapter. No notes.

Chapter 16: Tracking outages

  • Escalator: centralized system that tracks ACKs to alerts, notifies other people if necessary, etc.
  • Outalator: gives time-interleaved view of notifications for multiple queues
    • Also saves related email and allows marking some messages as “important”, can collapse non-important messages, etc.

Our version of Escalator seems fine. We could really use something like Outalator, though.

Chapter 17: Testing for reliability

Preaching to the choir. No notes on this section. We could really do a lot better here, though.

Chapter 18: Software engineering in SRE

  • Ex: Auxon, capacity planning automation tool
  • Background: traditional capacity planning cycle
    • 1) collect demand forecasts (quarters to years in advance)
    • 2) Plan allocations
    • 3) Review plan
    • 4) Deploy and config resources
  • Traditional approach cons
    • Many things can affect plan: increase in efficiency, increase in adoption rate, cluster delivery date slips, etc.
    • Even small changes require rechecking allocation plan
    • Large changes may require total rewrite of plan
    • Labor intensive and error prone
  • Google solution: intent-based capacity planning
    • Specify requirements, not implementation
    • Encode requirements and autogenerate a capacity plan
    • In addition to saving labor, solvers can do better than human generated solutions => cost savings
  • Ladder of examples of increasingly intent based planning
    • 1) Want 50 cores in clusters X, Y, and Z -- why those resources in those clusters?
    • 2) Want 50-core footprint in any 3 clusters in region -- why that many resources and why 3?
    • 3) Want to meet demand with N+2 redundancy -- why N+2?
    • 4) Want 5 9s of reliability. Could find, for example, that N+2 isn’t sufficient
  • Found that greatest gains are from going to (3)
    • Some sophisticated services may go for (4)
  • Putting constraints into tools allows tradeoffs to be consistent across fleet
    • As opposed to making individual ad hoc decisions
  • Auxon inputs
    • Requirements (e.g., “service must be N+2 per continent”, “frontend servers no more than 50ms away from backend servers”)
    • Dependencies
    • Budget priorities
    • Performance data (how a service scales)
    • Demand forecast data (note that services like Colossus have derived forecasts from dependent services)
    • Resource supply & pricing
  • Inputs go into solver (mixed-integer or linear programming solver); see the toy example after this list
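
Here is a toy version of the intent-based approach, expressed as a linear program: buy cores across five clusters at minimum cost such that demand is still met if any two clusters are lost (N+2). The cluster capacities, prices, and demand are invented for illustration; this only shows the shape of the problem, not anything about Auxon's actual formulation.

    from itertools import combinations
    from scipy.optimize import linprog

    demand = 100.0                           # cores of demand to serve
    price = [1.0, 1.1, 0.9, 1.3, 1.0]        # relative cost per core, per cluster
    cap = [80.0, 80.0, 60.0, 90.0, 70.0]     # cores available per cluster
    n = len(price)

    # N+2: for every pair of clusters (i, j) that might be lost, the remaining
    # clusters must still cover demand. As a <= constraint for linprog:
    #   -sum_{k not in {i, j}} x_k <= -demand
    A_ub, b_ub = [], []
    for i, j in combinations(range(n), 2):
        row = [-1.0] * n
        row[i] = row[j] = 0.0
        A_ub.append(row)
        b_ub.append(-demand)

    res = linprog(c=price, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, c) for c in cap], method="highs")
    print(res.x, res.fun)   # cores to buy per cluster, total relative cost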

No notes on why SRE software, how to spin up a group, etc. TODO: re-read back half of this chapter and take notes if it’s ever directly relevant for me.

Chapter 19: Load balancing at the frontend

No notes on this section. Seems pretty similar to what we have in terms of high-level goals, and the chapter doesn’t go into low-level details. It’s notable that they do [redacted] differently from us, though. For more info on lower-level details, there’s the Maglev paper.

Chapter 20: Load balancing in the datacenter

  • Flow control
  • Need to avoid unhealthy tasks
  • Naive flow control for unhealthy tasks
    • Track number of requests to a backend
    • Treat backend as unhealthy when threshold is reached
    • Cons: generally terrible
  • Health-based flow control
    • Backend task can be in one of three states: {healthy, refusing connections, lame duck}
    • Lame duck state can still take connections, but sends backpressure request to all clients
    • Lame duck state simplifies clean shutdown
  • Def: subsetting: limiting pool of backend tasks that a client task can interact with
    • Clients in RPC system maintain pool of connections to backends
    • Using pool reduces latency compared to doing setup/teardown when needed
    • Inactive connections are relatively cheap, but not free, even in “inactive” mode (reduced health checks, UDP instead of TCP, etc.)
  • Choosing the correct subset
    • Typ: 20-100, chosen based on workload
  • Subset selection: random
    • Bad utilization
  • Subset selection: round robin
    • Order is permuted; each round has its own permutation (sketched after this list)
  • Load balancing
    • Subset selection is for connection balancing, but we still need to balance load
  • Load balancing: round robin
    • In practice, observe 2x difference between most loaded and least loaded
    • In practice, most expensive request can be 1000x more expensive than cheapest request
    • In addition, there’s random unpredictable variation in requests
  • Load balancing: least-loaded round robin
    • Exactly what it sounds like: round-robin among least loaded backends
    • Load appears to be measured in terms of connection count; may not always be the best metric
    • This is per client, not globally, so it’s possible to send requests to a backend with many requests from other clients
    • In practice, for large services, find that most-loaded task uses twice as much CPU as least-loaded; similar to normal round robin
  • Load balancing: weighted round robin
    • Same as above, but weight with other factors
    • In practice, much better load distribution than least-loaded round robin
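
A sketch of the permuted-round subsetting idea described above (a simplified illustration, not Google's exact implementation): clients are grouped into rounds, each round gets its own shuffle of the backends, and a client's position within its round picks its slice.

    import random

    def subset(backends, client_id, subset_size):
        subset_count = len(backends) // subset_size
        # All clients in the same round see the same permutation.
        round_number = client_id // subset_count
        rng = random.Random(round_number)
        shuffled = list(backends)
        rng.shuffle(shuffled)
        # The client's position within the round selects its subset.
        subset_id = client_id % subset_count
        start = subset_id * subset_size
        return shuffled[start:start + subset_size]

    backends = [f"task-{i}" for i in range(300)]
    print(subset(backends, client_id=42, subset_size=30))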

I wonder what Heroku meant when they responded to Rap Genius by saying “after extensive research and experimentation, we have yet to find either a theoretical model or a practical implementation that beats the simplicity and robustness of random routing to web backends that can support multiple concurrent connections”.

Chapter 21: Handling overload

  • Even with “good” load balancing, systems will become overloaded
  • Typical strategy is to serve degraded responses, but under very high load that may not be possible
  • Modeling capacity as QPS or as a function of requests (e.g., how many keys the requests read) is failure prone
    • These generally change slowly, but can change rapidly (e.g., because of a single checkin)
  • Better solution: measure directly available resources
  • CPU utilization is usually a good signal for provisioning
    • With GC, memory pressure turns into CPU utilization
    • With other systems, can provision other resources such that CPU is likely to be limiting factor
    • In cases where over-provisioning CPU is too expensive, take other resources into account

How much does it cost to generally over-provision CPU like that?

  • Client-side throttling
    • Backends start rejecting requests when customer hits quota
    • Requests still use resources, even when rejected -- without throttling, backends can spend most of their resources on rejecting requests
  • Criticality
    • Seems to be priority but with a different name?
    • First-class notion in RPC system
    • Client-side throttling keeps separate stats for each level of criticality
    • By default, criticality is propagated through subsequent RPCs
  • Handling overloaded errors
    • Shed load to other DCs if DC is overloaded
    • Shed load to other backends if DC is ok but some backends are overloaded
  • Clients retry when they get an overloaded response
    • Per-request retry budget (3)
    • Per-client retry budget (10%)
    • Failed retries from client cause “overloaded; don’t retry” response to be returned upstream (see the sketch below)
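
A minimal sketch of the per-request and per-client retry budgets described above; the class, the send() callable, and the parameter names are illustrative assumptions, not from the book.

    class RetryBudget:
        """At most `per_request_attempts` tries per request, and retries
        capped at `per_client_ratio` of the client's overall request volume."""

        def __init__(self, per_request_attempts=3, per_client_ratio=0.10):
            self.per_request_attempts = per_request_attempts
            self.per_client_ratio = per_client_ratio
            self.requests = 0
            self.retries = 0

        def call(self, send):
            self.requests += 1
            for attempt in range(self.per_request_attempts):
                ok, dont_retry = send()      # hypothetical backend call
                if ok:
                    return "ok"
                if dont_retry:               # backend said "overloaded; don't retry"
                    return "overloaded; don't retry"
                if self.retries >= self.per_client_ratio * self.requests:
                    return "overloaded; don't retry"   # client-wide budget spent
                self.retries += 1
            # All attempts failed: tell callers upstream not to retry further.
            return "overloaded; don't retry"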

Having a “don’t retry” response is “obvious”, but relatively rare in practice. A lot of real systems have a problem with failed retries causing more retries up the stack. This is especially true when crossing a hardware/software boundary (e.g., filesystem read causes many retries on DVD/SSD/spinning disk, fails, and then gets retried at the filesystem level), but seems to be generally true in pure software too.

Chapter 22: Addressing cascading failures

  • Typical failure scenarios?
  • Server overload
  • Ex: have two servers
    • One gets overloaded, failing
    • Other one now gets all traffic and also fails
  • Resource exhaustion
    • CPU/memory/threads/file descriptors/etc.
  • Ex: dependencies among resources
    • 1) Java frontend has poorly tuned GC params
    • 2) Frontend runs out of CPU due to GC
    • 3) CPU exhaustion slows down requests
    • 4) Increased queue depth uses more RAM
    • 5) Fixed memory allocation for entire frontend means that less memory is available for caching
    • 6) Lower hit rate
    • 7) More requests into backend
    • 8) Backend runs out of CPU or threads
    • 9) Health checks fail, starting cascading failure
    • Difficult to determine cause during outage
  • Note: policies that avoid servers that serve errors can make things worse
    • fewer backends available, which get too many requests, which then become unavailable
  • Preventing server overload
    • Load test! Must have realistic environment
    • Serve degraded results
    • Fail cheaply and early when overloaded
    • Have higher-level systems reject requests (at reverse proxy, load balancer, and on task level)
    • Perform capacity planning
  • Queue management
    • Queues do nothing in steady state
    • Queued reqs consume memory and increase latency
    • If traffic is steady-ish, better to keep small queue size (say, 50% or less of thread pool size)
    • Ex: Gmail uses queueless servers with failover when threads are full
    • For bursty workloads, queue size should be function of #threads, time per req, size/freq of bursts
    • See also, adaptive LIFO and CoDel
  • Graceful degradation
    • Note that it’s important to test graceful degradation path, maybe by running a small set of servers near overload regularly, since this path is rarely exercised under normal circumstances
    • Best to keep simple and easy to understand
  • Retries
    • Always use randomized exponential backoff
    • See previous chapter on only retrying at a single level
    • Consider having a server-wide retry budget
  • Deadlines
    • Don’t do work where deadline has been missed (common theme for cascading failure)
    • At each stage, check that deadline hasn’t been hit
    • Deadlines should be propagated (e.g., even through RPCs)
  • Bimodal latency
    • Ex: problem with long deadline
    • Say frontend has 10 servers, 100 threads each (1k threads of total capacity)
    • Normal operation: 1k QPS, reqs take 100ms => 100 worker threads occupied (1k QPS * .1s)
    • Say 5% of operations don’t complete and there’s a 100s deadline
    • That consumes 5k threads (50 QPS * 100s)
    • Frontend oversubscribed by 5x. Success rate = 1k / (5k + 95) = 19.6% => 80.4% error rate (worked through in the sketch below)
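
The thread-exhaustion arithmetic from the example above, spelled out:

    threads = 10 * 100              # 10 frontends x 100 worker threads
    qps, latency = 1000, 0.1        # normal requests: 1k QPS at 100ms
    stuck_fraction, deadline = 0.05, 100.0

    threads_for_ok = (1 - stuck_fraction) * qps * latency    # 95 threads
    threads_for_stuck = stuck_fraction * qps * deadline      # 5,000 threads
    success_rate = threads / (threads_for_ok + threads_for_stuck)
    print(f"{success_rate:.1%} success => {1 - success_rate:.1%} errors")
    # 19.6% success => 80.4% errors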

Using deadlines instead of timeouts is great. We should really be more systematic about this.

Not allowing systems to fill up with pointless zombie requests by setting reasonable deadlines is “obvious”, but a lot of real systems seem to have arbitrary timeouts at nice round human numbers (30s, 60s, 100s, etc.) instead of deadlines that are assigned with load/cascading failures in mind.
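
As a sketch of what being systematic might look like: randomized exponential backoff bounded by a propagated deadline. The `send_request` callable and the parameter values are illustrative assumptions, not anything from the book.

    import random
    import time

    def call_with_backoff(send_request, deadline, base=0.05, cap=5.0, attempts=3):
        """Retry with randomized exponential backoff, never past `deadline`
        (an absolute time taken from time.monotonic())."""
        for attempt in range(attempts):
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise TimeoutError("deadline exceeded; not sending request")
            try:
                # Pass the remaining budget downstream so later stages can
                # also drop work whose deadline has already passed.
                return send_request(timeout=remaining)
            except ConnectionError:
                if attempt == attempts - 1:
                    raise
                sleep = random.uniform(0, min(cap, base * 2 ** attempt))
                if time.monotonic() + sleep >= deadline:
                    raise TimeoutError("deadline would pass during backoff")
                time.sleep(sleep)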

  • Try to avoid intra-layer communication
    • Simpler, avoids possible cascading failure paths
  • Testing for cascading failures
    • Load test components!
    • Load testing both reveals the breaking point and ferrets out components that will totally fall over under load
    • Make sure to test each component separately
    • Test non-critical backends (e.g., make sure that spelling suggestions for search don’t impede the critical path)
  • Immediate steps to address cascading failures
    • Increase resources
    • Temporarily stop health check failures/deaths
    • Restart servers (only if that would help -- e.g., in GC death spiral or deadlock)
    • Drop traffic -- drastic, last resort
    • Enter degraded mode -- requires having built this into service previously
    • Eliminate batch load
    • Eliminate bad traffic

Chapter 23: Distributed consensus for reliability

  • How do we agree on questions like…
    • Which process is the leader of a group of processes?
    • What is the set of processes in a group?
    • Has a message been successfully committed to a distributed queue?
    • Does a process hold a particular lease?
    • What’s the value in a datastore for a particular key?
  • Ex1: split-brain
    • Service has replicated file servers in different racks
    • Must avoid writing simultaneously to both file servers in a set to avoid data corruption
    • Each pair of file servers has one leader & one follower
    • Servers monitor each other via heartbeats
    • If one server can’t contact the other, it sends a STONITH (shoot the other node in the head)
    • But what happens if the network is slow or packets get dropped?
    • What happens if both servers issue STONITH?

This reminds me of one of my favorite distributed database postmortems. The database is configured as a ring, where each node talks to and replicates data into a “neighborhood” of 5 servers. If some machines in the neighborhood go down, other servers join the neighborhood and data gets replicated appropriately.

Sounds good, but in the case where a server goes bad and decides that no data exists and all of its neighbors are bad, it can return results faster than any of its neighbors, as well as tell its neighbors that they’re all bad. Because the bad server has no data it’s very fast and can report that its neighbors are bad faster than its neighbors can report that it’s bad. Whoops!

  • Ex2: failover requires human intervention
    • A highly sharded DB has a primary for each shard, which replicates to a secondary in another DC
    • External health checks decide if the primary should failover to its secondary
    • If the primary can’t see the secondary, it makes itself unavailable to avoid the problems from “Ex1”
    • This increases operational load
    • Problems are correlated and this is relatively likely to run into problems when people are busy with other issues
    • If there’s a network issue, there’s no reason to think that a human will have a better view into the state of the world than machines in the system
  • Ex3: faulty group-membership algorithms
    • What it sounds like. No notes on this part
  • Impossibility results
    • CAP: partitions are unavoidable in real networks, so choose C or A
    • FLP: async distributed consensus can’t guarantee progress with an unreliable network

Paxos

  • Sequence of proposals, which may or may not be accepted by the majority of processes
    • Not accepted => fails
    • Sequence number per proposal, must be unique across system
  • Proposal
    • Proposer sends seq number to acceptors
    • Acceptor agrees if it hasn’t seen a higher seq number
    • Proposers can try again with higher seq number
    • If proposer recvs agreement from majority, it commits by sending commit message with value
    • Acceptors must journal to persistent storage when they accept (a minimal acceptor sketch follows)
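
A heavily simplified sketch of the acceptor-side logic described above; real Paxos also needs majorities, value selection from prior accepts, and durable journaling before every reply.

    class Acceptor:
        def __init__(self):
            self.promised = -1          # highest proposal number promised
            self.accepted_n = -1        # highest proposal number accepted
            self.accepted_value = None

        def prepare(self, n):
            # Agree only if we haven't already promised a higher number.
            if n > self.promised:
                self.promised = n
                return ("promise", self.accepted_n, self.accepted_value)
            return ("reject", None, None)

        def accept(self, n, value):
            # Accept unless we've promised a higher-numbered proposal.
            if n >= self.promised:
                self.promised = self.accepted_n = n
                self.accepted_value = value
                # A real acceptor must journal this state to persistent
                # storage before replying.
                return "accepted"
            return "rejected"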

Patterns

  • Distributed consensus algorithms are a low-level primitive
  • Reliable replicated state machines
  • Reliable replicated data and config stores
    • Non distributed-consensus-based systems often use timestamps: problematic because clock synchrony can't be guaranteed
    • See Spanner paper for an example of using distributed consensus
  • Leader election
    • Equivalent to distributed consensus
    • Where the work of the leader can be performed by one process or sharded, the leader election pattern allows writing a distributed system as if it were a simple program
    • Used by, for example, GFS and Colossus
  • Distributed coordination and locking services
    • Barrier used, for example, in MapReduce to make sure that Map is finished before Reduce proceeds
  • Distributed queues and messaging
    • Queues: can tolerate failures from worker nodes, but system needs to ensure that claimed tasks are processed
    • Can use leases instead of removal from queue
    • Using RSM means that system can continue processing even when queue goes down
  • Performance
    • Conventional wisdom that consensus algorithms can't be used for high-throughput low-latency systems is false
    • Distributed consensus at the core of many Google systems
    • Scale makes this worse for Google than most other companies, but it still works
  • Multi-Paxos
    • Strong leader process: unless a leader has not yet been elected or a failure occurs, only one round trip required to reach consensus
    • Note that another process in the group can propose at any time
    • Can ping pong back and forth and pseudo-livelock
    • Not unique to Multi-Paxos
    • Standard solutions are to elect a proposer process or use rotating proposer
  • Scaling read-heavy workloads
    • Ex: Photon allows reads from any replica
    • Read from stale replica requires extra work, but doesn't produce incorrect results
    • To guarantee reads are up to date, do one of the following:
    • 1) Perform a read-only consensus operation
    • 2) Read data from replica that's guaranteed to be most-up-to-date (stable leader can provide this guarantee)
    • 3) Use quorum leases
  • Quorum leases
    • Replicas can be granted lease over some (or all) data in the system
  • Fast Paxos
    • Designed to be faster over WAN
    • Each client can send Propose to each member of a group of acceptors directly, instead of through a leader
    • Not necessarily faster than classic Paxos -- if RTT to acceptors is long, we've traded one message across a slow link (plus N in parallel across fast links) for N messages across the slow link
  • Stable leaders
    • "Almost all distributed consensus systems that have been designed with performance in mind use either the single stable leader pattern or a system of rotating leadership"

TODO: finish this chapter?

Chapter 24: Distributed cron

TODO: go back and read in more detail, take notes.

Chapter 25: Data processing pipelines

  • Examples of this are MapReduce or Flume
  • Convenient and easy to reason about the happy case, but fragile
    • Initial install is usually ok because worker sizing, chunking, parameters are carefully tuned
    • Over time, load changes, causes problems

Chapter 26: Data integrity

  • Definition not necessarily obvious
    • If an interface bug causes Gmail to fail to display messages, that’s the same as the data being gone from the user’s standpoint
    • 99.99% uptime means 1 hour of downtime per year. Probably ok for most apps
    • 99.99% good bytes in a 2GB file means 200K corrupt. Probably not ok for most apps
  • Backup is non-trivial
    • May have mixture of transactional and non-transactional backup and restore
    • Different versions of business logic might be live at once
    • If services are independently versioned, maybe have many combinations of versions
    • Replicas aren’t sufficient -- replicas may sync corruption
  • Study of 19 data recovery efforts at Google
    • Most common user-visible data loss caused by deletion or loss of referential integrity due to software bugs
    • Hardest cases were low-grade corruption discovered weeks to months later

Defense in depth

  • First layer: soft deletion
    • Users should be able to delete their data
    • But that means that users will be able to accidentally delete their data
    • Also, account hijacking, etc.
    • Accidental deletion can also happen due to bugs
    • Soft deletion delays actual deletion for some period of time (a minimal sketch follows this list)
  • Second layer: backups
    • Need to figure out how much data it’s ok to lose during recovery, how long recovery can take, and how far back backups need to go
    • Want backups to go back forever, since corruption can go unnoticed for months (or longer)
    • But changes to code and schema can make recovery of older backups expensive
    • Google usually has 30 to 90 day window, depending on the service
  • Third layer: early detection
    • Out-of-band integrity checks
    • Hard to do this right!
    • Correct changes can cause checkers to fail
    • But loosening checks can cause failures to get missed
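
A minimal sketch of the soft-deletion layer mentioned at the start of this list; the 60-day purge delay is an arbitrary illustrative value, not a number from the book.

    import datetime

    PURGE_DELAY = datetime.timedelta(days=60)    # illustrative value

    def soft_delete(record, now=None):
        # Mark the record deleted but keep the data, so accidental (or
        # malicious) deletions can be reversed for a while.
        record["deleted_at"] = now or datetime.datetime.now(datetime.timezone.utc)

    def purge_eligible(records, now=None):
        # Only records whose soft-deletion window has expired may be destroyed.
        now = now or datetime.datetime.now(datetime.timezone.utc)
        return [r for r in records
                if r.get("deleted_at") and now - r["deleted_at"] > PURGE_DELAY]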

No notes on the two interesting case studies covered.

Chapter 27: Reliable product launches at scale

No notes on this chapter in particular. A lot of this material is covered by or at least implied by material in other chapters. Probably worth at least looking at example checklist items and action items before thinking about launch strategy, though. Also see appendix E, launch coordination checklist.

Chapters 28-32: Various chapters on management

No notes on these.

Notes on the notes

I like this book a lot. If you care about building reliable systems, reading through this book and seeing what the teams around you don’t do seems like a good exercise. That being said, the book isn't perfect. The two big downsides for me stem from the same issue: this is one of those books that's a collection of chapters by different people. Some of the editors are better than others, so some of the chapters are clearer than others, and because the chapters seem designed to be readable as standalone chapters, there's a fair amount of redundancy if you read the book straight through. Depending on how you plan to use the book, that can be a positive, but it's a negative to me. Even including the downsides, I'd say that this is the most valuable technical book I've read in the past year, and I've covered probably 20% of the content in this set of notes. If you really like these notes, you'll probably want to read the full book.

If you found this set of notes way too dry, maybe try this much more entertaining set of notes on a totally different book. If you found this to only be slightly too dry, maybe try this set of notes on classes of errors commonly seen in postmortems. In any case, I’d appreciate feedback on these notes. Writing up notes is an experiment for me. If people find these useful, I'll try to write up notes on books I read more often. If not, I might try a different approach to writing up notes or some other kind of post entirely.

We only hire the trendiest

2016-03-21 15:23:44

An acquaintance of mine, let’s call him Mike, is looking for work after getting laid off from a contract role at Microsoft, which has happened to a lot of people I know. Like me, Mike has 11 years in industry. Unlike me, he doesn't know a lot of folks at trendy companies, so I passed his resume around to some engineers I know at companies that are desperately hiring. My engineering friends thought Mike's resume was fine, but most recruiters rejected him in the resume screening phase.

When I asked why he was getting rejected, the typical response I got was:

  1. Tech experience is in irrelevant tech
  2. "Experience is too random, with payments, mobile, data analytics, and UX."
  3. Contractors are generally not the strongest technically

This response is something from a recruiter that was relayed to me through an engineer; the engineer was incredulous at the response from the recruiter. Just so we have a name, let's call this company TrendCo. It's one of the thousands of companies that claims to have world class engineers, hire only the best, etc. This is one company in particular, but it's representative of a large class of companies and the responses Mike has gotten.

Anyway, (1) is code for “Mike's a .NET dev, and we don't like people with Windows experience”.

I'm familiar with TrendCo's tech stack, which multiple employees have told me is “a tire fire”. Their core systems top out under 1k QPS, which has caused them to go down under load. Mike has worked on systems that can handle multiple orders of magnitude more load, but his experience is, apparently, irrelevant.

(2) is hard to make sense of. I've interviewed at TrendCo and one of the selling points is that it's a startup where you get to do a lot of different things. TrendCo almost exclusively hires generalists but Mike is, apparently, too general for them.

(3), combined with (1), gets at what TrendCo's real complaint with Mike is. He's not their type. TrendCo's median employee is a recent graduate from one of maybe five “top” schools with 0-2 years of experience. They have a few experienced hires, but not many, and most of their experienced hires have something trendy on their resume, not a boring old company like Microsoft.

Whether or not you think there's anything wrong with having a type and rejecting people who aren't your type, as Thomas Ptacek has observed, if your type is the same type everyone else is competing for, “you are competing for talent with the wealthiest (or most overfunded) tech companies in the market”.

If you look at new grad hiring data, it looks like FB is offering people with zero experience > $100k/yr salary, a $100k signing bonus, and $150k in RSUs, for an amortized total comp > $160k/yr, and about $240k in the first year. Google's package has > $100k salary, a variable signing bonus in the $10k range, and $187k in RSUs. That comes in a bit lower than FB, but it's much higher than most companies that claim to only hire the best are willing to pay for a new grad. Keep in mind that compensation can go much higher for contested candidates, and that compensation for experienced candidates is probably higher than you expect if you're not a hiring manager who's seen what competitive offers look like today.

By going after people with the most sought after qualifications, TrendCo has narrowed their options down to either paying out the nose for employees, or offering non-competitive compensation packages. TrendCo has chosen the latter option, which partially explains why they have, proportionally, so few senior devs -- the compensation delta increases as you get more senior, and you have to make a really compelling pitch to someone to get them to choose TrendCo when you're offering $150k/yr less than the competition. And as people get more experience, they're less likely to believe the part of the pitch that explains how much the stock options are worth.

Just to be clear, I don't have anything against people with trendy backgrounds. I know a lot of these people who have impeccable interviewing skills and got 5-10 strong offers last time they looked for work. I've worked with someone like that: he was just out of school, his total comp package was north of $200k/yr, and he was worth every penny. But think about that for a minute. He had strong offers from six different companies, of which he was going to accept at most one. Including lunch and phone screens, the companies put in an average of eight hours apiece interviewing him. And because they wanted to hire him so much, the companies that were really serious spent an average of another five hours apiece of engineer time trying to convince him to take their offer. Because these companies had, on average, a ⅙ chance of hiring this person, they have to spend at least an expected (8+5) * 6 = 78 hours of engineer time1. People with great backgrounds are, on average, pretty great, but they're really hard to hire. It's much easier to hire people who are underrated, especially if you're not paying market rates.

I've seen this hyperfocus on hiring people with trendy backgrounds from both sides of the table, and it's ridiculous from both sides.

On the referring side of hiring, I tried to get a startup I was at to hire the most interesting and creative programmer I've ever met, who was tragically underemployed for years because of his low GPA in college. We declined to hire him and I was told that his low GPA meant that he couldn't be very smart. Years later, Google took a chance on him and he's been killing it since then. He actually convinced me to join Google, and at Google, I tried to hire one of the most productive programmers I know, who was promptly rejected by a recruiter for not being technical enough.

On the candidate side of hiring, I've experienced both being in demand and being almost unhireable. Because I did my undergrad at Wisconsin, which is one of the 25 schools that claims to be a top 10 cs/engineering school, I had recruiters beating down my door when I graduated. But that's silly -- that I attended Wisconsin wasn't anything about me; I just happened to grow up in the state of Wisconsin. If I grew up in Utah, I probably would have ended up going to school at Utah. When I've compared notes with folks who attended schools like Utah and Boise State, their education is basically the same as mine. Wisconsin's rank as an engineering school comes from having professors who do great research which is, at best, weakly correlated to effectiveness at actually teaching undergrads. Despite getting the same engineering education you could get at hundreds of other schools, I had a very easy time getting interviews and finding a great job.

I spent 7.5 years in that great job, at Centaur. Centaur has a pretty strong reputation among hardware companies in Austin who've been around for a while, and I had an easy time shopping for local jobs at hardware companies. But I don't know of any software folks who've heard of Centaur, and as a result I couldn't get an interview at most software companies. There were even a couple of cases where I had really strong internal referrals and the recruiters still didn't want to talk to me, which I found funny and my friends found frustrating.

When I could get interviews, they often went poorly. A typical rejection reason was something like “we process millions of transactions per day here and we really need someone with more relevant experience who can handle these things without ramping up”. And then Google took a chance on me and I was the second person on a project to get serious about deep learning performance, which was a 20%-time project until just before I joined. We built the fastest deep learning system in the world. From what I hear, they're now on the Nth generation of that project, but even the first generation thing we built had better per-rack performance and performance per dollar than any other production system out there for years (excluding follow-ons to that project, of course).

While I was at Google I had recruiters pinging me about job opportunities all the time. And now that I'm at boring old Microsoft, I don't get nearly as many recruiters reaching out to me. I've been considering looking for work2 and I wonder how trendy I'll be if I do. Experience in irrelevant tech? Check! Random experience? Check! Contractor? Well, no. But two out of three ain't bad.

My point here isn't anything about me. It's that here's this person3 who has wildly different levels of attractiveness to employers at various times, mostly due to superficial factors that don't have much to do with actual productivity. This is a really common story among people who end up at Google. If you hired them before they worked at Google, you might have gotten a great deal! But no one (except Google) was willing to take that chance. There's something to be said for paying more to get a known quantity, but a company like TrendCo that isn't willing to do that cripples its hiring pipeline by only going after people with trendy resumes, and if you wouldn't hire someone before they worked at Google and would after, the main thing you know is that the person is above average at whiteboard algorithms quizzes (or got lucky one day).

I don't mean to pick on startups like TrendCo in particular. Boring old companies have their version of what a trendy background is, too. A friend of mine who's desperate to hire can't do anything with some of the resumes I pass his way because his group isn't allowed to hire anyone without a degree. Another person I know is in a similar situation because his group has a bright-line rule that causes them to reject people who aren't already employed.

Not only are these decisions non-optimal for companies, they create a path dependence in employment outcomes that causes individual good (or bad) events to follow people around for decades. You can see similar effects in the literature on career earnings in a variety of fields4.

Thomas Ptacek has this great line about how “we interview people whose only prior work experience is "Line of Business .NET Developer", and they end up showing us how to write exploits for elliptic curve partial nonce bias attacks that involve Fourier transforms and BKZ lattice reduction steps that take 6 hours to run.” If you work at a company that doesn't reject people out of hand for not being trendy, you'll hear lots of stories like this. Some of the best people I've worked with went to schools you've never heard of and worked at companies you've never heard of until they ended up at Google. Some are still at companies you've never heard of.

If you read Zach Holman, you may recall that when he said that he was fired, someone responded with “If an employer has decided to fire you, then you've not only failed at your job, you've failed as a human being.” A lot of people treat employment status and credentials as measures of the inherent worth of individuals. But a large component of these markers of success, not to mention success itself, is luck.

Solutions?

I can understand why this happens. At an individual level, we're prone to the fundamental attribution error. At an organizational level, fast growing organizations burn a large fraction of their time on interviews, and the obvious way to cut down on time spent interviewing is to only interview people with "good" qualifications. Unfortunately, that's counterproductive when you're chasing after the same tiny pool of people as everyone else.

Here are the beginnings of some ideas. I'm open to better suggestions!

Moneyball

Billy Beane and Paul Depodesta took the Oakland A's, a baseball franchise with nowhere near the budget of top teams, and created what was arguably the best team in baseball by finding and “hiring” players who were statistically underrated for their price. The thing I find really amazing about this is that they publicly talked about doing this, and then Michael Lewis wrote a book, titled Moneyball, about them doing this. Despite the publicity, it took years for enough competitors to catch on enough that the A's strategy stopped giving them a very large edge.

You can see the exact same thing in software hiring. Thomas Ptacek has been talking about how they hired unusually effective people at Matasano for at least half a decade, maybe more. Google bigwigs regularly talk about the hiring data they have and what hasn't worked. I believe that, years ago, they talked about how focusing on top schools wasn't effective and didn't turn up employees with better performance, but that doesn't stop TrendCo from focusing hiring efforts on top schools.

Training / mentorship

You see a lot of talk about moneyball, but for some reason people are less excited about… trainingball? Practiceball? Whatever you want to call taking people who aren't “the best” and teaching them how to be “the best”.

This is another one where it's easy to see the impact through the lens of sports, because there is so much good performance data. Since it's basketball season, if we look at college basketball, for example, we can identify a handful of programs that regularly take unremarkable inputs and produce good outputs. And that's against a field of competitors where every team is expected to coach and train their players.

When it comes to tech companies, most of the competition isn't even trying. At the median large company, you get a couple days of “orientation”, which is mostly legal mumbo jumbo and paperwork, and the occasional “training”, which is usually a set of videos and a set of multiple-choice questions that are offered up for compliance reasons, not to teach anyone anything. And you'll be assigned a mentor who, more likely than not, won't provide any actual mentorship. Startups tend to be even worse! It's not hard to do better than that.

Considering how much money companies spend on hiring and retaining "the best", you'd expect them to spend at least a (non-zero) fraction on training. It's also quite strange that companies don't focus more on training and mentorship when trying to recruit. Specific things I've learned in specific roles have been tremendously valuable to me, but it's almost always either been a happy accident, or something I went out of my way to do. Most companies don't focus on this stuff. Sure, recruiters will tell you that "you'll learn so much more here than at Google, which will make you more valuable", implying that it's worth the $150k/yr pay cut, but if you ask them what, specifically, they do to make a better learning environment than Google, they never have a good answer.

Process / tools / culture

I've worked at two companies that both have effectively infinite resources to spend on tooling. One of them, let's call them ToolCo, is really serious about tooling and invests heavily in tools. People describe tooling there with phrases like “magical”, “the best I've ever seen”, and “I can't believe this is even possible”. And I can see why. For example, if you want to build a project that's millions of lines of code, their build system will make that take somewhere between 5s and 20s (assuming you don't enable LTO or anything else that can't be parallelized)5. In the course of a regular day at work you'll use multiple tools that seem magical because they're so far ahead of what's available in the outside world.

The other company, let's call them ProdCo, pays lip service to tooling, but doesn't really value it. People describing ProdCo tools use phrases like “world class bad software”, “I am 2x less productive than I've ever been anywhere else”, and “I can't believe this is even possible”. ProdCo has a paper on a new build system; their claimed numbers for speedup from parallelization/caching, onboarding time, and reliability are at least two orders of magnitude worse than the equivalent at ToolCo. And, in my experience, the actual numbers are worse than the claims in the paper. In the course of a day of work at ProdCo, you'll use multiple tools that are multiple orders of magnitude worse than the equivalent at ToolCo in multiple dimensions. These kinds of things add up and can easily make a larger difference than “hiring only the best”.

Processes and culture also matter. I once worked on a team that didn't use version control or have a bug tracker. For every no-brainer item on the Joel test, there are teams out there that make the wrong choice.

Although I've only worked on one team that completely failed the Joel test (they scored a 1 out of 12), every team I've worked on has had glaring deficiencies that are technically trivial (but sometimes culturally difficult) to fix. When I was at Google, we had really bad communication problems between the two halves of our team that were in different locations. My fix was brain-dead simple: I started typing up meeting notes for all of our local meetings and discussions and taking questions from the remote team about things that surprised them in our notes. That's something anyone could have done, and it was a huge productivity improvement for the entire team. I've literally never found an environment where you can't massively improve productivity with something that trivial. Sometimes people don't agree (e.g., it took months to get the non-version-control-using-team to use version control), but that's a topic for another post.

Programmers are woefully underutilized at most companies. What's the point of hiring "the best" and then crippling them? You can get better results by hiring undistinguished folks and setting them up for success, and it's a lot cheaper.

Conclusion

When I started programming, I heard a lot about how programmers are down to earth, not like those elitist folks who have uniforms involving suits and ties. You can even wear t-shirts to work! But if you think programmers aren't elitist, try wearing a suit and tie to an interview sometime. You'll have to go above and beyond to prove that you're not a bad cultural fit. We like to think that we're different from all those industries that judge people based on appearance, but we do the same thing, only instead of saying that people are a bad fit because they don't wear ties, we say they're a bad fit because they do, and instead of saying people aren't smart enough because they don't have the right pedigree… wait, that's exactly the same.

See also: developer hiring and the market for lemons

Thanks to Kelley Eskridge, Laura Lindzey, John Hergenroeder, Kamal Marhubi, Julia Evans, Steven McCarthy, Lindsey Kuper, Leah Hanson, Darius Bacon, Pierre-Yves Baccou, Kyle Littler, Jorge Montero, Sierra Rotimi-Williams, and Mark Dominus for discussion/comments/corrections.


  1. This estimate is conservative. The math only works out to 78 hours if you assume that you never incorrectly reject a trendy candidate and that you don't have to interview candidates that you “correctly” fail to find good candidates. If you add in the extra time for those, the number becomes a lot larger. And if you're TrendCo, and you won't give senior ICs $200k/yr, let alone new grads, you probably need to multiply that number by at least a factor of 10 to account for the reduced probability that someone who's in high demand is going to take a huge pay cut to work for you.

    By the way, if you do some similar math you can see that the “no false positives” thing people talk about is bogus. The only way to reduce the risk of a false positive to zero is to not hire anyone. If you hire anyone, you're trading off the cost of firing a bad hire vs. the cost of spending engineering hours interviewing.

    [return]
  2. I consider this to generally be a good practice, at least for folks like me who are relatively early in their careers. It's good to know what your options are, even if you don't exercise them. When I was at Centaur, I did a round of interviews about once a year and those interviews made it very clear that I was lucky to be at Centaur. I got a lot more responsibility and a wider variety of work than I could have gotten elsewhere, I didn't have to deal with as much nonsense, and I was pretty well paid. I still did the occasional interview, though, and you should too! If you're worried about wasting the time of the hiring company, when I was interviewing speculatively, I always made it very clear that I was happy in my job and unlikely to change jobs, and most companies are fine with that and still wanted to go through with interviewing. [return]
  3. It's really not about me in particular. At the same time I couldn't get any company to talk to me, a friend of mine who's a much better programmer than me spent six months looking for work full time. He eventually got a job at Cloudflare, was half of the team that wrote their DNS, and is now one of the world's experts on DDoS mitigation for companies that don't have infinite resources. That guy wasn't even a networking person before he joined Cloudflare. He's a brilliant generalist who's created everything from a widely used JavaScript library to one of the coolest toy systems projects I've ever seen. He probably could have picked up whatever problem domain you're struggling with and knocked it out of the park. Oh, and between the blog posts he writes and the talks he gives, he's one of Cloudflare's most effective recruiters.

    Or Aphyr, one of the world's most respected distributed systems verification engineers, who failed to get responses to any of his job applications when he graduated from college less than a decade ago.

    [return]
  4. I'm not going to do a literature review because there are just so many studies that link career earnings to external shocks, but I'll cite a result that I found to be interesting, Lisa Kahn's 2010 Labour Economics paper.

    There have been a lot of studies that show, for some particular negative shock (like a recession), graduating into the negative shock reduces lifetime earnings. But most of those studies show that, over time, the effect gets smaller. When Kahn looked at national unemployment as a proxy for the state of the economy, she found the same thing. But when Kahn looked at state level unemployment, she found that the effect actually compounded over time.

    The overall evidence on what happens in the long run is equivocal. If you dig around, you'll find studies where earnings normalize after “only” 15 years, causing a large but effectively one-off loss in earnings, and studies where the effect gets worse over time. The results are mostly technically not contradictory because they look at different causes of economic distress when people get their first job, and it's possible that the differences in results are because the different circumstances don't generalize. But the “good” result is that it takes 15 years for earnings to normalize after a single bad setback. Even a very optimistic reading of the literature reveals that external events can and do have very large effects on people's careers. And if you want an estimate of the bound on the "bad" case, check out, for example, the Guiso, Sapienza, and Zingales paper that claims to link the productivity of a city today to whether or not that city had a bishop in the year 1000.

    [return]
  5. During orientation, the back end of the build system was down so I tried building one of the starter tutorials on my local machine. I gave up after an hour when the build was 2% complete. I know someone who tried to build a real, large scale, production codebase on their local machine over a long weekend, and it was nowhere near done when they got back. [return]

Harry Potter and the Methods of Rationality review by su3su2u1

2016-03-01 08:00:00

These are archived from the now defunct su3su2u1 tumblr. Since there was some controversy over su3su2u1's identity, I'll note that I am not su3su2u1 and that hosting this material is neither an endorsement nor a sign of agreement.

Harry Potter and the Methods of Rationality full review

I opened up a bottle of delicious older-than-me scotch when Terry Pratchett died, and I’ve been enjoying it for much of this afternoon, so this will probably be a mess that gets cleaned up later.

Out of 5 stars, I’d give HPMOR a 1.5. Now, on to the review (this is almost certainly going to be long).

The good

HPMOR contains some legitimately clever reworkings of the canon books to fit with Yudkowsky’s modified world:

A few examples: in HPMOR, the “interdict of Merlin” prevents wizards from writing down powerful spells, so Slytherin put the Basilisk in the Chamber of Secrets to pass on his magical lore. The prophecy “the dark lord will mark him as his own” was met when Voldemort gave Hariezer the same grade he himself had received.

Yudkowsky is also well read, and the story is peppered with references to legitimately interesting science. If you google and research every reference, you’ll learn a lot. The problem is that most of the in-story references are incorrect, so if you don’t google around you are likely to pick up dozens of incorrect ideas.

The writing style during action scenes is pretty good. It keeps the pace brisk and can be genuinely fun to read.

The bad

Stilted, repetitive writing

A lot of this story involves conversations that read like ham-fisted attempts at manipulation, filled with overly stilted language. Phrases like “Noble and Most Ancient House,” “General of Sunshine,” “General of Chaos,” etc. are peppered in over and over again. It’s just turgid. It smooths out when events are happening, but things are rarely happening.

Bad ideas

HPMOR is full of ideas I find incredibly suspect- the only character trait worth anything in the story (both implicitly and explicitly) is intelligence, and the primary use of intelligence within the story is manipulation. This leads to cloying levels of a sort of nerd elitism. Ron and Hagrid are basically dismissed out of hand in this story (Ron explicitly as being useless, Hagrid implicitly so) because they aren’t intelligent enough, and Hariezer explicitly draws NPC-vs-real-person distinctions.

The world itself is constructed to back up these assertions- nothing in the wizarding world makes much sense, and characters often behave in silly ways (“like NPCs”) to be a foil for Hariezer.

The most ridiculous example of this is that wizarding world justice is based on two cornerstones: politicians decide guilt or innocence for all wizard crimes, and the system of blood debts. All of the former death eaters who were pardoned (for claiming to be imperius cursed) apparently owe a blood debt to Hariezer, and so as far as wizarding justice is concerned he is above the law. He uses this to his advantage at a trial for Hermione.

Bad pedagogy

Hariezer routinely flubs the scientific concepts the reader is supposed to be learning. Almost all of the explicit in-story science references are incorrect, as well as being overly jargon-filled.

Some of this might be on purpose- Hariezer is supposed to be only 11. However, this is terrible pedagogy. The reader’s guide to rationality is completely unreliable. Even weirder, the main antagonist, Voldemort, is also used as author mouthpiece several times. So the pedagogy is wrong at worst, and completely unreliable at best.

And implicitly, the method Hariezer relies on for the majority of his problem solving is Aristotelian science. He looks at things, thinks real hard, and knows the answer. This is horrifyingly bad implicit pedagogy.

Bad plotting

Over the course of the story, Hariezer moves from pro-active to no-active. At the start of the story he has a legitimate positive agenda- he wants to use science to uncover the secrets of magic. As the story develops, however, he completely loses sight of that goal, and he instead becomes just a passenger in the plot- he competes in Quirrell’s games and goes through school like any other student. When Voldemort starts including Hariezer in his plot, Hariezer floats along in a completely reactive way, etc.

Not until Hermione dies, near the end of the story, does Hariezer pick up a positive goal again (ending death) and he does absolutely nothing to achieve it. He floats along reacting to everything, and Voldemort defeats death and revives Hermione with no real input from Hariezer at all.

For a character who is supposed to be full of agency, he spends very little time exercising it in a proactive way.

Nothing has consequences (boring!)

And this brings me to another problem with the plotting- nothing in this story has any consequences. Nothing that goes wrong has any lasting implications for the story at all, which makes all the events at hand ultimately boring. Several examples- early in the story Hariezer uses his time turner to solve even the simplest problems. Snape is asking you questions about potions you don’t know? Time travel. Bullies are stealing a meaningless trinket? Time travel, etc. As a result of these rule violations, his time turner is locked down by Professor McGonagall. Despite this, Hariezer continues to use his time turner to solve all of his problems- the plot introduces another student willing to send a time turner message for a small amount of money via “Slytherin mail”; it’s even totally anonymous.

Another egregious example of this is Quirrell’s battle game- the prize for the battle game is handed out by Quirrell in chapter 35 or so, and there are several more battle games after the prize! The reader knows that it doesn’t matter at all who wins these games- the prize is already awarded! What’s the point? Why would the reader be invested in the proceedings at all?

When Hariezer becomes indebted to Luscious Malfoy, it never constrains him in any way. He goes into debt, Dumbledore tells him it’s bad, he does literally nothing to deal with the problem. Two weeks later, Hermione dies and the debt gets cancelled.

When Hermione DIES, Hariezer does nothing, and a few weeks later Voldemort brings her back. Nothing that happens ever matters.

The closest thing to long term repercussions is Hariezer helping Bellatrix Black escape- but we literally never see Bellatrix after that.

Hariezer never acts positively to fix his problems, he just bounces along whining about how humans need to defeat death until his problems get solved for him.

Mystery, dramatic irony and genre savvy

If you’ve read the canon books, you know at all times what is happening in the story. Voldemort has possessed Quirrell, Hariezer is a horcrux, Quirrell wants the philosopher’s stone, etc. There are bits and pieces that are modified, but the shape of the story is exactly canon. So all the mystery is just dramatic irony.

This is fine, as far as it goes, but there is a huge amount of tension because Hariezer is written as “genre savvy” and occasionally says things like “the hero of story such-and-such would do this” or “I understand mysterious prophecies from books.” The story is poking at cliches that the story wholeheartedly embraces. Supposedly Hariezer has read enough books just like this that dramatic irony like this shouldn’t happen, as the story points out many times- he should be just as informed as the reader. AND YET…

The author is practically screaming “wouldn’t it be lazy if Harry’s dark side were because he is a horcrux?” And yet, Harry’s dark side is because he is a horcrux.

Even worse, the narration of the book takes lots of swipes at the canon plots while “borrowing” the plot of the books.

Huge tension between the themes/lessons and the setting

The major themes of this book are in major conflict with the setting throughout the story.

One major theme is the need for secretive science to hide dangerous secrets- it’s echoed in the way Hariezer builds his “Bayesian conspiracy,” reinforced by Hariezer and Quirrell’s attitudes toward nuclear weapons (and their explicit idea that people smart enough to build atomic weapons wouldn’t use them), and it’s reinforced at the end of the novel when Hariezer’s desire to dissolve some of the secrecy around magic is thwarted by a vow he took to not-end-the-world.

Unfortunately, that same secrecy is portrayed as having stagnated the progress of the wizarding world and prevented magic from spreading. That same secrecy might well be why the wizarding world hasn’t already ended death and made thousands of philosopher’s stones.

Another major theme is fighting death/no-afterlife. But this is a fantasy story with magic. There are ghosts, a gate to the afterlife, a stone to talk to your dead loved ones, etc. The story tries to lampshade it a bit, but that fundamental tension doesn’t go away. Some readers even assumed that Hariezer was simply wrong about an afterlife in the story- because they felt the tension and used my point above (unreliable pedagogy) to put the blame on Hariezer. In the story, the character who actually ended death WAS ALSO THE ANTAGONIST. Hariezer’s attempts are portrayed AS SO DANGEROUS THEY COULD END THE WORLD.

And finally- the major theme of this story is the supremacy of Bayesian reasoning. Unfortunately, as nostalgebraist pointed out explicitly, a world with magic is a world where your non-magic-based Bayesian prior is worthless. Reasoning from that prior time and time again leads to snap conclusions unlikely to be right- and yet in the story it works every time. Once again, the world is fighting the theme of the story in obvious ways.

Let’s talk about Hermione

The most explicitly feminist arc in this story is the arc where Hermione starts SPHEW, a group dedicated to making more wizarding heroines. The group starts out successful, gets in over its head, and Hariezer has to be called in to save the day (with the help of Quirrell).

At the end of the arc, Hariezer and Dumbledore have a long conversation about whether or not they should have let Hermione and friends play their little bully-fighting game- which feels a bit like retroactively removing the characters’ agency. Sure, the women got to play at their fantasy, but only at the whim of the real heroes.

By the end of the story, Hermione is an indestructible part-unicorn/part-troll immortal. And what is she going to do with this power? Become Hariezer’s lab assistant, more or less. Be sent on quests by him. It just feels like Hermione isn’t really allowed to grow into her own agency in a meaningful way.

This isn’t to say that it’s intentional (pretty much the only character with real, proactive agency in this story is Quirrell) - but it does feel like women get the short end of the stick here.

Sanderson’s law of magic

So I’ve never read Sanderson, but someone pointed me to his first law of magic:

Sanderson’s First Law of Magics: An author’s ability to solve conflict with magic is DIRECTLY PROPORTIONAL to how well the reader understands said magic.

The idea here is that if your magic is laid out with clear rules, the author should feel free to solve problems with it- if your magic is mysterious and vague like Gandalf’s, you shouldn’t solve all the conflict with magic, but if you lay out careful rules you can have the characters magic up the occasional solution. I’m not sure I buy into the rule fully, but it does make a good point- if the reader doesn’t understand your magic, the solution might feel like it comes out of nowhere.

Yudkowsky never clearly lays out most of the rules of magic, and yet still solves all his problems via magic (and magic mixed with science). We don’t know how brooms work, but apparently if you strap one to a rocket you can actually steer the rocket, you won’t fall off the thing, and you can go way faster than other broomsticks.

This became especially problematic when he posted his final exam- lots of solutions were floated around, each of which relied on some previously ill-defined aspect of the magic. Yudkowsky’s own solution relied on previously ill-defined transfiguration.

And when he isn’t solving problems like that, he is relying on the time turner over and over again. Swatting flies with flamethrowers, over and over.

Coupled with the world being written as “insane,” it just feels like lazy conflict resolution.

Conclusion

A largely forgettable, overly long nerd power fantasy, with a bit of science (most of it wrong) and a lot of bad ideas. 1.5 stars.

Individual chapter reviews below.

HPMOR 1

While at lunch, I dug into the first chapter of HPMOR. A few notes:

This isn’t nearly as bad as I remember; the writing isn’t amazing, but it’s serviceable. Either some editing has taken place in the last few years, or I’m less discerning than I once was.

There is this strange bit, where Harry tries to defuse an argument his parents are having with:

“"Mum,” Harry said. “If you want to win this argument with Dad, look in chapter two of the first book of the Feynman Lectures on Physics. There’s a quote there about how philosophers say a great deal about what science absolutely requires, and it is all wrong, because the only rule in science is that the final arbiter is observation - that you just have to look at the world and report what you see. ”

This seems especially out of place, because no one is arguing about what science is.

Otherwise, this is basically an ok little chapter. Harry and his father are skeptical that magic could exist, so they send a reply letter to Hogwarts asking for a professor to come and show them some magics.

HPMOR 2: in which I remember why I hated this

This chapter had me rolling my eyes so hard that I now have quite the headache. In this chapter, Professor McGonagall shows up and does some magic, first levitating Harry’s father, and then turning into a cat. Upon seeing the first, Harry drops some Bayes, saying how anticlimactic it was ‘to update on an event of infinitesimal probability’; upon seeing the second, Hariezer Yudotter greets us with this jargon dump:

“You turned into a cat! A SMALL cat! You violated Conservation of Energy! That’s not just an arbitrary rule, it’s implied by the form of the quantum Hamiltonian! Rejecting it destroys unitarity and then you get FTL signalling!”

First, this is obviously atrocious writing. Most readers will get nothing out of this horrific sentence. He even abbreviated faster-than-light as FTL, to keep the density of understandable words to a minimum.

Second, this is horrible physics for the following reasons:

  • the levitation already violated conservation of energy, which you found anticlimactic (fuck you, Hariezer)
  • the deep area of physics concerned with conservation of energy is not quantum mechanics, it’s thermodynamics. Hariezer should have had a jargon dump about perpetual motion machines. To see how levitation violates conservation of energy, imagine taking a generator like the Hoover dam and casting a spell to levitate all the water from the bottom of the dam back up to the top to close the loop. As long as you have a wizard to move the water, you can generate power forever. Exercise for the reader: devise a perpetual motion machine powered by shape changers (hint: imagine an elevator system of two carts hanging over a pulley. On one side, an elephant, on the other a man. Elephant goes down, man goes up. At the bottom, the elephant turns into a man and at the top the man turns into an elephant. What happens to the pulley over time?). (A quick energy-balance sketch of both machines follows this list.)
  • the deeper area related to conservation of energy is not unitarity, as is implied in the quote. There is a really deep theorem in physics, due to Emmy Noether, that tells us that conservation of energy really means that physics is time-translationally invariant. This means there aren’t special places in time; the laws tomorrow are basically the same as yesterday and today. (Tangential aside: this is why we shouldn’t worry about a lack of energy conservation at the big bang; if the beginning of time was a special point, no one would expect energy to be conserved there.) Unitarity in quantum mechanics is basically a fancy way of saying probability is conserved. You CAN have unitarity without conservation of energy. Technical aside: it’s easy to show that if the unitary operator is time-translation invariant, there is an operator that commutes with the unitary operator, usually called the Hamiltonian. Without that assumption, we lose the Hamiltonian but maintain unitarity.
  • none of this has much to do at all with faster-than-light signalling, which would be the least of our concerns if we had just discovered a source of infinite energy.
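
To make the energy bookkeeping in that exercise concrete, here is a minimal sketch (my notation and setup, not the story's) of why both machines hand out free energy:

```latex
% Levitation-powered dam: each cycle, a mass m of water is levitated back up
% the dam height h for free, then dropped through the turbine.
E_{\text{out}} = mgh, \qquad E_{\text{in}} = 0
\;\Rightarrow\; \Delta E = mgh > 0 \text{ per cycle.}

% Shape-changer elevator: an elephant of mass M rides down by h while a man of
% mass m rides up by h; they swap forms at the ends and repeat.
\Delta E_{\text{per cycle}} = (M - m)\,g\,h > 0.
```

Either loop produces net positive energy every cycle with nothing paid in, which is exactly what conservation of energy forbids.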

I used to teach undergraduates, and I would often have some enterprising college freshman (who coincidentally was not doing well in basic mechanics) approach me to talk about why string theory was wrong. It always felt like talking to a physics madlibs book. This chapter let me relive those awkward moments.

Sorry to belabor this point so much, but I think it sums up an issue that crops up from time to time in Yudkowsky’s writing: when dabbling in a subject he doesn’t have much grounding in, he ends up giving actual subject matter experts a headache.

Summary of the chapter- McGonagall visits and does some magic, Harry is convinced magic is real, and they are off to go shop for Harry’s books.

Never Read Comments

I read the comments on an HPMOR chapter, which I recommend strongly against. I wish I could talk to several of the commentators, and gently talk them out of a poor financial decision.

Poor, misguided random internet person- your donation to MIRI/LessWrong will not help save the world. Even if you grant all their (rather silly) assumptions, MIRI is a horribly unproductive research institute- in more than a decade, it has published fewer peer-reviewed papers than the average physics graduate student does while in grad school. The majority of money you donate to MIRI will go into the generation of blog posts and fan fiction. If you are fine with that, then go ahead and spend your money, but don’t buy into the idea that this money will save the world.

HPMOR 3: uneventful, inoffensive

This chapter is worse than the previous chapters. As Hariezer (I realize this portmanteau isn’t nearly as clever as I seem to think it is, but I will continue to use it) enters Diagon Alley, he remarks:

It was like walking through the magical items section of an Advanced Dungeons and Dragons rulebook (he didn’t play the game, but he did enjoy reading the rulebooks).

For reasons not entirely clear to me, the line filled me with rage.

As they walk McGonagall tells Hariezer about Voldemort, noting that other countries failed to come to Britain’s aid. This prompts Hariezer to immediately misuse the idea of the Bystander Effect (an exercise left to the reader- do social psychological phenomena that apply to individuals also apply to collective entities, like countries? Are the social-psychological phenomena around failure to act in people likely to also explain failure to act as organizations?)

That’s basically it for this chapter. Uneventful chapter- slightly misused scientific stuff, a short walk through Diagon Alley, standard Voldemort stuff. The chapter ends with some very heavy-handed foreshadowing:

(And somewhere in the back of his mind was a small, small note of confusion, a sense of something wrong about that story; and it should have been a part of Harry’s art to notice that tiny note, but he was distracted. For it is a sad rule that whenever you are most in need of your art as a rationalist, that is when you are most likely to forget it.)

If Harry had only attended more CFAR workshops…

HPMOR 4: in which, for the first time, I wanted the author to take things further

So first, I actually like this chapter more than the previous few, because I think it’s beginning to try to deliver on what I want in the story. And now, my bitching will commence:

A recurring theme of the LessWrong sequences that I find somewhat frustrating is that (apart from the Bayesian Rationalist) the world is insane. This same theme pops up in this MOR chapter, where the world is created insane by Yudkowsky, so that Hariezer can tell you why.

Upon noticing the wizarding world uses coins of silver and gold, Hariezer asks about exchange rates and asks the bank goblin how much it would cost to get a big chunk of silver turned into coins. The goblin says he’ll check with his superiors, Hariezer asks him to estimate, and the estimate is that the fee is about 5% of the silver.

This prompts Hariezer to realize that he could do the following:

  1. Take gold coins and buy silver with them in the muggle world
  2. bring the silver to Gringotts and have it turned into coins
  3. convert the silver coins to gold coins, ending up with more gold than you started with, start the loop over until the muggle prices make it not profitable

(of course, the in-story explanation is overly jargon-filled as usual)
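
Here is a minimal sketch of that loop (the exchange rates below are made-up placeholders rather than numbers from the story; only the roughly 5% minting fee comes from the chapter):

```python
# Hypothetical rates -- assumptions for illustration, not canon.
MUGGLE_SILVER_PER_GOLD = 20.0   # silver one gold coin buys in the muggle world
WIZARD_SILVER_PER_GOLD = 17.0   # silver coins per gold coin inside the wizarding world
MINTING_FEE = 0.05              # the goblin's ~5% estimate from the chapter

def one_pass(gold):
    silver = gold * MUGGLE_SILVER_PER_GOLD        # step 1: buy silver with gold outside
    silver_coins = silver * (1 - MINTING_FEE)     # step 2: pay Gringotts to coin it
    return silver_coins / WIZARD_SILVER_PER_GOLD  # step 3: convert silver coins back to gold

gold = 100.0
for i in range(5):
    gold = one_pass(gold)
    print(f"after pass {i + 1}: {gold:.1f} gold coins")
```

As long as the gap between the muggle and wizard silver prices is bigger than the minting fee, your gold compounds every pass.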

This is somewhat interesting, and it’s the first look at what I want in a story like this- the little details of the wizarding world that would never be covered in a children’s story. Stross wrote a whole book exploring money/economics in a far-future society (Neptune’s Brood, it’s only ok); there is a lot of fertile ground for Yudkowsky here.

In a world where wizards can magic wood into gold, how do you keep counterfeiting at bay? Maybe the coins are made of special gold only goblins know how to find (maybe the goblin hordes hoard (wordplay!) this special gold like De Beers hoards diamonds).

Maybe the goblins carefully magic money into and out of existence in order to maintain a currency peg. Maybe it’s the perfect inflation- instead of relying on banks to disperse the coins, every now and then the coins in people’s pockets just multiply at random.

Instead, we get a silly, insane system (don’t blame Rowling either- Yudkowsky is more than willing to go off book, AND the details of this simply aren’t discussed, for good reason, in the genre Rowling wrote the books in), and rationalist Hariezer gets an easy ‘win’. It’s not a BAD section, but it feels lazy.

And a brief note on the writing style- it’s still oddly stilted, and I wonder how good it would be at explaining ideas to someone unfamiliar. For instance, Hariezer gets lost in thought, McGonagall says something, and Hariezer replies:

“"Hm?” Harry said, his mind elsewhere. “Hold on, I’m doing a Fermi calculation." "A what? ” said Professor McGonagall, sounding somewhat alarmed. “It’s a mathematical thing. Named after Enrico Fermi. A way of getting rough numbers quickly in your head…”“

Maybe it would feel less awkward for Hariezer to say “Hold on, I’m trying to estimate how much gold is in the vault.” And then instead of saying “it’s a math thing,” we could follow Hariezer’s thoughts as he carefully constructs his estimate (as it is, the estimate is crammed into a hard-to-read paragraph).

It’s a nitpick, sure, but the story thus far is loaded with such nits.

Chapter summary- Harry goes to Gringotts, takes out money.

HPMOR 5: in which the author assures us repeatedly this chapter is funny

This chapter is, again, mostly inoffensive, although there is a weird tonal shift. The bulk of this chapter is played broadly for laughs. There is actually a decent description of the fundamental attribution error, although it’s introduced with this twerpy bit of dialogue:

Harry looked up at the witch-lady’s strict expression beneath her pointed hat, and sighed. “I suppose there’s no chance that if I said fundamental attribution error you’d have any idea what that meant.”

This sort of thing seems like awkward pedagogy. If the reader doesn’t know it, Hariezer is now exasperated with the reader as well as with whoever Yudkowsky is currently using as a foil.

Now, the bulk of this chapter involves Hariezer being left alone to buy robes, where he meets and talks to Draco Malfoy. Hariezer, annoyed at having people say “OMG, YOU ARE HARRY POTTER!” to him, exclaims “OMG, YOU ARE DRACO MALFOY!” upon meeting Malfoy and learning his name. Malfoy accepts this as a perfectly normal reaction to his imagined fame, and a mildly amusing conversation occurs. It’s a fairly clever idea.

Unfortunately, it’s marred by the literary equivalent of a sitcom laugh track. Worried that the reader isn’t sure if they should be laughing, Yudkowsky interjects phrases like these throughout:

Draco’s attendant emitted a sound like she was strangling but kept on with her work

One of the assistants, the one who’d seemed to recognise Harry, made a muffled choking sound.

One of Malkin’s assistants had to turn away and face the wall.

Madam Malkin looked back silently for four seconds, and then cracked up. She fell against the wall, wheezing out laughter, and that set off both of her assistants, one of whom fell to her hands and knees on the floor, giggling hysterically.

The reader is constantly told that the workers in the shop find it so funny they can barely contain their laughter. It feels like the author constantly yelling GET IT YOU GUYS? THIS IS FUNNY!

As far as the writing goes, the tonal shift to broad comedy feels a bit strange and happens with minimal warning (there is a brief conversation earlier in the chapter that’s also played for a laugh), and everything is as stilted as it’s always been. For example, when McGonagall walks into the robe shop in time to hear Malfoy utter some absurdities, Harry tells her

“He was in a situational context where those actions made internal sense -”

Luckily, Hariezer gets cut off before he starts explaining what a joke is.

Chapter summary- Hariezer buys robes, talks to Malfoy.

HPMOR 6: Yud lets it all hang out

The introduction suggested that the story really gets moving after chapter 5. If this is an example of what “really moving” looks like, I fear I’ll soon stop reading. Apart from my rant about chapter 2, things had been largely light and inoffensive up until this chapter. Here, I found myself largely recoiling. We shift from the broad comedy of the last chapter to a chapter filled with weirdly dark little rants.

As should be obvious by now, I find the line between Eliezer and Harry to be pretty blurry (hence my annoying use of Hariezer). In this chapter, that line disappears completely as we get passages like this

Harry had always been frightened of ending up as one of those child prodigies that never amounted to anything and spent the rest of their lives boasting about how far ahead they’d been at age ten. But then most adult geniuses never amounted to anything either. There were probably a thousand people as intelligent as Einstein for every actual Einstein in history. Because those other geniuses hadn’t gotten their hands on the one thing you absolutely needed to achieve greatness. They’d never found an important problem.

There are dozens of such passages that could be ripped directly from some of Hariezer’s friendly AI writing and pasted right into MOR. It’s a bit disconcerting, in part because it’s forcing me to face just how much of Eliezer’s other writing I’ve wasted time with.

The chapter begins strongly enough, Hariezer starts doing some experiments with his magic pouch. If he asks for 115 gold coins, it comes, but not if he asks for 90+25 gold coins. He tries using other words for gold in other languages, etc. Unfortunately, it leads him to say this:

“I just falsified every single hypothesis I had! How can it know that ‘bag of 115 Galleons’ is okay but not ‘bag of 90 plus 25 Galleons’? It can count but it can’t add? It can understand nouns, but not some noun phrases that mean the same thing?…The rules seem sorta consistent but they don’t mean anything! I’m not even going to ask how a pouch ends up with voice recognition and natural language understanding when the best Artificial Intelligence programmers can’t get the fastest supercomputers to do it after thirty-five years of hard work,”

So here is the thing- it would be very easy to write a parser that behaves exactly like what Hariezer describes with his bag. You would just have a look-up table with lots of single words for gold in various languages. Nothing fancy at all. It’s behaving oddly ENTIRELY BECAUSE IT’S NOT DOING NATURAL LANGUAGE. I hope we revisit the pouch in a later chapter to sort this out. I reiterate, it’s stuff like this that (to me at least) was the whole premise of this story- flesh out the rules of this wacky universe.
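
For concreteness, here is roughly what that look-up-table “parser” could look like (a minimal sketch with an invented word list, not anything from the story):

```python
import re

# A handful of single words for gold/galleons in various languages -- the whole vocabulary.
GOLD_WORDS = {"galleons", "gold", "oro", "aurum", "kin"}

def pouch_request(phrase):
    # Accept exactly one phrase shape: "bag of <number> <gold-word>".
    match = re.fullmatch(r"bag of (\d+) (\w+)", phrase.lower())
    if match and match.group(2) in GOLD_WORDS:
        return int(match.group(1))  # dispense this many coins
    return None                     # otherwise the pouch does nothing

print(pouch_request("bag of 115 Galleons"))         # 115
print(pouch_request("bag of 90 plus 25 Galleons"))  # None -- it can't add
```

Nothing in there understands language; it pattern-matches one phrase shape against a word list, which is exactly why “bag of 90 plus 25 Galleons” fails.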

Immediately after this, the story takes a truly bizarre turn. Hariezer spots a magic first aid kit, and wants to buy it. In order to be a foil for super-rationalist Harry, McGonagall then immediately becomes immensely stupid and tries to dissuade him from purchasing it. Note, she doesn’t try to dissuade him by saying “Oh, there are magical first aid kits all over the school,” or “there are wizards watching over the boy who lived who can heal you with spells if something happens” or anything sensible like that, she just starts saying he’d never need it.

This leads Harry to a long description of the planning fallacy, and he says to counter it he always tries to assume the worst possible outcomes. (Note to Harry and the reader: the planning fallacy is a specific thing that occurs when people or organizations plan to accomplish a task. What Harry is trying to overcome is more correctly optimism bias.)

This leads McGonagall to start lightly suggesting (apropos of nothing) that maybe Harry is an abused child. Hariezer responds with this tale:

"There’d been some muggings in our neighborhood, and my mother asked me to return a pan she’d borrowed to a neighbor two streets away, and I said I didn’t want to because I might get mugged, and she said, ‘Harry, don’t say things like that!’ Like thinking about it would make it happen, so if I didn’t talk about it, I would be safe. I tried to explain why I wasn’t reassured, and she made me carry over the pan anyway. I was too young to know how statistically unlikely it was for a mugger to target me, but I was old enough to know that not-thinking about something doesn’t stop it from happening, so I was really scared.” … I know it doesn’t sound like much,” Harry defended. “But it was just one of those critical life moments, you see? … That’s when I realised that everyone who was supposed to protect me was actually crazy, and that they wouldn’t listen to me no matter how much I begged them.”

So we are back to “the world is insane,” as filtered through this odd little story.

Then McGonagall asks if Harry wants to buy an owl, and Harry says no, he’d be too worried he’d forget to feed it or something. Which prompts McGonagall AGAIN to suggest Harry had been abused, which leads Harry into an odd rant about how false accusations of child abuse ruin families (which is true, but seriously, is this the genre for this rant? What the fuck is happening with this chapter?). This ends up with McGonagall implying Harry must have been abused because he is so weird, and maybe someone cast a spell to wipe his memory of it (the spell comes up after Harry suggests repressed memories are BS pseudoscience, which again, is true, BUT WHY IS THIS HAPPENING IN THIS STORY?)

Harry uses his ‘rationalist art’ (literally “Harry’s rationalist skills begin to boot up again”) to suggest an alternative explanation

"I’m too smart, Professor. I’ve got nothing to say to normal children. Adults don’t respect me enough to really talk to me. And frankly, even if they did, they wouldn’t sound as smart as Richard Feynman, so I might as well read something Richard Feynman wrote instead. I’m isolated, Professor McGonagall. I’ve been isolated my whole life. Maybe that has some of the same effects as being locked in a cellar. And I’m too intelligent to look up to my parents the way that children are designed to do. My parents love me, but they don’t feel obliged to respond to reason, and sometimes I feel like they’re the children - children who won’t listen and have absolute authority over my whole existence. I try not to be too bitter about it, but I also try to be honest with myself, so, yes, I’m bitter.

After that weird back and forth the chapter moves on, Harry goes and buys a wand, and then from conversation begins to suspect that Voldemort might still be alive. When McGonagall doesn’t want to tell him more, “a terrible dark clarity descended over his mind, mapping out possible tactics and assessing their consequences with iron realism.”

This leads Hariezer to blackmail McGonagall- he won’t tell people Voldemort is still alive if she tells him about the prophecy. It’s another weird bit in a chapter absolutely brimming with weird bits.

Finally they go to buy a trunk, but they are low on gold (note to the reader: here would have been an excellent example of the planning fallacy). But luckily Hariezer had taken extra from the vault. Rather than simply saying “oh, I brought some extra”, he says

So - suppose I had a way to get more Galleons from my vault without us going back to Gringotts, but it involved me violating the role of an obedient child. Would I be able to trust you with that, even though you’d have to step outside your own role as Professor McGonagall to take advantage of it?

So he logic-chops her into submission, or whatever, and they buy the trunk.

This chapter for me was incredibly uncomfortable. McGonagall behaves very strangely so she can act as a foil for all of Hariezer’s rants, and when the line between Hariezer and Eliezer fell away completely, it felt a bit oddly personal.

Oh, right, there was also a conversation about the rule against underage magic

"Ah,” Harry said. “That sounds like a very sensible rule. I’m glad to see the wizarding world takes that sort of thing seriously.”

I can’t help but draw parallels to the precautions Yud wants with AI.

Summary: Harry finished buying school supplies (I hope).

HPMOR 7: Uncomfortable elitism, and rape threats

A brief warning: Like always I’m typing this thing on my phone, so strange spell-check driven typos almost certainly abound. However, I’m also pretty deep in my cups (one of the great privileges of leaving academia is that I can afford to drink Lagavulin more than half my age like it’s water. The downside is I no longer get to teach, and so must pour my wisdom out in the form of a critique of a terrible fan fiction that all of one person is probably reading)

This chapter took the weird tonal shift from the last chapter and just ran with it.

We are finally heading toward Hogwarts, so the chapter opens with the classic platform 9 3/4 bit from the book. And then it takes an uncomfortable elitist tone: Harry asks Ron Weasley to call him “Mr. Spoo” so that he can remain incognito, and Ron, a bit confused, says “Sure Harry.” That one slip-up allows Hariezer to immediately peg Ron as an idiot. In the short conversation that follows he mentally thinks of Ron as stupid several times and then he tries to explain to Ron why Quidditch is a stupid game.

It is my understanding from a (rather loose) reading of the books that, like cricket, quidditch games last weeks, EVEN MONTHS. In a game lasting literally weeks, one team could conceivably be up by 15 goals. In one of the books, I believe an important match went against the team that caught the snitch. This is not to entirely defend quidditch, but it doesn’t HAVE to be an easy target. I think part of the ridicule that quidditch gets is that non-British/non-Indian audiences are perhaps not capable of appreciating that there are sports (cricket) that are played out over weeks and are very high scoring.

Either way, the WAY that Hariezer attacks quidditch is at the expense of Ron, and it feels like a nerd sneering at a jock for liking sports. But that’s just the lead-up to the cloying nerd-elitism. Draco comes over, Hariezer is quick to rekindle that budding friendship, and we get the following conversation about Ron:

If you didn’t like him,” Draco said curiously, “why didn’t you just walk away?” "Um… his mother helped me figure out how to get to this platform from the King’s Cross Station, so it was kind of hard to tell him to get lost. And it’s not that I hate this Ron guy,” Harry said, “I just, just…” Harry searched for words. "Don’t see any reason for him to exist?" offered Draco. "Pretty much.

Just cloying, uncomfortable levels of nerd-elitism.

Now that Hariezer and Draco are paired back up, they can have a lot of uncomfortable conversations. First, Draco shares something only slightly personal, which leads to this

"Why are you telling me that? It seems sort of… private…” Draco gave Harry a serious look. “One of my tutors once said that people form close friendships by knowing private things about each other, and the reason most people don’t make close friends is because they’re too embarrassed to share anything really important about themselves.” Draco turned his palms out invitingly. “Your turn?”

Hariezer considers this a masterful use of the social psychology idea of reciprocity (which just says if you do something for someone, they’re likely to do it for you). Anyway, this exchange is just a lead-up to this, which feels like shock value for no reason:

"Hey, Draco, you know what I bet is even better for becoming friends than exchanging secrets? Committing murder." "I have a tutor who says that," Draco allowed. He reached inside his robes and scratched himself with an easy, natural motion. "Who’ve you got in mind?" Harry slammed The Quibbler down hard on the picnic table. “The guy who came up with this headline.” Draco groaned. “Not a guy. A girl. A ten-year-old girl, can you believe it? She went nuts after her mother died and her father, who owns this newspaper, is convinced that she’s a seer, so when he doesn’t know he asks Luna Lovegood and believes anything she says.” … Draco snarled. “She has some sort of perverse obsession about the Malfoys, too, and her father is politically opposed to us so he prints every word. As soon as I’m old enough I’m going to rape her.”

So, Hariezer is joking about the murder (it’s made clear later), but WHAT THE FUCK IS HAPPENING? These escalating friendship-tests feel contrived; reciprocity is effective when you don’t make demands immediately, which is why when you get a free sample at the grocery store the person at the counter doesn’t say “did you like that? Buy this juice or we won’t be friends anymore.” This whole conversation feels ham-fisted; Hariezer is consistently telling us about all the manipulative tricks they are both using. It’s less a conversation and more two people who just sat through a shitty marketing seminar trying out what they learned. WITH RAPE.

After that, Draco has a whole spiel about how the legal system of the wizard world is in the pocket of the wealthy, like the Malfoys, which prompts Hariezer to tell us that only countries descended from the Enlightenment have law-and-order (and I take it from comments that originally there was some racism somewhere in here that has since been edited out). Note: the wizarding world HAS LITERAL MAGIC TRUTH POTIONS, but we are to believe our Enlightenment legal system works better? This seems like an odd, unnecessary narrative choice.

Next, Hariezer tries to recruit Draco to the side of science with this:

Science doesn’t work by waving wands and chanting spells, it works by knowing how the universe works on such a deep level that you know exactly what to do in order to make the universe do what you want. If magic is like casting Imperio on someone to make them do what you want, then science is like knowing them so well that you can convince them it was their own idea all along. It’s a lot more difficult than waving a wand, but it works when wands fail, just like if the Imperius failed you could still try persuading a person.

I’m not sure why you’d use persuasion/marketing as a shiny metaphor for science, other than that it’s the theme of this chapter. “If you know science you can manipulate people as if you were literally in control of them” seems like a broad and mostly untrue claim. AND IT FOLLOWS IMMEDIATELY AFTER HARRY EXPLAINED THE MOON LANDING TO DRACO. Science can take you to the fucking moon, maybe that’s enough.

This chapter also introduces comed-tea, a somewhat clever pun drink. If you open a can, at some point you’ll do a spit-take before finishing it. I’m not sure what the point of this new magical introduction is; hopefully Hariezer gets around to exploring it (seriously, hopefully Hariezer begins to explore ANYTHING to do with the rules of magic. I’m 7 full chapters in and this fanfic has paid lip service to science without using it to explore magic at all).

Chapter summary: Hariezer makes it to platform 9 3/4, ditches Ron as somehow sub-human. Has a conversation with Draco that is mostly framed as conversation-as-explicit-manipulation between Hariezer and Draco, and it’s very ham-fisted, but luckily Hariezer assures us it’s actually masterful manipulation, saying things like this, repeatedly:

And Harry couldn’t help but notice how clumsy, awkward, graceless his attempt at resisting manipulation / saving face / showing off had appeared compared to Draco.

Homework for the interested reader: next time you are meeting someone new, share something embarrassingly personal and then ask them immediately to reciprocate, explicitly saying ‘it’ll make us good friends.’ See how that works out for you.

WHAT DO PEOPLE SEE IN THIS? It wouldn’t be so bad, but we are clearly supposed to identify with Hariezer, who looks at Draco as someone he clearly wants on his side, and who instantly dismisses someone (with no “Bayesian updates” whatsoever) as basically less than human. I’m honestly surprised that anyone read past this chapter. But I’m willing to trudge on, for posterity. Two more glasses of scotch, and then I start chapter 8. I’m likely to need alcohol to soldier on from here on out.

Side note: I’ve consciously not mentioned all the “take over the world” Hariezer references, but there are probably 3 or 4 per chapter. They seem at first like bad jokes, but they keep getting revisited so much that I think Hariezer’s explicit goal is perhaps not curiosity driven (figure out the rules of magic), but instead power driven (find out the rules of magic in order to take over the world). He assures Draco he really is Ravenclaw, but if he were written with consistency maybe he wouldn’t need to be? Hariezer doesn’t ask questions (like I would imagine a Ravenclaw would), he gives answers. Thus far, he has consistently decided the wizarding world has nothing to teach him. Arithmancy books he finds only go up to trigonometry, etc. He certainly has shown only limited curiosity so far. It’s unclear to me why a curiosity-driven, scientist character would feel a strong desire to impress and manipulate Draco Malfoy, as written here. This is looking less like a love-song to science, and more a love-song to some weird variant of How to Win Friends and Influence People.

A few Observations Regarding Hariezer Yudotter

After drunkenly reading chapters 8, 9, and 10 last night (I’ll get to the posts soon, hopefully), I was flipping channels and somehow settled on an episode of that old TV show with Steve Urkel (bear with me, this will get relevant in a second).

In the episode, the cool kid Eddie gets hustled at billiards, and Urkel comes in and saves the day because his knowledge of trigonometry and geometry makes him a master at the table.

I think perhaps this is a common dream of the science fetishist- if only I knew ALL OF THE SCIENCE I would be unstoppable at everything. Hariezer Yudotter is a sort of wish fulfillment character of that dream. Hariezer isn’t motivated by curiosity at all really, he wants to grow his super-powers by learning more science. It’s why we can go 10 fucking chapters without Yudotter really exploring much in the way of the science of magic (so far I count one lazy paragraph exploring what his pouch can do, in 10 chapters). It’s why he constantly describes his project as “taking over the world.” And it’s frustrating, because this obviously isn’t a flaw to be overcome; it’s part of Yudotter’s “awesomeness.”

I have a phd in a science, and it has granted me these real world super-powers:

  1. I fix my own plumbing, do my own home repairs, etc.
  2. I made a robot out of Legos and a Raspberry Pi that plays connect 4 incredibly well (my robot sidekick, I guess)
  3. Via techniques I learned in the sorts of books that, in the fictional world, Hariezer uses to become a master manipulator, I can optimize ads on webpages such that up to 3% of people will click on them (that is, seriously, the power of influence in reality. Not Hannibal Lecter but instead advertisers trying to squeeze an extra tenth of a percent on conversions), for which companies sometimes pay me
  4. If you give me a lot of data, I can make a computer find patterns in it, for which companies sometimes pay me.

That’s basically it. Back when I worked in science, I spent nearly a decade of my life calculating various background processes related to finding a Higgs boson, and I helped design some software theorists now use to calculate new processes quickly. These are the sorts of projects scientists work on, and most days it’s hard work and total drudgery, and there is no obvious ‘instrumental utility’- BUT I REALLY WANTED TO KNOW IF THERE WAS A HIGGS FIELD.

And that’s why I think the Yudotter character doesn’t feel like a scientist- he wants to be stronger, more powerful, take over the world, but he doesn’t seem to care what the answers are. It’s all well and good to be driven, but most importantly, you have to be curious.

HPMOR 8: Back to the inoffensive chapters of yesteryear

And a dramatic tonal shift and we are back to a largely inoffensive chapter.

There is another lesson in this chapter, this time confirmation bias (though Yudkowsky/Hariezer refer to it as ‘positive bias’), but once again the pedagogical choices are strange. As Hariezer winds into his lesson to Hermione, she thinks the following:

She was starting to resent the boy’s oh-so-superior tone…but that was secondary to finding out what she’d done wrong.

So Yudkowsky knows his Hariezer has a condescending tone, but he runs with it. So as a reader, if I already know the material, I get to be on the side of truth and righteousness and I can condescend to the simps with Hariezer; OR, I don’t know the material, and then Hermione is my stand-in, and I have to swallow being condescended to in order to learn.

Generally, it’s not a good idea, when you want to teach someone something, to immediately put them on the defensive- I’ve never stood in front of a class, or tutored someone, by saying

"The sad thing is you probably did everything the book told you to do… unless you read the very, very best sort of books, they won’t quite teach you how to do science properly…

And Yudkowsky knows enough that his tone is off-putting to point to it. So I wonder- is this story ACTUALLY teaching people things? Or is it just a way for people who already know some of the material to feel superior to Hariezer’s many foils? Do people go and read the sequences so that they can move from Hariezer-foil, to Hariezer’s point of view? (these are not rhetorical questions, if anyone has ideas on this).

As for the rest of the chapter- it’s good to see Hermione counts as human, unlike Ron. There is a strange bit in the chapter where Neville asks a Gryffindor prefect to find his frog, and the prefect says no (why? what narrative purpose does this serve?).

Chapter summary: Hariezer meets Neville and Hermione on the train to Hogwarts. Still no actual exploration of magic rules. None of the fun candy of the original story.

HPMOR 9 and 10

EDIT: I made a drunken mistake in this one, see this response. I do think my original point still goes through because the hat responds to the attempted blackmail with:

I know you won’t follow through on a threat to expose my nature, condemning this event to eternal repetition. It goes against the moral part of you too strongly, whatever the short-term needs of the part of you that wants to win the argument.

So the hat doesn’t say “I don’t care about this,” the hat says “you won’t do it.” My point is, however, substantially weakened.

END EDIT

Alright, the Lagavulin is flowing, and I’m once more equipped to pontificate.

These chapters are really one chapter split in two. I’m going to use them to argue against Yudkowsky’s friendly AI concept a bit. There is this idea, called ‘orthogonality’, that says that an AI’s goals can be completely independent of its intelligence. So you can say ‘increase happiness’ and this uber-optimizer can tile the entire universe with tiny molecular happy faces, because it’s brilliant at optimizing but incapable of evaluating its goals. Just setting the stage for the next chapter.

In this chapter, Harry gets sorted. When the sorting hat hits his head, Harry wonders if it’s self-aware, which, because of some not-really-explained magical hat property, instantly makes the hat self-aware. The hat finds being self-aware uncomfortable, and Hariezer worries that he’ll destroy an intelligent being when the hat is removed. The hat assures us that it cares only for sorting children. As Hariezer notes

It [the hat] was still imbued with only its own strange goals…

Even still, Hariezer manages to blackmail the hat- he threatens to tell all the other kids to wonder if the hat is self-aware. The hat concedes to the demand.

So how does becoming self-aware over and over affect the hat’s goal of sorting people? It doesn’t. The blackmail should fail. Yudkowsky imagines that the minute it became self-aware, the hat couldn’t help but pick up some new goals. Even Yudkowsky imagines that becoming self-aware will have some effects on your goals.

This chapter also has some more weirdly personal seeming moments when the line between Yudkowsky’s other writing and HPMOR breaks down completely.

Summary: Harry gets sorted into Ravenclaw.

I am immensely frustrated that I’m 10 chapters into this thing, and we still don’t have any experiments regarding the rules of magic.

HPMOR 11

Chapter 11 is “omake.” This is a personal pet peeve of mine, because I’m a crotchety old man at heart. The anime culture takes Japanese words, for which we have perfectly good English words, and snags them (kawaii/kawaisa is a big one). Omake is another one. I have nothing against Japanese (I’ve been known to speak it), I just don’t like unnecessary loanwords in general. I know this is my failing, BUT I WANT SO BAD TO HOLD IT AGAINST THIS FANFIC.

Either way, I’m skipping the extra content, because I can only take so much.

HPMOR 12

Nothing much in this chapter. Dumbledore gives his post-dinner speech.

Harry cracks open a can of comed-tea and does the requisite spit-take when Dumbledore starts his dinner speech with random nonsense. He considers casting a spell to make his sense of humor very specific, and then he can use comed-tea to take over the world.

Chapter summary: dinner is served and eaten

HPMOR 13: Bill and Ted’s Excellent Adventure

There is a scene in Bill and Ted’s Excellent Adventure, toward the end, where they realize their time machine gives them super-human powers. They need to escape a jail, so they agree to get the keys later and travel back in time and hide them, and suddenly there the keys are. After yelling to remember a trash can, they have a trash can to incapacitate a guard with, etc. They can do anything they want.

Anyway, this chapter is that idea, but much longer. The exception is that we don’t know there has been a time machine (actually, I don’t KNOW for sure that’s what it is, but the Bill and Ted fan in me says that’s what happened this chapter; I won’t find out until next chapter. If I were a Bayesian rationalist, I would say that the odds ratio is pi*10^^^^3 in my favor).

Hariezer wakes up and finds a note saying he is part of a game. Everywhere he looks, as if by magic, he finds more notes, deducting various game “points,” and some portraits have messages for him. The notes lead him to a pack of bullies beating up some Hufflepuffs, and pies mysteriously appear for Hariezer to attack with. The final note tells him to go to McGonagall’s office, and the chapter ends.

I assume next chapter, Hariezer will receive his time machine and future Hariezer will use it to set up the “game” as a prank on past Hariezer. It’s a clever enough chapter.

This chapter was actually decent, but what the world really needs is Harry Potter/Bill and Ted’s Excellent Adventure crossover fiction.

HPMOR 14: Let’s talk about computability

This chapter has created something in my brain like Mr. Burns’s Three Stooges Syndrome. So many things I want to talk about, I don’t know where to start!

First, I was slightly wrong about last chapter. It wasn’t a time machine Hariezer used to accomplish the prank in the last chapter, it was a time machine AND an invisibility cloak. Bill and Ted did not lead me astray.

On to the chapter- Hariezer gets a time machine (Hariezer lives 26-hour days, so he is given a time turner to correct his sleep schedule), which prompts this:

Say, Professor McGonagall, did you know that time-reversed ordinary matter looks just like antimatter? Why yes it does! Did you know that one kilogram of antimatter encountering one kilogram of matter will annihilate in an explosion equivalent to 43 million tons of TNT? Do you realise that I myself weigh 41 kilograms and that the resulting blast would leave A GIANT SMOKING CRATER WHERE THERE USED TO BE SCOTLAND?
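
As a quick back-of-envelope check of the 43-million-ton figure (my arithmetic, not the story's):

```latex
E = mc^2 = (2\,\mathrm{kg}) \times (3\times 10^{8}\,\mathrm{m/s})^2 \approx 1.8\times 10^{17}\,\mathrm{J}

\frac{1.8\times 10^{17}\,\mathrm{J}}{4.2\times 10^{9}\,\mathrm{J\ per\ ton\ of\ TNT}} \approx 4.3\times 10^{7}\ \text{tons of TNT}
```

That is about 43 megatons, so the quote's number checks out.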

Credit where credit is due- this is correct physics. In fact, it’s completely possible (though a bit old-fashioned and unwieldy) to treat quantum field theory such that all anti-matter is simply normal matter moving backward in time. Here is an example, look at this diagram:

If we imagine time moving from the bottom of the diagram toward the top, we see two electrons traveling forward in time, and exchanging a photon and changing directions.

But now imagine time moves left to right in the diagram instead- what we see is one electron and one positron coming together and destroying each other, and then a new pair forming from the photon. BUT, we COULD say that what we are seeing is really an electron moving forward in time, and an electron moving backward in time. The point where they “disappear” is really the point where the forward moving electron changed directions and started moving backward in time.

This is probably very confusing; if anyone wants a longer post about this, I could probably try for it sober. I need to belabor this though- the takeaway point I need you to know is that the best theory we have of physics so far can be interpreted as having particles that change direction in time, AND HARIEZER KNOWS THIS AND CORRECTLY NOTES IT.

Why is this important? Because a paragraph later he says this:

You know right up until this moment I had this awful suppressed thought somewhere in the back of my mind that the only remaining answer was that my whole universe was a computer simulation like in the book Simulacron 3 but now even that is ruled out because this little toy ISN’T TURING COMPUTABLE! A Turing machine could simulate going back into a defined moment of the past and computing a different future from there, an oracle machine could rely on the halting behavior of lower-order machines, but what you’re saying is that reality somehow self-consistently computes in one sweep using information that hasn’t… happened… yet..

This is COMPLETE NONSENSE (this is also terrible pedagogy again- either you already know what Turing computable means or you drown in jargon). For this discussion, Turing computable means ‘capable of being calculated using a computer.’ The best theory of physics we have (a theory Hariezer already knows about) allows the sort of thing that Hariezer is complaining about. Both quantum mechanics and quantum field theory are Turing computable. That’s not to say Hariezer’s time machine won’t require you to change physics a bit- you definitely will have to- but it’s almost certainly computable.

Now, computable does not mean QUICKLY computable (or even feasibly computable). The new universe might not be computable in polynomial time (quantum field theory may not be; at least one problem in it, the fermion sign problem, is not).
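
To make “computable, just not necessarily feasibly computable” concrete, here is a minimal sketch (my own toy model, nothing from the fic- the dynamics and names are invented for illustration): a self-consistent time loop can be brute-forced by searching for a fixed point over every possible “message from the future.”

    # Toy model: a history is self-consistent when the message sent back in
    # time is exactly the message that was received. Brute-force search over
    # all candidate messages is slow, but perfectly Turing computable.

    def evolve(world_state, message_received):
        # Invented dynamics: what gets sent back depends on the state and on
        # the message received from the future.
        return (world_state + 3 * message_received) % 17

    def consistent_histories(world_state, message_space):
        # Keep only the fixed points of the loop.
        return [m for m in message_space if evolve(world_state, m) == m]

    if __name__ == "__main__":
        print(consistent_histories(world_state=5, message_space=range(17)))  # [6]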

I don’t think the time machine makes P = NP either. Having a time machine will allow you to speed up computations (you could wait until a computation was done, and then send the answer back in time). However, Hariezer’s time machine is limited- it can only be used to move back 6 hours total, and can only be used 3 times in a day- so I don’t think it could generally solve an NP-complete problem in polynomial time (after your 6 hour head start is up, things proceed at the original scaling). If you don’t know anything about computational complexity, I guess if I get enough asks I can explain it in another, non-Potter post.

But my point here is- the author is supposedly an AI theorist. How is he flubbing computability? This should be bread-and-butter stuff.

I have so much more to say about this chapter. Another post will happen soon.

Edit: I wasn’t getting the P = NP thing, but I get the argument now (thanks Nostalgebraist). The idea is that you say “I’m going to compute some NP problem and come back with the solution,” and then ZIP, out pops another you from the time machine, who hands you a slip of paper with the answer on it. Now you have 6 hours to verify the calculation, and then zip back to give it to your former self.

But any problem in NP is checkable in P, so any problem small enough to be checkable in 6 hours (which is a lot of problems, including much of NP) is now computable in no time at all. It’s not a general P = NP, but it’s much wider in applicability than I was imagining.
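
To make the “checkable in P” step concrete, here is a minimal sketch (my example, not the fic’s): subset sum is NP-complete, so finding a solution is presumably hard, but verifying a proposed certificate is one cheap pass- exactly the kind of check you would have six hours to do.

    # Verifying an NP certificate in polynomial time: is `certificate` a set of
    # distinct indices into `numbers` whose values sum to `target`?

    def verify_subset_sum(numbers, target, certificate):
        return (len(set(certificate)) == len(certificate)
                and all(0 <= i < len(numbers) for i in certificate)
                and sum(numbers[i] for i in certificate) == target)

    if __name__ == "__main__":
        nums = [3, 34, 4, 12, 5, 2]
        # Finding the certificate is the hard part; checking it is trivial.
        print(verify_subset_sum(nums, target=9, certificate=[2, 4]))  # 4 + 5 = 9 -> True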

HPMOR 14 continued: Comed Tea, Newcomb’s problem, science

One of the odd obsessions of LessWrong is an old decision theory problem called Newcomb’s Paradox. It goes like this- a super intelligence that consistently predicts correctly challenges you to a game. There are two boxes, A and B, and you are allowed to take one or both boxes.

Inside box A is $10, and inside box B the super intelligence has already put $10,000 IF AND ONLY IF it predicted you will only take box B. What box should you take?

The reason this is a paradox is that one group of people (call them causal people) might decide that because the super intelligence ALREADY made its call, you might as well take both boxes. You can’t alter the past prediction.

Other people (call these LessWrongians) might say, ‘well, the super intelligence is always right, so clearly if I take box B I’ll get more money.’ Yudkowsky himself has tried to formalize a decision theory that picks box B, one that involves allowing causes to propagate backward in time.

A third group of people (call them ‘su3su2u1-ists’) might say “this problem is ill-posed. The idea of the super-intelligence might well be incoherent, depending on your model of how decisions are made.” Here is why- imagine human decisions can be quite noisy. For instance, what if I flip an unbiased coin to decide which box to take? Now the super-intelligence can only have had a 50/50 chance of successfully predicting which box I’d take, which contradicts the premise.

There is another simple way to show the problem is probably ill-posed. Imagine we take another super-intelligence of the same caliber as the first (call the first 1 and the second 2). 1 offers the same game to 2, and now 2 takes both boxes if it predicts that 1 put the money in box B. It takes only box B if 1 did not put the money in box B. Obviously, either intelligence 1 is wrong or intelligence 2 is wrong, which contradicts the premise, so the idea must be inconsistent (note, you can turn any person into super-intelligence number 2 by making the boxes transparent).
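
A quick simulation makes the coin-flip argument concrete (just my toy model, with made-up labels): against a player who literally decides by flipping a fair coin, no predictor can beat 50% accuracy, which contradicts the “consistently predicts correctly” premise.

    # Any prediction strategy is stuck at ~50% accuracy against a player whose
    # choice is an independent fair coin flip.
    import random

    def play_round(predictor):
        prediction = predictor()                        # made "in advance"
        choice = random.choice(["one-box", "two-box"])  # the player's coin flip
        return prediction == choice

    def accuracy(predictor, rounds=100_000):
        return sum(play_round(predictor) for _ in range(rounds)) / rounds

    if __name__ == "__main__":
        print(accuracy(lambda: "one-box"))                              # ~0.5
        print(accuracy(lambda: random.choice(["one-box", "two-box"])))  # ~0.5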

Anyway, Yudkowsky has a pet decision theory he has tried to formalize that allows causes to propagate backward in time. He likes this approach because you can get the LessWrongian answer to Newcomb every time. The problem is that his formalism runs into all sorts of inconsistencies because of the issues I raised above about a perfect predictor.

Why do I bring this up? Because Hariezer decides in this chapter that comed-tea MUST work by causing you to drink it right before something spit-take worthy happens. The tea predicts the humor, and then magics you into drinking it. Of course, he does no experiments to test this hypothesis at all (ironic, given that just a few chapters ago he lectured Hermione about only doing 1 experiment to test her idea).

So unsurprisingly perhaps, the single most used magic item in the story thus far is a manifestation of Yudkowsky’s favorite decision theory problem.

And my final note from this chapter- Hariezer drops this on us, regarding the brain’s ability to understand time travel:

Now, for the first time, he was up against the prospect of a mystery that was threatening to be permanent…. it was entirely possible that his human mind never could understand, because his brain was made of old-fashioned linear-time neurons, and this had turned out to be an impoverished subset of reality.

This seems to misunderstand the nature of mathematics and its relation to science. I can’t visualize a 4 dimensional curved space, certainly not the way I visualize 2 and 3d objects. But that doesn’t stop me from describing it and working with it as a mathematical object.

Time is ALREADY very strange and impossible to visualize. But mathematics allows us to go beyond what our brain can visualize, to create notations and languages that let us deal with anything we can formalize and that has consistent rules. It’s amazingly powerful.

I never thought I’d see Hariezer Yudotter, who just a few chapters back was claiming science could let us perfectly manipulate and control people (better than an imperio curse, or whatever the spell that lets you control people is called), argue that science/mathematics couldn’t deal with non-linear time.

I hope that this is a moment where in later chapters we see growth from Yudotter, and he revisits this last assumption. And I hope he does some experiments to test his comed-tea hypothesis. Right now it seems like experiments are things Hariezer asks people around him to do (so they can see things his way), but for him pure logic is good enough.

Chapter summary: I drink three glasses of scotch. Hariezer gets a time machine.

HPMOR 15: In which, once again, I want more science

I had a long post but the internet ate it earlier this week, so this is try 2. I apologize in advance, this blog post is mostly me speculating about some magi-science.

This chapter begins the long-awaited lessons in magic. The topic of today’s lesson consists primarily of one thing: don’t transfigure common objects into food or drink.

Mr. Potter, suppose a student Transfigured a block of wood into a cup of water, and you drank it. What do you imagine might happen to you when the Transfiguration wore off?” There was a pause. “Excuse me, I should not have asked that of you, Mr. Potter, I forgot that you are blessed with an unusually pessimistic imagination -“ "I’m fine," Harry said, swallowing hard. "So the first answer is that I don’t know,” the Professor nodded approvingly, “but I imagine there might be… wood in my stomach, and in my bloodstream, and if any of that water had gotten absorbed into my body’s tissues - would it be wood pulp or solid wood or…” Harry’s grasp of magic failed him. He couldn’t understand how wood mapped into water in the first place, so he couldn’t understand what would happen after the water molecules were scrambled by ordinary thermal motions and the magic wore off and the mapping reversed.

We get a similar warning regarding transfiguring things into any gases or liquids:

You will absolutely never under any circumstances Transfigure anything into a liquid or a gas. No water, no air. Nothing like water, nothing like air. Even if it is not meant to drink. Liquid evaporates, little bits and pieces of it get into the air.

Unfortunately, once again, I want the author to take it farther. Explore some actual science! What WOULD happen if that wood-water turned back into wood in your system?

So let’s take a long walk off a short speculative pier together, and try to guess what might happen. First, we’ll assume magic absorbs any major energy differences and smooths over any issues at the time of transition. Otherwise, when you magic in a few large wood molecules in place of much smaller water molecules, there will suddenly be lots of energy from the molecules repelling each other (this is called a steric mismatch), which will likely cause all sorts of problems (like a person exploding).

To even begin to answer, we have to pick a rule for the transition. Let’s assume each water molecule turns into one “wood molecule” (wood is ill-defined on a molecular scale, it’s made up of lots of shit. However, that shit is mostly long carbohydrate chains called polysaccharides.)

So you’d drink the water, which gets absorbed pretty quickly by your body (any that’s lingering in your gut unabsorbed will just turn into more fiber in your diet). After a while, it would spread through your body, be taken up by your cells, and then these very diffuse water molecules would turn into polysaccharides. Luckily for you, your body probably knows how to deal with this- polysaccharides are hanging out all over your cells anyway. Maybe somewhat surprisingly, you’d probably be fine. I think for lots of organic material, swapping out one organic molecule for another is likely to not harm you much. Of course, if the thing you swap in is poison, that’s another story.

Now, I’ve cheated somewhat- I could pick another rule where you’d definitely die. Imagine swapping in a whole splinter of wood for each water molecule. You’d be shredded. The details of magic matter here, so maybe a future chapter will give us the info needed to revisit this.

What if instead of wood, we started with something inorganic like gold? If the water molecules turn into elemental gold (and you don’t explode from steric mismatches mentioned above), you’d be fine as long as the gold didn’t ionize. Elemental gold is remarkably stable, and it takes quite a bit of gold to get any heavy metal poisoning from it.

On the other hand, if it ionizes you’ll probably die. Gold salts (which split into ionic gold + other stuff in your system) have median lethal doses (the dose that kills half of the people who take it) of just a few mg per kg, so a 70 kg person couldn’t survive more than a few hundred mg of the salt, which is even less ionic gold. So in this case, as soon as the spell wore off you’d start to be poisoned. After a few hours, you’d probably start showing signs of liver failure (jaundice, etc).

Water chemistry/physics is hard, so I have no idea if the gold atoms would actually ionize. Larger gold crystals definitely would not, and researchers are investigating gold nanoparticles for medicine, which are also mostly non-toxic. However, individual atoms might still ionize.

What if we don’t drink the water? What if we just get near a liquid evaporating? Nothing much at all, as it turns out. Evaporation is a slow process, as is diffusion.

Diffusion constants for a vapor in air are typically a fraction of a centimeter^2 per second (water vapor in air is roughly 0.25 cm^2/s), and diffusion is a slow process that moves forward with the square root of time (to move twice as far takes 4 times as much time).

So even if the transformation into water lasts a full hour, a single water molecule that evaporates from the glass will travel less than 100 centimeters! So unless you are standing with your face very close to the glass, you are unlikely to encounter even a single evaporated molecule. Even with your face right near the glass, that one molecule will most likely just be breathed in and breathed right back out. You have a lot of anatomic dead-space in your lungs in which no exchange takes place, and the active area is optimized for picking up oxygen.
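
Here is the back-of-envelope version of that estimate, assuming D of roughly 0.25 cm²/s for water vapor in air (the code and the numbers are mine, not the fic’s):

    # RMS displacement for 3D diffusion is sqrt(6 * D * t).
    # Assumption: D ~ 0.25 cm^2/s, roughly right for water vapor in air at
    # room temperature.
    import math

    def rms_diffusion_distance_cm(D_cm2_per_s, t_seconds):
        return math.sqrt(6 * D_cm2_per_s * t_seconds)

    if __name__ == "__main__":
        print(f"{rms_diffusion_distance_cm(0.25, 3600):.0f} cm")  # ~73 cm in an hour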

So how about transfiguring things to a gas? What happens there? Once again, this will depend on how we choose the rules of magic. When you make the gas, does it come in at room temperature and pressure? If so, this sets the density. Then you can either bring in a volume of gas equal to the original object, which contains very few molecules, or bring in an equal number of molecules, which takes up a much larger volume.

At an equal number of molecules, you’ll get hundreds of liters of diffuse gas. Your lungs only hold about 5 liters, so you are going to get a much smaller dose than you’d get from the water (a few percent at best), where all the molecules get taken up by your body. Also, your lungs won’t absorb most of the gas- much will get blown back out, further lowering the dose.

If it’s an equal volume to the original object, then there will be only a small puff of gas with very few molecules, and the diffusion argument applies- unless you get very near where you created the gas, you aren’t likely to breathe any in at all.
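
For the “hundreds of liters” figure above, here is a rough check with assumed numbers (a cup of water’s worth of molecules, ideal-gas molar volume near room temperature- again, my numbers, not the fic’s):

    # One cup of water (~250 g) turned molecule-for-molecule into gas at room
    # temperature and pressure.
    MOLAR_MASS_WATER_G = 18.0
    MOLAR_VOLUME_L = 24.5  # L/mol near room temperature and pressure

    def gas_volume_liters(mass_g, molar_mass_g=MOLAR_MASS_WATER_G):
        moles = mass_g / molar_mass_g
        return moles * MOLAR_VOLUME_L

    if __name__ == "__main__":
        print(f"{gas_volume_liters(250):.0f} L")  # ~340 L, versus ~5 L of lung capacity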

Thus concludes a bit of speculative magi-science guess work. Sorry if I bored you.

Anyway- this chapter, I admit, intrigued me enough to spend some time thinking about what WOULD happen if something un-transfigured inside you. Not a bad chapter, really, but it again feels a tad lazy. We get some hazy worries about liquids evaporating (SCIENCE!) but no order-of-magnitude estimate about whether or not it matters (it does not, unless maybe you boiled the liquid you made). There are lots of scientific ideas the author could play with, but they just get set aside.

As for the rest of the chapter, Hariezer gets shown up by Hermione, who is out-performing him and has already read her school books. A competition for grades is launched.

HPMOR 16: Shades of Ender’s game

I apologize for the longish break from HPMOR, sometimes my real job calls.

This chapter mostly consists of Hariezer’s first defense against the dark arts class. We meet the ultra-competent Quirrell (although I suppose, like in the original, it’s really the ultra-competent Voldemort) for the first time.

The lesson opens with a bit of surprising anti-academic sentiment- Quirrell gives a long speech about how you needn’t learn to defend yourself against anything specific in the wizarding world, because you could either magically run away or just use the instant killing spell, so the entire “Ministry-mandated” course with its “useless” textbooks is unnecessary. Of course, this comes from the mouth of the ostensible bad guy, so it’s unclear how much we are supposed to be creeped out by this sentiment (though Hariezer applauds).

After this, we get to the lesson. After teaching a light attack spell, Quirrell asks Hermione (who mastered it fastest) to attack another student. She refuses, so Quirrell moves on to Malfoy, who is quick to acquiesce by shooting Hermione.

Then Quirrell puts Hariezer on the spot and things get sort of strange. When asked for unusual combat uses of everyday items, Hariezer comes up with a laundry list of outlandish ways to kill people, which leads Quirrell to observe that for Hariezer Yudotter nothing is defensive- he settles only for the destruction of his enemy. This feels very Ender’s game (Hariezer WINS, and that makes him dangerous), and is sort of a silly moment.

Chapter summary: weirdly anti-academic defense against the dark arts lesson. We once more get magic, but no rules of magic.

HPMOR 17: Introducing Dumbledore, and some retreads on old ideas

This chapter opens with a little experiment in which Hariezer tries to use the time turner to verify the solution to an NP-complete problem, as we discussed in a previous chapter section. Since it’s old ground, we won’t retread it.

From here, we move on to the first broomstick lesson, which proceeds much like the book, only with shades of elitism. Hariezer drops this nugget on us:

There couldn’t possibly be anything he could master on the first try which would baffle Hermione, and if there was and it turned out to be broomstick riding instead of anything intellectual, Harry would just die.

Which feels a bit like the complete dismissal of Ron earlier. So the anti-jock Hariezer, who wouldn’t be caught dead being good at broomsticking, doesn’t get involved in racing around to try to get Neville’s Remembrall; instead the entire class ends up in a standoff, wands drawn. So Hariezer challenges the Slytherin who has it to a strange duel. Using his time turner in proper Bill and Ted fashion, he hides a decoy Remembrall and wins. It’s all old stuff at this point- I’m starting to worry there is nothing new under the sun- more time turner, more Hariezer winning (in case we don’t get it, there is a conversation with McGonagall where Hariezer once more realizes he doesn’t even consider NOT winning).

AND THEN we meet Dumbledore, who is written as a lazy man’s version of insane. He’ll say something insightful, drop a Lord of the Rings quote, and then immediately do something batshit. One moment he is trying to explain that Harry can trust him, the next he is setting a chicken on fire (yes, this happens). In one baffling moment, he presents Hariezer with a big rock, and this exchange happens:

So… why do I have to carry this rock exactly?” "I can’t think of a reason, actually," said Dumbledore. "…you can’t." Dumbledore nodded. “But just because I can’t think of a reason doesn’t mean there is no reason.” "Okay," said Harry, "I’m not even sure if I should be saying this, but that is simply not the correct way to deal with our admitted ignorance of how the universe works."

Now, if someone gave you a large heavy rock and said “keep this on you, just in case” how would you begin to tell them they’re wrong? Here is Hariezer’s approach:

How can I put this formally… um… suppose you had a million boxes, and only one of the boxes contained a diamond. And you had a box full of diamond-detectors, and each diamond-detector always went off in the presence of a diamond, and went off half the time on boxes that didn’t have a diamond. If you ran twenty detectors over all the boxes, you’d have, on average, one false candidate and one true candidate left. And then it would just take one or two more detectors before you were left with the one true candidate. The point being that when there are lots of possible answers, most of the evidence you need goes into just locating the true hypothesis out of millions of possibilities - bringing it to your attention in the first place. The amount of evidence you need to judge between two or three plausible candidates is much smaller by comparison. So if you just jump ahead without evidence and promote one particular possibility to the focus of your attention, you’re skipping over most of the work.

Thank God Hariezer was able to use his advanced reasoning skills to make an analogy with diamonds in boxes to explain WHY CARRYING A ROCK AROUND FOR NO REASON IS A STUPID IDEA. This was the chapter’s rationality idea- seriously, it’s like Yudkowsky didn’t even try on this one.
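
For what it’s worth, the arithmetic in the quoted example does check out- here is a quick sanity check (the setup is the fic’s, the code is mine):

    # One box in a million holds the diamond; each detector always fires on the
    # diamond and fires half the time on empty boxes.
    def expected_false_candidates(n_boxes, n_detectors, false_positive_rate=0.5):
        return (n_boxes - 1) * false_positive_rate ** n_detectors

    if __name__ == "__main__":
        print(expected_false_candidates(10**6, 20))  # ~0.95: "on average, one false candidate"
        print(expected_false_candidates(10**6, 22))  # ~0.24: one or two more detectors settles it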

Chapter summary: Hariezer sneers at broomstick riding, some (now standard) time turner hijinks, Hariezer meets a more insane than wise Dumbledore

HPMOR 18: What?

I wanted to discuss the weird anti-university/school-system undercurrents of the last few chapters, but I started into chapter 18 and it broke my brain.

This chapter is absolutely ludicrous. We meet Snape for the first time, and he behaves as you’d expect from the source material. He makes a sarcastic remark and asks Hariezer a bunch of questions Hariezer does not know the answer to.

This leads to Hariezer flipping out:

The class was utterly frozen. "Detention for one month, Potter," Severus said, smiling even more broadly. "I decline to recognize your authority as a teacher and I will not serve any detention you give." People stopped breathing. Severus’s smile vanished. “Then you will be -” his voice stopped short. "Expelled, were you about to say?" Harry, on the other hand, was now smiling thinly. "But then you seemed to doubt your ability to carry out the threat, or fear the consequences if you did. I, on the other hand, neither doubt nor fear the prospect of finding a school with less abusive professors. Or perhaps I should hire private tutors, as is my accustomed practice, and be taught at my full learning speed. I have enough money in my vault. Something about bounties on a Dark Lord I defeated. But there are teachers at Hogwarts who I rather like, so I think it will be easier if I find some way to get rid of you instead.”

Think about this- THE ONLY THINGS SNAPE HAS DONE are make a snide comment and ask Hariezer a series of questions he doesn’t know the answer to.

The situation continues to escalate, until Hariezer locks himself in a closet and uses his invisibility cloak and time turner to escape the classroom.

This leads to a meeting with the headmaster where Hariezer THREATENS TO START A NEWSPAPER CAMPAIGN AGAINST SNAPE (find a newspaper interested in the ‘some students think a professor is too hard on them, for instance he asked Hariezer Yudotter 3 hard questions in a row’ story).

AND EVERYONE TAKES THIS THREAT SERIOUSLY, AS IF IT COULD DO REAL HARM. HARIEZER REPEATEDLY SAYS HE IS PROTECTING STUDENTS FROM ABUSE. THEY TAKE THIS THREAT SERIOUSLY ENOUGH THAT HARIEZER NEGOTIATES A TRUCE WITH SNAPE AND DUMBLEDORE. Snape agrees to be less demanding of discipline, Hariezer agrees to apologize.

Nowhere in this chapter does Hariezer consider that he deprived other students of the damn potions lesson. In his ruminations about why Snape keeps his job, he never considers that maybe Snape knows a lot about potions/is actually a good potions teacher.

This whole chapter is basically a stupid power struggle that requires literally everyone in the chapter to behave in outrageously silly ways. Hariezer throws a temper tantrum befitting a 2 year old, and everyone else gives him his way.

On the plus side, McGonagall locks down Hariezer’s time turner, so hopefully that device will stop making an appearance for a while; it’s been the “clever” solution to every problem for several chapters now.

One more chapter this bad and I might have to abort the project.

HPMOR 19: I CAN’T EVEN… WHAT?…

I… this… what…

So there is a lot I COULD say here, about inconsistent characterization, ridiculously contrived events,etc. But fuck it- here is the key event of this chapter: Quirrel, it turns out, is quite the martial artist (because of course he is, who gives a fuck about genre consistency or unnecessary details, PILE IN MORE “AWESOME”). The lesson he claims to have learned from martial arts (at a mysterious dojo, because of course) that Hariezer needs to learn (as evidenced by his encounter with Snape) is how to lose.

How does Quirrel teach Hariezer “how to lose”? He calls Hariezer to the front, insists Hariezer not defend himself, and then has a bunch of slytherins beat the shit out of him.

That’s right- a character who one fucking chapter ago couldn’t handle being asked three hard questions in a row (IT’S ABUSE, I’LL CALL THE PAPERS) submits to being literally beaten by a gang at a teacher’s suggestion.

An 11 year old kid, at a teacher’s suggestion, submits to getting beaten by a bunch of 16 year olds. All of this is portrayed in a positive light.

Ideas around chapter 19

In light of the recent anon, I’m going to attempt to give the people (person?) what they want. Also, I went from not caring if people were reading this, to being a tiny bit anxious I’ll lose the audience I unexpectedly picked up. SELLING OUT.

If we ignore the literal child abuse of the chapter, the core of the idea is still somewhat malignant. It’s true throughout that Hariezer DOES have a problem with “knowing how to lose,” but the way you learn to lose is by losing, not by being ordered to take a beating.

Quirrell could have challenged Hariezer to a game of chess, he could have asked questions Hariezer didn’t know the answer to (as Snape did, which prompted the insane chapter 18), etc. But the problem is the author is so invested in Hariezer being the embodiment of awesome that even when he needs to lose for story purposes, to learn a lesson, Yudkowsky doesn’t want to let Hariezer actually lose at something. Instead he gets ordered to lose, and he isn’t ordered to lose at something in his wheelhouse, but at the “jock stuff” repeatedly sneered at in the story (physical confrontation).

HPMOR 20: why is this chapter called Bayes Theorem?

A return to what passes for “normal.” No child beating in this chapter, just a long, boring conversation.

This chapter opens with Hariezer ruminating about how much taking that beating sure has changed his life. He knows how to lose now, he isn’t going to become a dark lord now! Quirrell quickly takes him down a peg:

"Mr. Potter," he said solemnly, with only a slight grin, "a word of advice. There is such a thing as a performance which is too perfect. Real people who have just been beaten and humiliated for fifteen minutes do not stand up and graciously forgive their enemies. It is the sort of thing you do when you’re trying to convince everyone you’re not Dark, not -“

Hariezer protests, and we get

There is nothing you can do to convince me because I would know that was exactly what you were trying to do. And if we are to be even more precise, then while I suppose it is barely possible that perfectly good people exist even though I have never met one, it is nonetheless improbable that someone would be beaten for fifteen minutes and then stand up and feel a great surge of kindly forgiveness for his attackers. On the other hand it is less improbable that a young child would imagine this as the role to play in order to convince his teacher and classmates that he is not the next Dark Lord. The import of an act lies not in what that act resembles on the surface, Mr. Potter, but in the states of mind which make that act more or less probable

How does Hariezer take this? Does he point out “if no evidence can sway your priors, your priors are too strong?” or some other bit of logic-chop Bayes-judo? Nope, he drops some nonsensical jargon:

Harry blinked. He’d just had the dichotomy between the representativeness heuristic and the Bayesian definition of evidence explained to him by a wizard.

Where is Quirrell using Bayesian evidence? He isn’t- he is neglecting all evidence, because all evidence fits his hypothesis. Where does the representativeness heuristic come into play? It doesn’t.

The representativeness heuristic is making estimates based on how typical of a class something is, i.e. show someone a picture of a stereotypical ‘nerd’ and ask “is this person more likely an English or a physics grad student?” The representativeness heuristic says “you should answer physics.” It’s a good rule of thumb that psychologists think is probably hardwired into us. It also leads to some well-known fallacies I won’t get into here.
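
A toy Bayes calculation shows where the heuristic and the base rates part ways (all of the numbers below are made up for illustration):

    # Representativeness says "looks like a physicist, so probably a physicist."
    # Bayes says the base rates matter too.
    def posterior_physics(p_nerdy_given_physics, p_nerdy_given_english,
                          n_physics_grads, n_english_grads):
        prior_physics = n_physics_grads / (n_physics_grads + n_english_grads)
        prior_english = 1.0 - prior_physics
        evidence = (p_nerdy_given_physics * prior_physics
                    + p_nerdy_given_english * prior_english)
        return p_nerdy_given_physics * prior_physics / evidence

    if __name__ == "__main__":
        # Suppose the stereotype is 3x more common among physics grads, but
        # there are 4x as many English grad students overall.
        print(f"{posterior_physics(0.6, 0.2, 1000, 4000):.2f}")  # ~0.43: base rates matter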

Quirrell is of course doing none of that- Quirrell has a hypothesis that fits anything Hariezer could do, so no amount of evidence will dissuade him.

After this, Quirrell and Hariezer have a long talk about science (because of course Quirrell too has a fascination with space travel). This leads to some real Less Wrong stuff.

Quirrell tells us that of course muggle scientists are dangerous because

There are gates you do not open, there are seals you do not breach! The fools who can’t resist meddling are killed by the lesser perils early on, and the survivors all know that there are secrets you do not share with anyone who lacks the intelligence and the discipline to discover them for themselves!

And of course, Hariezer agrees

This was a rather different way of looking at things than Harry had grown up with. It had never occurred to him that nuclear physicists should have formed a conspiracy of silence to keep the secret of nuclear weapons from anyone not smart enough to be a nuclear physicist

Which is a sort of weirdly elitist position- after all, lots of nuclear physicists are plenty dangerous. It’s not intelligence that makes you less likely to drop a bomb. But this fits the general Yudkowsky/AI fear- an open research community is less important than hiding dangerous secrets. This isn’t necessarily the wrong position, but it’s a challenging one that merits actual discussion.

Anyone who has done research can tell you how important the open flow of ideas is for progress. I’m of the opinion that the increasing privatization of science is actually slowing us down in a lot of ways by building silos around information. How much do we retard progress in order to keep dangerous ideas out of people’s hands? Who gets to decide what is dangerous? Who decides who gets let into “the conspiracy?” Intelligence alone is no guarantee someone won’t drop a bomb, despite how obvious it seems to Quirrell and Yudotter.

After this digression about nuclear weapons, we learn from Quirrell that he snuck into NASA and enchanted the Pioneer gold plaque in a way that will “make it last a lot longer than it otherwise would.” It’s unclear to me what wear and tear Quirrell is protecting the plaque from. Hariezer suggests that Quirrell might have snuck a magic portrait or a ghost into the plaque, because nothing makes more sense than dooming an (at least semi) sentient being to a near eternity of solitary confinement.

Anyway, partway through this chapter, Dumbledore bursts in angry that Quirrell had Hariezer beaten. Hariezer defends him, etc. The resolution is that its agreed Hariezer will start learning to protect himself from mind readers.

Chapter summary- long, mostly boring conversation, peppered with some existential risk/we need to escape the planet rhetoric. It’s also called Bayes’ theorem despite that theorem making no appearance whatsoever.

And a note on the really weird pedagogy- we now have Quirrell, who in the books is possessed by Voldemort, acting as a mouthpiece for the author. This seems like a bad choice, because at some point I assume there will be a reveal, and it will turn out the reader shouldn’t have trusted Quirrell.

HPMOR 21: secretive science

So this chapter begins quite strangely- Hermione is worried that she is “bad” because she is enjoying being smarter than Hariezer. She then decides that she isn’t “bad,” it’s a budding romance. That’s the logic she uses. But because she won the book-reading contest against Hariezer (he doesn’t flip out, it must be because he learned “how to lose”), she gets to go on a date with him. The date is skipped over.

Next we find Hariezer meeting Malfoy in a dark basement, discussing how they will go about doing science. Malfoy is written as uncharacteristically stupid, in order to be a foil once more for Hariezer, peppering the conversation with such gems as:

Then I’ll figure out how to make the experimental test say the right answer!

"You can always make the answer come out your way,” said Draco. That had been practically the first thing his tutors had taught him. “It’s just a matter of finding the right arguments.”

We get a lot of platitudes from Hariezer about how science humbles you before nature. But then we get the same ideas Quirrell suggested previously: because “science is dangerous,” they are going to run their research program as a conspiracy.

"As you say, we will establish our own Science, a magical Science, and that Science will have smarter traditions from the very start.” The voice grew hard. “The knowledge I share with you will be taught alongside the disciplines of accepting truth, the level of this knowledge will be keyed to your progress in those disciplines, and you will share that knowledge with no one else who has not learned those disciplines. Do you accept this?”

And the name of this secretive scienspiracy?

And standing amid the dusty desks in an unused classroom in the dungeons of Hogwarts, the green-lit silhouette of Harry Potter spread his arms dramatically and said, “This day shall mark the dawn of… the Bayesian Conspiracy.”

Of course. As I mentioned in the previous chapter, anyone who has done science knows that it’s a collaborative process that requires an open exchange of ideas.

And see what I mean about the melding of ideas between Quirrell and Hariezer? It’s weird to use them both as author mouthpieces. The Bayesian Conspiracy is obviously an idea Yudkowsky is fond of, and here Hariezer gets the idea largely from Quirrell just one chapter back.

HPMOR 22: science!

This chapter opens strongly enough. Hariezer decides that the entire wizarding world has probably been wrong about magic, and doesn’t know the first thing about it.

Hermione disagrees, and while she doesn’t outright say “maybe you should read a magical theory book about how spells are created” (such a thing must exist), she is at least somewhat down that path.

To test his ideas, Hariezer creates a single-blind test- he gets spells from a book, changes the words or the wrist motion or whatnot, and gets Hermione to cast them. Surprisingly, Hariezer is proven wrong by this little test. For once, the world isn’t written as insane as a foil for our intrepid hero.

It seemed the universe actually did want you to say ‘Wingardium Leviosa’ and it wanted you to say it in a certain exact way and it didn’t care what you thought the pronunciation should be any more than it cared how you felt about gravity.

There are a few anti-academic snipes, because it wouldn’t be HPMOR without a little snide swipe at academia:

But if my books were worth a carp they would have given me the following important piece of advice…Don’t worry about designing an elaborate course of experiments that would make a grant proposal look impressive to a funding agency.

Weird little potshots about academia (comments like “so many bad teachers, its like 8% as bad as Oxford,” “Harry was doing better in classes now, at least the classes he considered interesting”) have been peppered throughout the chapters since Hariezer arrived at Hogwarts. Oh academia, always trying to make you learn things that might be useful, even if they are a trifle boring. So full of bad teachers, etc. Just constant little comments attacking school and academia.

Anyway, this chapter would be one of the strongest chapters, except there is a second half. In the second half, Hariezer partners with Draco to get to the bottom of wizarding blood purity.

Harry Potter had asked how Draco would go about disproving the blood purist hypothesis that wizards couldn’t do the neat stuff now that they’d done eight centuries ago because they had interbred with Muggleborns and Squibs.

Here is the thing about science: step 0 needs to be making sure you’re trying to explain a real phenomenon. Hariezer knows this- he tells the story of N-rays earlier in the chapter- but completely fails to understand the point.

Hariezer and Draco have decided, based on one anecdote (the founders of Hogwarts were the best wizards ever, supposedly), that wizards are weaker today than in the past. The first thing they should do is find out if wizards are actually getting weaker. After all, the two most dangerous dark wizards ever were both recent, Grindelwald and Voldemort. Dumbledore is no slouch. Even four students were able to make the Marauder’s Map just one generation before Harry. (Incidentally, this is exactly where neoreactionaries often go wrong- they assume things are getting worse without actually checking, and then create elaborate explanations for non-existent facts.)

Anyway, for the purposes of the story, I’m sure it’ll turn out that wizards are getting weaker, because Yudkowsky wrote it. But this would have been a great chance to teach an actually useful lesson, and it would make the N-ray story told earlier a useful example, and not a random factoid.

Anyway, to explain the effect they come up with a few obvious hypotheses:

  1. Magic itself is fading.
  2. Wizards are interbreeding with Muggles and Squibs.
  3. Knowledge to cast powerful spells is being lost.
  4. Wizards are eating the wrong foods as children, or something else besides blood is making them grow up weaker.
  5. Muggle technology is interfering with magic. (Since 800 years ago?)
  6. Stronger wizards are having fewer children. (Draco = only child? Check if 3 powerful wizards, Quirrell / Dumbledore / Dark Lord, had any children.)

They miss some other obvious ones (there is a finite amount of magic power, so increasing populations = more wizards = less power per wizard, for instance. Try to come up with your own, it’s easy and fun).

They come up with some ways to collect some evidence- find out what the first year curriculum was throughout Hogwarts history, and do some wizard genealogy by talking to portraits.

Still, finally some science, even if half of it was infuriating.

HPMOR 23: wizarding genetics made (way too) simple

Alright, I need to preface this: I have the average particle physicist’s knowledge of biology (a few college courses, long ago mostly forgotten). That said, the Lagavulin is flowing, so I’m going to pontificate as if I’m obviously right- please reblog me with corrections if I am wrong.

In this chapter, Hariezer and Draco are going to explore what I think of as the blood hypothesis- that wizardry is carried in the blood, and that intermarriage with non-magical types is diluting wizardry.

Hariezer gives Draco a brief, serviceable enough description of DNA (more like pebbles than water). He lays out two models. In the first, there are lots of wizarding genes, and the more wizarding genes you have, the more powerful a wizard you are. In this case, Hariezer reasons, as powerful wizards marry less powerful wizards, or non-magical types, the frequency of the magical variants of the wizarding genes in the general population becomes diluted. In this model, two squibs might rarely manage to have a wizard child, but that child is likely to be weaker than wizard-born wizards. Call this model 1.

The other model Hariezer lays out is that magic lies on a single gene, with the magical version recessive. He reasons squibs have one dominant, non-magical version and one recessive, magical version of the gene. So of kids born to two squibs, 1/4 will be wizards. In this version, you either have magic or you don’t, so if wizards married the non-magical, wizards themselves could become more rare, but the power of wizards won’t be diluted. Call this model 2.

The proper test between models 1 and 2, suggests Hariezer, is to look at the children born to two squibs. If about one fourth of them are wizards, it’s evidence for model 2; otherwise, it’s evidence for model 1.

There is a huge problem with this. Do you see it? Here is a hint, What other predictions does model 2 make? While you are thinking about it, read on.

Before I answer the question, I want to point out that Hariezer ignores tons of other plausible models. Here is one I just made up. Imagine, for instance, a single gene that switches magic on and off, and a whole series of other genes that make you a better wizard. Maybe some double-jointed-wrist gene allows you to move your wand in unusually deft ways. Maybe some mouth-shape gene allows you to pronounce magical sounds no one else can. In this case, magical talent can be watered down as in model 1, and wizard inheritance could still look like Mendel would suggest, as in model 2.

Alright, below I’m going to answer my query above. Soon there will be no time left to figure it out for yourself.

Squibs are, by definition, the non-wizard children of wizard parents. Hariezer’s model 2 predicts that squibs cannot exist. It is already empirically disproven.

Hariezer, of course, does not notice this massive problem with his favored model, and Draco’s collected genealogy suggests about 6 out of 28 squib-born children were wizards, so he declares that model 2 wins the test.
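
The Mendelian bookkeeping behind both of my objections is easy to check- this is standard genetics, nothing from the fic, and the genotype labels are my own:

    # Under model 2: wizards are mm, squibs are Mm, muggles are MM, and only
    # mm children show magic.
    from itertools import product
    from math import comb

    def offspring_fractions(parent1, parent2):
        # Each child gets one allele from each parent.
        kids = ["".join(sorted(a + b)) for a, b in product(parent1, parent2)]
        return {g: kids.count(g) / len(kids) for g in sorted(set(kids))}

    if __name__ == "__main__":
        # Wizard x wizard: every child is mm, i.e. model 2 predicts no squibs exist.
        print(offspring_fractions("mm", "mm"))  # {'mm': 1.0}
        # Squib x squib: a quarter of the children are mm wizards.
        print(offspring_fractions("Mm", "Mm"))  # {'MM': 0.25, 'Mm': 0.5, 'mm': 0.25}
        # And 6 wizards out of 28 squib-born children is perfectly ordinary if p = 1/4:
        p, n, k = 0.25, 28, 6
        print(comb(n, k) * p**k * (1 - p)**(n - k))  # ~0.16, near the peak of the binomial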

Draco flips out, because now that he “knows” that magic isn’t being watered down by breeding, he can’t join the death eaters and his whole life is ruined, etc. Hariezer is happy that Draco has “awakened as a scientist” (I hadn’t complained about the stilted language in a while, just reminding you that it’s still there), but Draco lashes out, casts a torture spell, and locks Hariezer in the dungeon. After some failed escape attempts, he once again resorts to the time turner, because even now that it’s locked down, it’s the solution to every problem.

One other thing of note- to investigate the hypothesis that really strong spells can’t be cast anymore, Hariezer tries to look up a strong spell and runs into “the Interdict of Merlin,” the rule that strong spells can’t be written down, only passed from wizard to wizard.

It’s looking marginally possible that it will turn out that this natural secrecy is exactly what’s killing off powerful magic- it’s not open, so ideas aren’t flourishing or being passed on. Hariezer will notice that and realize his “Bayesian Conspiracy” won’t be as effective as an open science culture, and I’ll have to take back all of my criticisms around secretive science (it will be a lesson Hariezer learns, and not an idea Hariezer endorses). It seems more likely, given the author’s existential risk concerns, however, that this Interdict of Merlin will be endorsed.

Some more notes regarding HPMOR

There is a line in the movie Clueless (if you aren’t familiar, Clueless was an older generation’s Mean Girls) where a woman is described as a “Monet”- in that, like the painting, she looks good from afar but up close is a mess.

So I’m now nearly 25 chapters into this thing, and I’m starting to think that HPMOR is this sort of Monet- if you let yourself get carried along, it seems okay enough. It references a lot of things that a niche group of people, myself included, like (physics! computational complexity! genetics! psychology!). But as you stare at it more, you start noticing that it doesn’t actually hang together; it’s a complete mess.

The hard science references are subtly wrong, and often aren’t actually explained in-story (just a jargon dump to say ‘look, here is a thing you like’).

The social science stuff fares a bit better (it’s less wrong ::rimshot::), but even when its explanation is correct, its power is wildly exaggerated- conversations between Quirrell/Malfoy/Potter seem to follow scripts of the form

"Here is an awesome manipulation I’m using against you"

"My, that is an effective manipulation. You are a dangerous man"

"I know, but I also know that you are only flattering me as an attempt to manipulate me." p "My, what an effective use of Bayesian evidence that is!"

Other characters get even worse treatment, either behaving nonsensically to prove how good Harry is at manipulation (as in the chapter where Harry tells off Snape and then tries to blackmail the school because Snape asked him questions he didn’t know the answer to), OR acting nonsensically so Harry can explain why it’s nonsensical (“Carry this rock around for no reason.” “That’s actually the fallacy of privileging the hypothesis.”) The social science/manipulation/marketing psychology stuff is just a flavoring for conversations.

No important event in the story has hinged on any of this rationality- instead basically every conflict thus far is resolved via the time turner.

And if you strip all this out- all the wrongish science jargon and the conversations that serve no purpose but to prove Malfoy/Quirrell/Harry are “awesome” by having them repeatedly think/tell each other how awesome they are- the story has no real structure. It’s just a series of poorly paced (if you strip out the “awesome” conversations, there are many chapters where nothing happens), disconnected events. There is no there there.

HPMOR 24: evopsych Rorschach test

Evolutionary psychology is a field that famously has a pretty poor bullshit filter. Satoshi Kanazawa once published a series of articles claiming that beautiful people will have more female children (because beauty is more important for girls) and that engineers/mathematicians will have more male children (because only men need the logic-brains). The only thing his papers proved was that he is bad at statistics (in fact, Kanazawa made an entire career out of being bad at statistics; such is the state of evo-psych).

One of the core criticisms is that for any fact observed in the world, you can tell several different evolutionary stories, and there is no real way to tell which, if any, is actually true. Because of this, when someone gives you an evopsych explanation for something, it’s often telling you more about what they believe than about science or the world (there are exceptions, but they are rare).

So this chapter is a long, pretty much useless conversation between Draco and Hariezer about how they are manipulating each other and Dumbledore or whatever, but smack in the middle we get this rumination:

In the beginning, before people had quite understood how evolution worked, they’d gone around thinking crazy ideas like human intelligence evolved so that we could invent better tools.

The reason why this was crazy was that only one person in the tribe had to invent a tool, and then everyone else would use it…the person who invented something didn’t have much of a fitness advantage, didn’t have all that many more children than everyone else. [SU comment- could the inventor of an invention perhaps get to occupy a position of power within a tribe? Could that lead to them having more wealth and children?]

It was a natural guess… A natural guess, but wrong.

Before people had quite understood how evolution worked, they’d gone around thinking crazy ideas like the climate changed, and tribes had to migrate, and people had to become smarter in order to solve all the novel problems.

But human beings had four times the brain size of a chimpanzee. 20% of a human’s metabolic energy went into feeding the brain. Humans were ridiculously smarter than any other species. That sort of thing didn’t happen because the environment stepped up the difficulty of its problems a little…. [SU challenge to the reader- save this climate change evolutionary argument with an ad-hoc justification]

Ending up with that gigantic outsized brain must have taken some sort of runaway evolutionary process…And today’s scientists had a pretty good guess at what that runaway evolutionary process had been….

[It was] Millions of years of hominids trying to outwit each other - an evolutionary arms race without limit - [that] had led to… increased mental capacity.

What does his preferred explanation for the origin of intelligence (people evolved to outwit each other) say about the author?

HPMOR 24/25/26: mangled narratives

This chapter is going to be entirely about the way the story is being told in this section of chapters. There is a big meatball of a terrible idea, but I’m getting sick of that low hanging fruit, so I’ll only mention it briefly in passing.

I’m a sucker for stories about con artists. In these stories, there is a tradition of breaking with the typical chronological order of storytelling- instead they show the end result of the grand plan first, followed by all the planning that went into it (or some variant of that). That way, the audience gets to experience the climax first from the perspective of the mark, and then from the perspective of the clever grifters. Yudkowsky himself successfully employs this pattern in the first chapter with the time turner.

In this chapter, however, this pattern is badly mangled. The chapter is setting up an elaborate prank on Rita Skeeter (Draco warned Hariezer that Rita was asking questions during one of many long conversations), but jumbling the narrative accomplishes literally nothing.

Here are the events, in the order laid out in the narrative

  1. Hariezer tells Draco he didn’t tell on him about the torture, and borrows some money from him

  2. (this is the terrible idea meatball) Using literally the exact same logic that Intelligent Design proponents use (and doing exactly 0 experiments), Hariezer decides while thinking over breakfast:

Some intelligent engineer, then, had created the Source of Magic, and told it to pay attention to a particular DNA marker.

The obvious next thought was that this had something to do with “Atlantis”.

  3. Hariezer meets with Dumbledore, and refuses to tell on Draco, saying getting tortured is all part of his manipulation game.

  4. Fred and George Weasley meet with a mysterious man named Flume and tell him the-boy-who-lived needs the mysterious man’s help. There is a Rita Skeeter story mentioned that says Quirrell is secretly a death eater and is training Hariezer to be the next dark lord, a story Flume says was planted by the elder Malfoy.

  5. Quirrell tells Rita Skeeter he has no dark mark; Rita ignores him.

  6. Hariezer hires Fred and George (presumably with Malfoy’s money) to perpetrate a prank on Rita Skeeter- to convince her of something totally false.

  7. Hariezer has lunch with Quirrell, reads a newspaper story with the headline

HARRY POTTER SECRETLY BETROTHED TO GINEVRA WEASLEY

This is apparently the story the Weasleys planted (the prank), and apparently there was a lot of supporting evidence or something, because Quirrell is incredulous that it could be done. And then Quirrell, after speculating that Rita Skeeter could be capable of turning into a small animal, crushes a beetle.

So what’s the problem with this narrative order? First, there is absolutely no payoff to jumbling the chronology. The prank is left until the end, and it’s exactly what we expected- a false story was planted in the newspaper. It doesn’t even seem like that big a deal- just a standard gossip column story (of course, Harry and Quirrell react like it’s a huge, impossible-to-have-done prank, to be sure the reader knows it’s hard).

Second, most of the scenes are redundant; they contain no new information whatsoever and they are therefore boring- the event covered in 3 (talking with Dumbledore) is covered in full in 1 (telling Malfoy he didn’t tell on him to Dumbledore). The events of 6 (Hariezer hiring the Weasleys to prank for him) are completely covered in 4 (when the Weasleys go to Flume, they tell him it’s for Hariezer). This chapter is twice as long as it should be, for no reason.

Third, the actual prank is never shown from either the mark’s or the grifters’ perspective. It happens entirely off-stage, so to speak. We don’t see Rita Skeeter encountering all this amazing evidence about Hariezer’s betrothal and writing up her career-making article. We don’t see Fred and George’s elaborate plan (although if I were a wizard and wanted to plant a false newspaper story, I’d just plant a false memory in a reporter).

What would have been more interesting, the actual con happening off-stage, or the long conversations about nothing that happen in these chapters? These chapters are just an utter failure. The narrative decisions are nonsensical, and everything continues to be tell, tell, tell, never show.

Also of note- Quirrell gives Hariezer Roger Bacon’s diary of magic, because of course that’s a thing that exists.

HPMOR 27: answers from a psychologist

In order to answer last night’s science question, I spent today slaving on the streets, polling professionals for answers (i.e. I sent one email to an old college roommate who did a doctorate in experimental brain stuff). This will basically be a guest post.

Here is the response:

The first thing you need to know: this is called “the simulation theory of empathy.” Now that you have a magic Google phrase, you can look up everything you’d want, or read on, my (not so) young padawan.

You are correct that no one knows how empathy works- it’s too damn complicated- but what we can look at is motor control, and in motor control the smoking gun for simulation is mirror neurons. Rizzolatti and collaborators discovered that certain neurons in the macaque monkey’s inferior frontal gyrus related to the motor vocabulary activate not only when the monkey does a gesture, but also when it sees someone else doing that same gesture. So maybe, says Rizzolatti, the same neurons responsible for action are also responsible for understanding action (action-understanding). This would be big support for simulation explanations of understanding others. Unfortunately, it’s not the only explanation for action-understanding- it could be a simple priming effect, and there are other areas of the macaque brain (in particular the superior temporal sulcus) that aren’t involved in action but do appear to have some role in action-understanding.

It is not an overstatement to say that this discovery of mirror neurons caused the entire field to lose its collective shit. For some reason, motor explanations for brain phenomena are incredibly appealing to large portions of the field, and always have been. James Woods (the old dead behaviorist, not the awesome actor) had a theory that thought itself was related to the motor neurons that control speech. It’s just something that the entire field is primed to lose their shit over. Some philosophers of mind made all sorts of sweeping pronouncements (“mirror neurons are responsible for the great leap forward in human evolution”- pretty sure that’s a direct quote).

The problem is that the gold standard for monkey tests is to see what a lesion in that portion of the brain does. Near as anyone can tell, lesions in F5 (the portion of the inferior frontal gyrus where the mirror neurons are) do not impair action-understanding.

The next, bigger problem for theories of human behavior is that there is no solid evidence of mirror neurons in humans. A bunch of fMRI studies showed a bit of activity in one region, and then meta-studies suggested not that region, maybe some other region, etc. fMRI studies are tricky- Google “dead salmon fMRI.”

But even if mirror neurons are involved in humans, there is really strong evidence they can’t be involved in action-understanding. The mirror proponents suggest speech is a strong trigger for the proposed mirror neurons. But in the speech system, we’ve known since Paul Broca (really old French guy) that lesions can destroy your ability to speak without destroying your ability to understand speech. This is a huge problem for models that link action-understanding to action: killing those neurons should destroy both.

Also, the suggested human mirror neurons do not fire in response to pantomimed actions. And in autism spectrum disorders, action-understanding is often impaired with no impairment to action.

So in summary, the simulation theory of empathy got a big resurgence after mirror neurons, but there is decently strong empirical evidence against a mirror-only theory of action-understanding in humans. That doesn’t mean mirror neurons have no role to play (though if they aren’t found in humans, it does mean they have no role to play), it just means that the brain is complicated. I think the statement you quoted to me would have been something you could read from a philosopher of mind in the late 80s or early 90s, but not something anyone involved in experiments would say. By the mid 2000s, a lot of that enthusiasm had petered out a bit. Then I left the field.

So on this particular bit of science, it looks like Yudkowsky isn’t wrong- he is just presenting conjecture and hypothesis as settled science. Still, I learned something here; I’d never encountered this idea before. I’ll have an actual post about chapter 27 tomorrow.

HPMOR 27: mostly retreads

This is another chapter where most of the action is stuff that has happened before, we are getting more and more retreads.

The new bit is that Hariezer is learning to defend himself from mental attacks. The goal, apparently, is to perfectly simulate someone other than yourself, so that the mind reader learns the wrong things. This leads into the full-throated endorsement of the simulation theory of empathy that was discussed by a professional in my earlier post. Credit where credit is due- this was an idea I’d never encountered before, and I do think HPMOR is good for some of that- if you don’t trust the presentation and google ideas as they come up, you could learn quite a bit.

We also find out Snape is a perfect mind-reader. This is an odd choice- in the original books, Snape’s ability to block mind-reading was something of a metaphor for his character- you can’t know if you can trust him because he is so hard to read, his inscrutableness even fooled the greatest dark wizard ever, etc. It was, fundamentally, his cryptic dodginess that helped the cause, but it also fomented the distrust that some of the characters in the story felt toward him.

Now for the retreads-

More pointless bitching about quidditch. Nothing was said here that wasn’t said in the earlier bitching about quidditch.

Snape enlists Hariezer's help to fight anti-slytherin bullies (for no real reason, near as I can tell), and the bullies are fought once more with con-artist style cleverness (much like in the earlier chapter with the time turner and invisibility cloak. In this chapter, it's just with an invisibility cloak).

Snape rewards Hariezer's rescue with a conversation about Hariezer's parents, during which Hariezer decides his mother was shallow, which upsets Snape. It's an odd moment, but the odd moments in HPMOR dialogue have piled up so high it's almost not worth mentioning.

And the chapter culminates when the bullied slytherin tells Hariezer about Azkaban, pleading with Hariezer to save his parents. Of course, Hariezer can’t (something tells me he will in the near future), and we get this:

"Yeah," said the Boy-Who-Lived, "that pretty much nails it. Every time someone cries out in prayer and I can’t answer, I feel guilty about not being God."

The solution, obviously, was to hurry up and become God.

So another retread- Hariezer is once more making clear his motives aren't curiosity, they are power. This was true after chapter 10, and it's still true now.

This is the only real action for several chapters now; unfortunately, all the action feels like it's already happened before in other chapters.

HPMOR 28: hacking science/map-territory

Finally we get back to some attempts to do magi-science, but it's again deeply frustrating. It's more transfiguration- the only magic we have thus far explored, and it leads to a discussion of map vs. territory distinctions that is horribly mangled.

At the opening of the chapter, instead of using science to explore magic, the new approach is to treat magic as a way to hack science itself. To that end, Hariezer tries (and fails) to transfigure something into "cure for Alzheimer's," and then tries (successfully) to transfigure a rope of carbon nanotubes. I guess the thought here is he can then give these things to scientists to study? It's unclear, really.

Frustrated with how useless this seems, Hermione makes this odd complaint:

"Anyway," Hermione said. Her voice shook. "I don’t want to keep doing this. I don’t believe children can do things that grownups can’t, that’s only in stories."

Poor Hermione- that's the feeblest of objections, especially in a story where every character acts like they are in their late teens or twenties. It's almost as if the author was looking for some knee-jerk complaint you could throw out that everyone would write off as silly on its face.

So Hariezer decides he needs to do something adults can't, to appease Hermione. To do this, he decides to attack the known constraints on magic, starting with the idea that you can only transfigure a whole object, and not part of an object (a constraint I think was introduced just for this chapter?).

So Hariezer reasons: things are made out of atoms. There isn't REALLY a whole object there, so why can't I do part of an object? This prompted me to wonder- if you do transform part of an object, what happens at the interface? Does this whole-object constraint have something to do with the interface? I mentioned in the chapter 15 section that magicking a large gold molecule into water could cause steric mismatches (just volume constraints really) with huge energy differences, hence explosions. What happens at the micro level when you take some uniform crystalline solid and try to patch on some organic material like rubber at some boundary? If you deform the (now rubbery) material, what happens when it changes back and the crystal spacing is now messed up? Could you partially transform something if you carefully worked out the interface?

It will not surprise someone who has read this far that none of these questions are asked or answered.

Instead, Hariezer thinks really hard about how atoms are real; in the process we get ruminations on the map and the territory:

But that was all in the map, the true territory wasn’t like that, reality itself had only a single level of organization, the quarks, it was a unified low-level process obeying mathematically simple rules.

This seems innocuous enough, but a fundamental mistake is being made here. For better or for worse, physics is limited in what it can tell you about the territory; it can just provide you with more accurate maps. Often it provides you with multiple, equivalent maps for the same situation with no way to choose between them.

For instance, quarks (and gluons) have this weird property- they are well-defined excitations at very high energies, but not at all well-defined at low energies, where bound states become the fundamental excitations. There is no such thing as a free quark at low energy. For some problems, the quark map is useful; for many, many more problems the meson/hadron (proton, neutron, kaon, etc.) map is much more useful. The same theory at a different energy scale provides a radically different map (renormalization is a bitch, and a weak coupling becomes strong).

Continuing in this vein, he still can't transform only part of an object, so he keeps trying different maps, making the same map/territory confusion, culminating in:

If he wanted power, he had to abandon his humanity, and force his thoughts to conform to the true math of quantum mechanics.

There were no particles, there were just clouds of amplitude in a multiparticle configuration space and what his brain fondly imagined to be an eraser was nothing except a gigantic factor in a wavefunction that happened to factorize,

(Side note: for Hariezer it's all about power, not about curiosity, as I've said dozens of times now. Also, I know as much physics as anyone, and I don't think I've abandoned my humanity.)

This is another example of the same problem I'm getting at above. There is no "true math of quantum mechanics." In non-relativistic, textbook quantum mechanics, I can formulate one version of quantum mechanics on 3 space dimensions and 1 time dimension, and calculate things via path integrals. I can also build a large configuration space (Hilbert space) with 3 space and 3 momentum dimensions per particle (and one overall time dimension), and calculate things via operators on that space. These are different mathematical formulations, over different spaces, that are completely equivalent. Neither map is more appropriate than the other. Hariezer arbitrarily thinks of configuration space as the RIGHT one.
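
To be concrete about what "completely equivalent" means here (this is a standard textbook identity, not anything from HPMOR): the amplitude for a particle to go from $x_i$ to $x_f$ in time $T$ is the same object whether you compute it with operators on a Hilbert space or with a path integral,

$$\langle x_f \,|\, e^{-i\hat{H}T/\hbar} \,|\, x_i \rangle \;=\; \int_{x(0)=x_i}^{x(T)=x_f} \mathcal{D}x \; e^{iS[x]/\hbar}.$$

Same predictions, different spaces, different machinery; neither side of the equals sign is "the territory."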

This isn't unique to quantum mechanics; most theories have several radically different formulations. Good old Newtonian mechanics has a formulation on the exact same configuration space Hariezer is thinking of.

The big point here is that the same theory has different mathematical formulations. We don't know which is "the territory"; we just have a bunch of different, but equivalent, maps. Each map has its own strong suits, and it's not clear that any one of them is the best way to think about all problems. Is quantum mechanics 3+1 dimensional (3 space, 1 time) or is it 6N+1 (3 space and 3 momentum dimensions per particle, plus 1 time dimension)? It's both and neither (more appropriately, it's just not a question that physics can answer for us).

What Hariezer is doing here isn't separating the map and the territory; it's reifying one particular map (configuration space)!

(Less important: I also find it amusing, in a physics-elitist sort of way (sorry for the condescension), that Yudkowsky picks non-relativistic quantum mechanics as the final, ultimate reality. Instead of describing or even mentioning quantum field theory, which is the most low-level theory we (we being science) know of, Yudkowsky picks non-relativistic quantum mechanics, the most low-level theory HE knows.)

Anyway, despite obviously reifying a map, in-story it must be the “right” map, because suddenly he manages to transform part of an object, although he tells Hermione

“Quantum mechanics wasn’t enough,” Harry said. “I had to go all the way down to timeless physics before it took.”

So this is more bad pedagogy: timeless physics isn't even a map, it's the idea of a map. No one has made a decent formulation of quantum mechanics without a specified time direction (technical aside: it's very hard to impose unitarity sensibly if you are trying to make time emerge from your theory, instead of being inbuilt). It's pretty far away from mainstream theory attempts, but here it's presented as the ultimate idea in physics. It seems very odd to just toss in a somewhat obscure idea as the pinnacle of physics.

Anyway, Hariezer shows Dumbledore and McGonagall his newfound ability to transfigure part of an object, and the chapter ends.

HPMOR 29: not much here

Someone called Yudkowsky out on the questionable decision to include his pet theories as established science, so chapter 29 opens with this (why didn’t he stick this disclaimer on the chapters where the mistakes were made?):

Science disclaimers: Luosha points out that the theory of empathy in Ch. 27 (you use your own brain to simulate others) isn’t quite a known scientific fact. The evidence so far points in that direction, but we haven’t analyzed the brain circuitry and proven it. Similarly, timeless formulations of quantum mechanics (alluded to in Ch. 28) are so elegant that I’d be shocked to find the final theory had time in it, but they’re not established yet either.

He is still wrong about timeless formulations of quantum mechanics, though; they aren't more elegant, they don't exist.

The rest of this chapter seems like it's just setup for something coming later- Hariezer, Draco and Hermione are all named as heads of Quirrell's armies and are all trying to manipulate each other. Some complaints from Hermione that broomstick riding is jock-like and stupid, old hat by now.

There is, however, one exceptionally strange bit- apparently in this version of the world, the core plot of Prisoner of Azkaban (Scabbers the rat was really Peter Pettigrew) was just a delusion that a schizophrenic Weasley brother had. Just a stupid swipe at the original book for no real reason.

HPMOR 30-31: credit where credit is due

So credit where credit is due -- these two chapters are pretty decent. We finally get some action in a chapter, there is only one bit of wrongish science, and the overall moral of the episode is a good one.

In these chapters, three teams, "armies" led by Draco, Hariezer and Hermione, compete in a mock-battle, Quirrell's version of a team sport. The action is more or less competently written (despite things like having Neville yell "special attack"), and it's more-or-less fun and quick to read. It feels a bit like a lighter-hearted version of the beginning competitions of Ender's game (which no doubt inspired these chapters).

The overall "point" of the chapters is even pretty valuable- Hermione, who is written off as an idiot by both Draco and Hariezer, splits her army and has half attack Draco and half attack Hariezer. She is seemingly wiped out almost instantly. Draco and Hariezer then fight each other nearly to death, and out pops Hermione's army- turns out they only faked defeat. Hermione wins, and we learn that unlike Draco and Hariezer, Hermione delegated and collaborated with the rest of her team to develop strategies to win the fight. There is a (very unexpected given the tone of everything thus far) lesson about teamwork and collaboration here.

That said -- I still have nits to pick. Hariezer’s army is organized in quite possibly the dumbest possible way:

Harry had divided the army into 6 squads of 4 soldiers each, each squad commanded by a Squad Suggester. All troops were under strict orders to disobey any orders they were given if it seemed like a good idea at the time, including that one… unless Harry or the Squad Suggester prefixed the order with “Merlin says”, in which case you were supposed to actually obey.

This might seem like a good idea, but anyone who has played team sports can testify- there is a reason that you work out plays in advance, and generally have delineated roles. I assume the military has a chain of command for similar reasons, though I have never been a soldier. I was hoping to see this idea for a creatively-disorganized army bite Hariezer, but it never does. There seems to be no confusion at all over orders, etc. Basically, none of what you'd expect would happen from telling an army "do what you want, disobey all orders" happens.

And it wouldn't be HPMOR without potentially bad social science; here is today's reference:

There was a legendary episode in social psychology called the Robbers Cave experiment. It had been set up in the bewildered aftermath of World War II, with the intent of investigating the causes and remedies of conflicts between groups. The scientists had set up a summer camp for 22 boys from 22 different schools, selecting them to all be from stable middle-class families. The first phase of the experiment had been intended to investigate what it took to start a conflict between groups. The 22 boys had been divided into two groups of 11 -

  • and this had been quite sufficient.

The hostility had started from the moment the two groups had become aware of each others’ existences in the state park, insults being hurled on the first meeting. They’d named themselves the Eagles and the Rattlers (they hadn’t needed names for themselves when they thought they were the only ones in the park) and had proceeded to develop contrasting group stereotypes, the Rattlers thinking of themselves as rough-and-tough and swearing heavily, the Eagles correspondingly deciding to think of themselves as upright-and-proper.

The other part of the experiment had been testing how to resolve group conflicts. Bringing the boys together to watch fireworks hadn’t worked at all. They’d just shouted at each other and stayed apart. What had worked was warning them that there might be vandals in the park, and the two groups needing to work together to solve a failure of the park’s water system. A common task, a common enemy.

Now, I readily admit to not having read the original Robbers Cave book, but I do have two textbooks that reference it, and Yudkowsky gets the overall shape of the study right, but fails to mention some important details. (If my books are wrong and Yudkowsky is right, which seems highly unlikely given his track record, please let me know.)

Both descriptions I have suggest the experiment had three stages, not two. The first stage was to build up the in-groups, the second stage was to introduce them to each other and build conflict, and the third stage was to try and resolve the conflict. In particular, this aside from Yudkowsky originally struck me as surprisingly insightful:

They’d named themselves the Eagles and the Rattlers (they hadn’t needed names for themselves when they thought they were the only ones in the park)

Unfortunately, it's simply not true- during phase 1 the researchers asked the groups to come up with names for themselves, and let the social norms for the groups develop on their own. The "in-group" behavior developed before they met their rival groups.

While tensions existed from first meeting, real conflicts didn’t develop until the two groups competed in teams for valuable prizes.

This stuff matters- Yudkowsky paints a picture of humans dividing so easily into tribes that simply setting two groups of boys loose in the same park will cause trouble. In reality, taking two groups of boys, encouraging them to develop group habits, group names, and group customs, and then setting the groups to compete directly for scarce prizes (while researchers encourage the growth of conflicts) will cause conflicts. This isn't just a subtlety.

HPMOR 32: interlude

Chapter 32 is just a brief interlude, nothing here really, just felt the need to put this in for completeness.

HPMOR 33: it worked so well the first time, might as well try it again

This chapter has left me incredibly frustrated. After a decent chapter, we get a terrible retread of the same thing. For me, this chapter failed so hard that I'm actually feeling sort of dejected; it undid any goodwill the previous battle chapter had built up.

This section of chapters is basically a retread of the dueling armies just a brief section back. Unfortunately, this second battle section completely flubs a lot of the things that worked pretty well in the first battle section. There is a lot to talk about here that I think failed, so this might be long.

There is an obvious huge pacing problem here. The first battle game happens just a brief interlude before the second battle game. Instead of spreading this game out over the course of the Hogwarts school year (or at least putting a few of the other classroom episodes in between), these just get slammed together. First battle, one interlude, last battle. That means that a lot of the evolution of the game over time, how people are reacting to it, etc. is left as a tell rather than a show. A lot of this chapter is spent dealing with big changes to Hogwarts that have been developing as students get super-involved in this battle game, but we never see any of that.

Imagine if Ender’s game (a book fresh on my mind because of the incredibly specific references in this chapter) were structured so that you get the first battle game, and then a flash-forward to his final battle against the aliens, with Ender explaining all the strategy he learned over the rest of that year. This chapter is about as effective as that last Ender’s game battle would be.

The chapter opens with Dumbledore and McGonagall worried about the school-

Students were wearing armbands with insignia of fire or smile or upraised hand, and hexing each other in the corridors.

Loyalty to armies over house or school is tearing the school apart!

But then we turn to the army generals- apparently the new rules of the game allowed soldiers in armies to turn traitor, and it's caused the whole game to spiral out of control- Draco complains:

You can’t possibly do any real plots with all this stuff going on. Last battle, one of my soldiers faked his own suicide.

Hermione agrees: everyone is losing control of their armies because of all the traitors.

"But.. wait…" I can hear you asking, "how can that make sense? Loyalty to the armies is so absolute people are hexing each other in the corridors? But at the same time, almost all the students in the armies are turning traitor and plotting against their generals? Both of those can’t be true?" I agree, you smart reader you, both of these things don’t work together. NOT ONLY IS YUDKOWSKY TELLING INSTEAD OF SHOWING, WE ARE BEING TOLD CONTRADICTORY THINGS. Yudkowsky wanted to be able to follow through on the Robber’s Cave idea he developed earlier, but he also needed all these traitors for his plot, so he tried to run in both directions at once.

That's not the only problem with this chapter (it wouldn't be HPMOR without misapplied science/math concepts)- it turns out Hermione is winning, so the only way for Draco and Hariezer to try to catch up is to temporarily team up, which leads to a long explanation where Hariezer explains the prisoner's dilemma and Yudkowsky's pet decision theory.

Here is the big problem- In the classic prisoner’s dilemma:

If my partner cooperates, I can either:

-cooperate, in which case I spend a short time in jail, and my partner spends a short time in jail

-defect, in which case I spend no time in jail, and my partner serves a long time in jail

If my partner defects, I can either:

-cooperate, in which case I spend a very long time in jail, and my partner goes free

-defect, in which case I spend a long (but not as long) time in jail, as does my partner.

The key insight of the prisoner's dilemma is that no matter what my partner does, defecting improves my situation. This makes defection a dominant strategy, so everyone defects, even though the both-defect outcome is worse than the both-cooperate outcome.

In the situation between Draco and Hariezer:

If Draco cooperates, Hariezer can either:

-cooperate, in which case Hariezer and Draco both have a shot at getting first or second

-defect, in which case Hariezer is guaranteed second and Draco is guaranteed third place

If Draco defects, Hariezer can either

-cooperate, in which case Hariezer is guaranteed third, and Draco gets second.

-defect, in which case Hariezer and Draco are fighting it out for second and third.

Can you see the difference here? If Draco is expected to cooperate, Hariezer has no incentive to defect- both cooperating is STRICTLY BETTER than the situation where Hariezer defects against Draco. This is not at all a prisoner's dilemma; it's just cooperating against a bigger threat. All the pontificating about decision theories that Hariezer does is just wasted breath, because no one is in a prisoner's dilemma.
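
To make the difference concrete, here is a minimal sketch in Python (the payoff numbers are illustrative stand-ins I picked, not anything from the chapter) that checks which of the two games actually has a dominant strategy:

```python
# Check whether "defect" is a (weakly) dominant strategy by comparing my
# payoff against every alternative move, for each thing the other player
# might do. Higher payoff = better outcome.

def dominant_strategy(payoffs):
    """payoffs[(my_move, their_move)] = my payoff.
    Return a (weakly) dominant move if one exists, else None."""
    moves = {m for m, _ in payoffs}
    for mine in moves:
        alternatives = moves - {mine}
        if all(payoffs[(mine, theirs)] >= payoffs[(alt, theirs)]
               for theirs in moves for alt in alternatives):
            return mine
    return None

# Classic prisoner's dilemma: defecting is better no matter what my partner does.
classic_pd = {
    ("cooperate", "cooperate"): 2, ("cooperate", "defect"): 0,
    ("defect",    "cooperate"): 3, ("defect",    "defect"): 1,
}

# The Draco/Hariezer game as laid out above: mutual cooperation (a shot at
# first or second) beats defecting against a cooperator (a guaranteed second),
# so defecting is no longer dominant.
battle_game = {
    ("cooperate", "cooperate"): 3, ("cooperate", "defect"): 0,
    ("defect",    "cooperate"): 2, ("defect",    "defect"): 1,
}

print(dominant_strategy(classic_pd))   # -> defect
print(dominant_strategy(battle_game))  # -> None
```

The classic game reports a dominant strategy (defect); the battle game reports none, which is the whole point: the decision-theory lecture is aimed at a game that isn't a prisoner's dilemma.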

So we get a pointless digression about a non-prisoner's dilemma (seriously, this is getting absurd and frustrating- I'm hard pressed to find a single science reference in this whole thing that's unambiguously applied correctly).

After these preliminaries, the battle begins. Unlike the lighthearted, winking reference to Ender's game of the previous chapter, Yudkowsky feels the need to make it totally explicit- they fight in the lake, so that Hariezer can use exactly the stuff he learned from Ender's game to give him an edge. It turns the light homage of the last battle into just the setup for the beat-you-over-the-head reference this time. There is a benefit to subtlety, and to assuming your reader isn't an idiot.

Anyway, during the battle, everyone betrays everyone and the overall competition ends in a tie.

HPMOR 34-35: aftermath of the war game

These chapters contain a lot of speechifying, but in this case it actually fits, as a resolution to the battle game. It's expected and isn't overly long.

The language, as throughout, is still horribly stilted, but I think I'm getting used to it (when Hariezer referred to Hermione as "General of Sunshine" I almost went right past it without a mental complaint). Basically, I'm likely to stop complaining about the stilted language, but it's still there, it's always there.

Angry at the ridiculousness of the traitors, Hermione and Draco insist that if Hariezer uses traitors in his army, they will team up and destroy him. He insists he will keep using them.

Next, Quirrell gives a long speech about how much chaos the traitors were able to create, and makes an analogy to the death eaters. He insists that the only way to guard against such is essentially fascism.

Hariezer then speaks up, and says that you can do just as much damage in the hunt for traitors as traitors can do themselves, and stands up for a more open society. The themes of these speeches can be found in probably hundreds of books, but they work well enough here.

Every army leader gets a wish; Draco and Hermione decide to wish for their houses to win the house cup. In an attempt to demonstrate his argument for truth, justice and the American way, Hariezer wishes for quidditch to no longer contain the snitch. I guess nothing will rally students around an open society like one person fucking with the sport they love.

We also find out that Dumbledore helped the tie happen by aiding in the plotting (the plot was "too complicated" for any student, according to Quirrell, so it must have been Dumbledore- apparently, 'betray everyone to keep the score close' is a genius master plan), but we are also introduced to a mysterious cloaked stranger who was also involved but wiped all memory of his passing.

These are ok chapters, as HPMOR chapters go.

Hariezer's Arrogance

In response to some things that kai-skai has said, I started thinking about how we should view Hariezer's arrogance. Should we view it as a character flaw? Something Hariezer will grow and overcome? I don't think it's being presented that way.

My problems with the arrogance are several:

-the author intends for Hariezer to be a teacher. He is supposed to be the master rationalist that the reader (and other characters) learn from. His arrogance makes that off-putting. If you aren't familiar at all with the topics Hariezer happens to be discussing, you are being condescended to along with the characters in the story (although if you know the material you get to condescend to the simpletons along with Hariezer). It's just a bad pedagogical choice. You don't teach people by putting them on the defensive.

-The arrogance is not presented by the author as a character flaw. In the story, it's not a flaw to overcome, it's part of what makes him "awesome." His arrogance has not harmed him, and he hasn't felt the need to revisit it. When he thinks he knows better than everyone else, the story invariably proves him right. He hasn't grown or been presented with a reason to grow. I would bet a great deal of money that Hariezer ends HPMOR exactly the same arrogant twerp he starts as.

-This last one is a bit of a personal reaction. Hariezer gets a lot of science wrong (I think all of it is wrong, actually, up to where I am now), and is incredibly arrogant while doing so. I’ve taught a number of classes at the college level, and I’ve had a lot of confidently, arrogantly wrong students. Hariezer’s attitude and lack of knowledge repeatedly remind me of the worst students I ever had- smart kids too arrogant to learn (and these were physics classes, where wrong or right is totally objective).

HPMOR 36/37: Christmas break/not much happens

Like all the Harry Potter books, HPMOR includes a Christmas break. I note that a Christmas break would make a lot of sense toward the middle of the book, not less than a third of the way through. Like the original books, this is just a light bit of relaxation.

Not a lot happens over break. Hariezer is a twerp who thinks his parents don't respect him enough; they go to Hermione's house for Christmas; Hariezer yells at Hermione's parents for not respecting her intelligence enough; the parents say Hermione and Hariezer are like an old married couple (it would have been nice to see the little bonding moments in the earlier chapters). Quirrell visits Hariezer on Christmas Eve.

HPMOR 38: a cryptic conversation, not much here

This whole chapter is just a conversation between Malfoy and Hariezer. It fits squarely into the “one party doesn’t really know what the conversation is about” mold, with Hariezer being the ignorant party. Malfoy is convinced Hariezer is working with someone other than Quirrell or Dumbledore.

HPMOR 39: your transhumanism is showing

This was a rough chapter, in which Hariezer and Dumbledore primarily have an argument about death. Hariezer takes up the transhumanist position. If you aren't familiar with the transhumanist position on death, it's basically that death is bad (duh!) and that the world is full of deathists who have convinced themselves that death is good. This usually leads into the idea that some technology will save us from death (nanotech, SENS, etc.), and even if those don't pan out, we can all just freeze our corpses to be reanimated when that whole death thing gets solved. I find this position somewhat childish, as I'll try to get to.

So, as a word of advice to future transhumanist authors who want to write literary screeds arguing against the evil deathists: FANTASY LITERATURE IS A UNIQUELY BAD CHOICE FOR ARGUING YOUR POINT. To be fair, Yudkowsky noticed this and lampshaded it: when Hariezer says there is no afterlife, Dumbledore argues back with:

“How can you not believe it?” said the Headmaster, looking completely flabbergasted. “Harry, you’re a wizard! You’ve seen ghosts!” …“And if not ghosts, then what of the Veil? What of the Resurrection Stone?”

i.e. how can you not believe in an afterlife when there is a literal gateway to the fucking afterlife sitting in the Ministry of Magic basement. Hariezer attempts to argue his way out of this; we get this story, for instance:

You know, when I got here, when I got off the train from King’s Cross…I wasn’t expecting ghosts. So when I saw them, Headmaster, I did something really dumb. I jumped to conclusions. I, I thought there was an afterlife… I thought I could meet my parents who died for me, and tell them that I’d heard about their sacrifice and that I’d begun to call them my mother and father -

"And then… asked Hermione and she said that they were just afterimages… And I should have known! I should have known without even having to ask! I shouldn’t have believed it even for all of thirty seconds!… And that was when I knew that my parents were really dead and gone forever and ever, that there wasn’t anything left of them, that I’d never get a chance to meet them and, and, and the other children thought I was crying because I was scared of ghosts

So, first point- this could have been a pretty powerful moment if Yudkowsky had actually structured the story to relate this WHEN HARIEZER FIRST MET A GHOST. Instead, the first we hear of it is this speech. Again, tell, tell, tell, never show.

Second point- what exactly does Hariezer assume is being “afterimaged?” Clearly some sort of personality, something not physical is surviving in the wizarding world after death. If fighting death is this important to Hariezer, why hasn’t he even attempted to study ghosts yet? (full disclosure, I am an atheist personally. However, if I lived in a world WITH ACTUAL MAGIC, LITERAL GHOSTS, a stone that resurrects the dead, and a FUCKING GATEWAY TO THE AFTERLIFE I might revisit that position).

Here is Hariezer’s response to the gateway to the afterlife:

“That doesn’t even sound like an interesting fraud,” Harry said, his voice calmer now that there was nothing there to make him hope, or make him angry for having hopes dashed. “Someone built a stone archway, made a little black rippling surface between it that Vanished anything it touched, and enchanted it to whisper to people and hypnotize them.”

Do you see how incurious Hariezer is? If someone told me there was a LITERAL GATEWAY TO THE AFTERLIFE, I'd want to see it. I'd want to test it. Can we try to record and amplify the whispers? Are things being said?

Why do they think it's a gateway to the afterlife? Who built it? Minimally, this could have led to a chapter where Hariezer debunks wizarding spiritualists like a wizard-world Houdini. (Houdini spent a great deal of his time exposing mediums and psychics who 'contacted the dead' as frauds.) I'm pretty sure I would have even enjoyed a chapter like that.

In the context of the wizarding world, there are all sorts of non-trivial pieces of evidence for an afterlife that simply don't exist in the real world. It's just a bad choice to present these ideas in the context of this story.

Anyway, ignoring what a bad choice it is to argue against an afterlife in the context of fantasy fiction, let's move on:

Dumbledore presents some dumb arguments so that Hariezer can seem wise. Hariezer tells us death is the most frightening thing imaginable, it's not good, etc. Basically, death is scary and no one should have to die; if we had all the time imaginable, we would actually use it. Pretty standard stuff, and Dumbledore drops the ball on presenting any real arguments.

So I'll take up Dumbledore's side of the argument. I have some bad news for Hariezer's philosophy. You are going to die. I'm going to die. Everyone is going to die. It sucks, and it's unfortunate, sure, but there is no way around it. It's not a choice! We aren't CHOOSING death. Even if medicine can replace your body (which doesn't seem likely in my lifetime), the sun will burn out someday. Even if we get away from the solar system, eventually we'll run out of free energy in the universe.

But you do have one choice regarding death- you can accept that you'll die someday, or you can convince yourself there is some way out. Convince yourself that if you say the right prayers, or, in the Less Wrong case, work on the right decision theory to power an AI, you'll get to live forever. Convince yourself that if you sign a life insurance policy over to the amateur biologists who run cryonics organizations, you'll be reanimated.

The problem with the second choice is that there is an opportunity cost- time spent praying or working on silly decision theories is time that you aren't doing things that might matter to other humans. We accept death to be more productive in life. Stories about accepting death aren't saying death is good; they are saying death is inevitable.

Edit: I take back a bit about cognitive dissonance that was here.

HPMOR 40: short follow up to 39

Instead of Dumbledore’s views, in this chapter we get Quirrell’s view of death. He agrees with Hariezer, unsurprisingly.

HPMOR 41: another round of Quirrell's battle game

It seems odd that AFTER the culminating scene, the award being handed out, and the big fascism vs. freedom speechifying, we have yet another round of Quirrell's battle game.

Draco and Hermione are now working together against Hariezer. Through a series of circumstances, Draco has to drop Hermione off a roof to win.

Edit: I also point out that we don't actually get details of the battle this time; it opens with:

Only a single soldier now stood between them and Harry, a Slytherin boy named Samuel Clamons, whose hand was clenched white around his wand, held upward to sustain his Prismatic Wall.

We then get a narrator summary of the battle that had led up to that moment. Again, tell, tell, tell, never show.

HPMOR 42: is there a point to this?

Basically an extraneous chapter, but one strange detail at the end.

So in this chapter, Hariezer is worried that it's his fault that, in the battle last chapter, Hermione got dropped off a roof. Hermione agrees to forgive him as long as he lets Draco drop him off the same roof.

He takes a potion to help him fall slowly and is dropped, but so many young girls try to summon him to their arms (yes, this IS what happens) that he ends up falling anyway; luckily, Remus Lupin is there to catch him.

Afterwards, Remus and Hariezer talk. Hariezer learns that his father was something of a bully. And, for some reason, that Peter Pettigrew and Sirius Black were lovers. Does anyone know what the point of making Pettigrew and Black lovers would be?

Conversations

My girlfriend: “What have you been working on over there?”

Me: “Uhhhh… so…. there is this horrible Harry Potter fan fiction… you know, when people on the internet write more stories about Harry Potter? Yea, that. Anyway, this one is pretty terrible so I thought I’d read it and complain about it on the internet…. So I’m listening to me say this out loud and it sounds ridiculous, but.. well, it IS ridiculous… but…”

HPMOR 43-46: Subtle Metaphors

These chapters actually moved pretty decently. When Yudkowsky isn’t writing dialogue, his prose style can actually be pretty workman-like. Nothing that would get you to stop and marvel at the word play, but it keeps the pace brisk and moving.

Now, in JK Rowling's original books, it always seemed to me that the dementors were a (not-so-subtle) nod to depression. They leave people wallowing in their worst memories, low energy, unable to remember the happy thoughts, etc.

In HPMOR, however, Hariezer (after initially failing to summon a patronus) decides that the dementors really represent death. You see, in HPMOR, instead of reliving their saddest, most depressing memories, the characters just see a bunch of rotting corpses when the dementors get near.

This does, of course, introduce new questions. What does it mean that the dementors guard Azkaban? Why don't the prisoners instantly die? Why doesn't a dementor attack just flat-out kill you?

Anyway, apparently the way to kill death is to just imagine that someday humans will defeat death, in appropriately Carl Sagan-esque language:

The Earth, blazing blue and white with reflected sunlight as it hung in space, amid the black void and the brilliant points of light. It belonged there, within that image, because it was what gave everything else its meaning. The Earth was what made the stars significant, made them more than uncontrolled fusion reactions, because it was Earth that would someday colonize the galaxy, and fulfill the promise of the night sky.

Would they still be plagued by Dementors, the children’s children’s children, the distant descendants of humankind as they strode from star to star? No. Of course not. The Dementors were only little nuisances, paling into nothingness in the light of that promise; not unkillable, not invincible, not even close.

Once you know this, your patronus becomes a human, and kills the dementor. Get it? THE PATRONUS IS HUMANS (represented in this case by a human) and THE DEMENTOR IS DEATH. Humans defeat death. Very subtle.

Another large block of chapters with no science.

HPMOR 47: Racism is Bad

Nothing really objectionable here, just more conversations and plotting.

Hariezer spends much of this chapter explaining to Draco that racism is bad, and that a lot of pure bloods probably hate mudbloods because it gives them a chance to feel superior. Hariezer suggests these racist ideas are poisoning slytherin.

We also find out that Draco and his father seem to believe that Dumbledore burned Draco’s mother alive. This is clearly a departure from the original books. Hariezer agrees to take as an enemy whoever killed Draco’s mother. Feels like it’ll end up being more plots-within-plots stuff.

Another chapter with no science explored. We do find out Hariezer speaks snake language.

HPMOR 48: Utilitarianism

This chapter is actually solid as far as these things go. After learning he can talk to snakes, Hariezer begins to wonder if all animals are sentient; after all, snakes can talk. This has obvious implications for meat eating.

From there, he begins to wonder if plants might be sentient, in which case he wouldn’t be able to eat anything at all. This leads him to the library for research.

He also introduces scope insensitivity and utilitarianism, even though neither is really required at all to explain his point to Hermione. Hermione asks why he is freaking out, and instead of answering "I don't want to eat anything that thinks and talks," he says stuff like

"Look, it’s a question of multiplication, okay? There’s a lot of plants in the world, if they’renot sentient then they’re not important, but if plants are people then they’ve got more moral weight than all the human beings in the world put together. Now, of course your brain doesn’t realize that on an intuitive level, but that’s because the brain can’t multiply. Like if you ask three separate groups of Canadian households how much they’ll pay to save two thousand, twenty thousand, or two hundred thousand birds from dying in oil ponds, the three groups will respectively state that they’re willing to pay seventy-eight, eighty-eight, and eighty dollars. No difference, in other words. It’s called scope insensitivity.

Is that really the best way to describe his thinking? Why say something in 10 words when several hundred will do? What does scope insensitivity have to do with the idea "I don't want to eat things that talk and think"?

Everything below here is unrelated to HPMOR and has more to do with scope insensitivity as a concept:

Now, because I have taught undergraduates intro physics, I do wonder (and have wondered in the past): is Kahneman's scope insensitivity related to the general innumeracy of most people? I.e., how many people who hear that question just mentally replace literally any number with "a big number"?

The first time I taught undergraduates I was surprised to learn that most of the students had no ability to judge if their answers seemed plausible. I began adding a question “does this answer seem order of magnitude correct?” I’d also take off more points for answers that were the wrong order of magnitude, unless the student put a note saying something like “I know this is way too big, but I can’t find my mistake.”

You could ask a question about a guy throwing a football, and answers would range from 1 meter/second all the way to 5000 meters/second. You could ask a question about how far someone can hit a baseball, and answers would similarly range from a few meters to a few kilometers. No one would notice when answers were wildly wrong. Lest someone think this is a units problem (Americans aren't used to metric units), even if I forced them to convert to miles per hour, miles, or feet, students couldn't figure out if the numbers were the right order of magnitude.

So I began to give a few short talks on what I thought of as basic numeracy. Create mental yardsticks (the distance from your apartment to campus might be around a few miles, the distance between this shitty college town and the nearest actual city might be around a few hundred miles, etc.). When you encounter unfamiliar problems, try to relate them back to familiar ones. Scale the parameters in equations so you have dimensionless quantities times yardsticks you understand. And after being explicitly taught, most of the students got better at understanding the size of numbers.
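
Here's a minimal sketch, in Python, of the kind of sanity check I mean (the yardstick values are my own rough numbers, nothing rigorous): compare an answer against a familiar reference and flag anything that's off by more than about an order of magnitude.

```python
# Flag answers that are more than ~one order of magnitude away from a
# familiar reference speed. The yardstick values are rough, illustrative
# numbers, not precise measurements.

import math

YARDSTICKS_M_PER_S = {
    "walking person": 1.5,
    "thrown football": 20.0,   # a strong throw is a few tens of m/s
    "highway car": 30.0,
    "passenger jet": 250.0,
}

def plausible(answer_m_per_s, yardstick, tolerance_decades=1.0):
    """True if the answer is within ~tolerance_decades orders of magnitude
    of the chosen yardstick."""
    ref = YARDSTICKS_M_PER_S[yardstick]
    return abs(math.log10(answer_m_per_s / ref)) <= tolerance_decades

print(plausible(25, "thrown football"))    # True: right ballpark
print(plausible(5000, "thrown football"))  # False: off by more than 10x
```

The point isn't the code, it's the habit: always compare an unfamiliar number against a yardstick you already understand.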

Since I began working in the business world, I've noticed that most people never develop that skill. Stick a number in a sentence and people just mentally run right over it; you might as well have inserted some Klingon phrases. Some of the better actuaries do have some nice numerical intuition, but a surprising number don't. They can calculate, but they don't understand what the calculations are really telling them, like Searle's Chinese room but with numbers.

In Kahneman's scope neglect questions, there are big problems with innumeracy- if you ask people how much they'd spend on X, where X is any charity that seems importantish, you are likely to get an answer of around $100. In some sense, it is scope neglect; in another sense, you just max out people's generosity/spending cash really quickly.

If you rephrase it to "how much should the government spend," you hit general innumeracy problems, and you also hit general innumeracy problems when you specify large, specific numbers of birds.

I suspect Kahneman would have gotten different results had he asked varying questions such as: "what percentage of the federal government's wildlife budget should be spent preventing disease for birds in your city?" vs. "what percentage of the federal government's wildlife budget should be spent preventing disease for birds in your state?" vs. "what percentage of the federal government's wildlife budget should be spent preventing disease for birds in the whole country?" (I actually ran this experiment on a convenience sample of students in a 300-level physics class several years ago and got 5%, 8%, and 10% respectively, but the differences weren't significant, though the trend was suggestive.)

I suspect the problem isn’t that “brains can’t multiply” so much as “most people are never taught how to think about numbers.”

If anyone knows of further literature on this, feel free to pass it my way.

HPMOR 49: not much here

I thought I posted something about this last weekend; I think tumblr ate it. So this will be particularly light. Hariezer notices that Quirrell knows too much (phrased as "his priors are too good") but hasn't yet put together who Quirrell really is.

There is also (credit where credit is due) a clever working in of the second book into Yudkowsky's world. The "interdict of Merlin" Yudkowsky invented prevents wizards from writing spells down, so Slytherin's basilisk was placed in Hogwarts to pass spells on to "the heir of Slytherin." Voldemort learned those secrets and then killed the basilisk, so Hariezer has no shortcut to powerful spells.

HPMOR 50: In Need of an Editor

So this is basically a complete rehash again- it fits into the “Hariezer uses the time turner and the invisibility cloak to solve bullying” mold we’ve already seen a few times. The time turner + invisibility cloak is the solution to all problems, and when Yudkowsky needs a conflict, he throws in bullying. I think we’ve seen this exact conflict with this exact solution at least three other times.

In this chapter, it's Hermione being bullied; he protects her by creating an alibi with his time turner, dressing in an invisibility cloak, and whispering some wisdom in the bully's ear. Because most bullies just need the wisdom of an eleven-year-old whispered into their ear.

HPMOR 51-63: Economy of Language

So this block of chapters is roughly the length of The Maltese Falcon or the first Harry Potter book, probably 2/3 the length of The Hobbit. This one relatively straightforward episode of the story is the length of an entire novel. Basically, the ratio of things-happening to words-written is terrible.

This block of chapters amounts to a prison break- Quirrell tells Hariezer that Bellatrix Black was innocent, so they are going to break her out. It's a weird section, given how Black escaped in the original novels (i.e. the dementors sided with the dark lord, so all he had to do was go to the dementors and say "you are on my side, please let out Bellatrix Black, and everyone else while you are at it.")

The plan is to have Hariezer use his patronus while Quirrell travels in snake form in his pouch. They'll replace Bellatrix with a corpse, so everyone will just think she is dead. It becomes incredibly clear upon meeting Bellatrix that she wasn't "innocent" at all, though she might be not guilty in the by-reason-of-insanity sense.

This doesn't faze Hariezer; they just keep moving forward with the plan, which goes awry pretty quickly when an auror stumbles on them. Quirrell tries to kill the auror. Hariezer tries to block the killing spell and ends up knocking out Quirrell and turning his patronus off, and the plan goes to hell.

To escape, Hariezer first scares the dementors off by threatening to blast them with his uber-patronus (even death is apparently scared of death in this story). Then Quirrell wakes up, and with Quirrell’s help he transfigures a hole in the wall, and transfigures a rocket which he straps to his broomstick, and out they fly. The rocket goes so fast the aurors can’t keep up.

It's a decent bit of action in a story desperately needing a bit of action, but it's marred by excessive verbosity. We have huge expanses of Hariezer talking with Quirrell, Hariezer talking to himself, Hariezer thinking about dementors, etc. Instead of a tense, taut 50 pages, we get a turgid 300.

After they get to safety, Quirrell and Hariezer discuss the horror that is Azkaban. Quirrell tells Hariezer that only a democracy could produce such a torturous prison. A dark lord like Voldemort would have no use for it once he got bored:

You know, Mr. Potter, if He-Who-Must-Not-Be-Named had come to rule over magical Britain, and built such a place as Azkaban, he would have built it because he enjoyed seeing his enemies suffer. And if instead he began to find their suffering distasteful, why, he would order Azkaban torn down the next day.

Hariezer doesn't take up the pro-democracy side, and only time will tell if he goes full-on reactionary like Quirrell by the end of our story. By the end, Hariezer is ruminating on the Milgram experiment, although I don't think it's really applicable to the horror of Azkaban (it's not like the dementors are "just following orders"- they live to kill).

Hariezer then uses his time turner to go back to right before the prison breakout, the perfect alibi for the perfect crime.

Dumbledore and McGonagall suspect Hariezer played a part in the escape, because of the use of the rocket. They ask Hariezer to use his time turner to send a message back in time (which he wouldn't be able to do if he had already used his turner to hide his crime).

Hariezer solves this through the time-turner-ex-machina of Quirrell knowing someone else with a time turner, because when Yudkowsky can’t solve a problem with a time turner, he solves it with two time turners.

HPMOR 64/65: respite

Chapter 64 is again “omake” so I didn’t read it.

Chapter 65 appears to be a pit-stop before another long block of chapters. Hariezer is chafing that he has been confined to Hogwarts in order to protect him from the dark lord, so he and Quirrell are thinking of hiring a play-actor to pretend to be Voldemort, so that Quirrell can vanquish him.

These were a brief respite between the huge 12-chapter block I just got through and another giant 12-chapter block. It's looking like the science ideas are slowing down in these long chapter blocks, as the focus shifts to action. youzicha has suggested a lot of the rest will be Hariezer cleverly "hacking" his way out of situations, like the rocket in the previous 12-chapter block. The sweet spot for me has been discussing the science presented in these chapters, so between the expected lack of science and the increasing length of chapter blocks, expect slower updates.

HPMOR 66-77: absolutely, appallingly awful

There is a general problem with fanfiction (although usually not in serial fiction, where things tend to stay a bit more focused for whatever reason), where the side/B-plots are written entirely in one pass instead of intertwined alongside the main plot. Instead of being a pleasant diversion, the side-plot piles up in one big chunk. This is one such side-plot.

Also worth noting: these chapters combine basically everything I dislike about HPMOR into one book-length bundle of horror. It was honest-to-god work to continue to power through this section. So this will be just a sketch of this awful block of chapters.

We open with another superfluous round of the army game, in which nothing notable really happens other than some character named Daphne challenging Neville to "a most ancient duel," WHICH IS APPARENTLY A BATTLE WITH LIGHTSABERS. My eyes rolled so hard I almost had a migraine, and this was the first chapter of the block.

After the battle, Hermione becomes concerned that women are underrepresented among heroes of the wizarding world, and starts a "Society for the Promotion of Heroic Equality for Witches," or SPHEW. They start with a protest in front of Dumbledore's office and then decide to heroine it up and put an end to bullying. You see, in the HPMOR world, bullying isn't a question of social dynamics, or ostracizing kids. Bullying is coordinated ambushes of kids in hallways by groups of older kids, and an opportunity for "leveling-up." The way to fight bullies in this strange world is to engage in pitched wizard-battles in the hallways (having fought an actual bully in reality as a middle schooler, I can tell you that at least for me "fight back" doesn't really solve the problem in any way). In this world, the victims of the bullying are barely mentioned and don't have names.

And of course, the authority figures like McGonagall don’t even really show up during all of this. Students are constantly attacking each other in the hallways and no one is doing anything about it. Because the way to make your characters seem “rational” is to make sure the entire world is insane.

Things quickly escalate until 44 bullies get together to ambush the eight girls in SPHEW. A back-of-the-envelope calculation suggests Hogwarts has maybe 300 students. So we are to expect nearly 15% of the student population to be the sort of "get together and plot an ambush" bullies that maybe you find in 90s high school TV shows. Luckily, Hariezer had asked Quirrell to protect the girls, so a disguised Quirrell takes down the 44 bullies.

We get a "lesson" (lesson in this context means 'series of insanely terrible ideas') on "heroic responsibility" in the form of Hariezer lecturing to Hermione.

The boy didn’t blink. “You could call it heroic responsibility, maybe,” Harry Potter said. “Not like the usual sort. It means that whatever happens, no matter what, it’s always your fault… Following the school rules isn’t an excuse, someone else being in charge isn’t an excuse, even trying your best isn’t an excuse. There just aren’t any excuses, you’ve got to get the job done no matter what.”… Being a heroine means your job isn’t finished until you’ve done whatever it takes to protect the other girls, permanently.”

You know a good way to solve bullying? Expel the bullies. You know who has the power to do that? McGonagall and Dumbledore. A school is a system and has procedures in place to deal with problems. The proper response is almost always "tell an authority figure you trust." Being "rational" is knowing when to trust the system to do its job.

In this case, Yudkowsky hasn’t even pulled his usual trick of writing the system as failing- no one even attempts to tell an authority figure about the bullying and no authority figure engages with it, besides Quirrell who engages by disguising himself and attacking students, and Snape who secretly (unknown even to SPHEW) directs SPHEW to where the bullies will be. The system of school discipline stops existing for this entire series of chapters.

We get a final denouement between Hariezer and Dumbledore where the bullying situation is discussed by reference to Gandhi's passive resistance in India, WW2 and Churchill, and the larger wizarding war, which feels largely overwrought because it was bullying. Big speeches about how Hermione has been put in danger, etc., ring empty because it was bullying. Yes, being bullied is traumatic (sometimes life-long traumatic), but it's not WORLD WAR traumatic.

I also can't help but note the irony that the block's action largely started with Hermione's attempt to "self-actualize" by behaving more heroically, and ends with Dumbledore and Hariezer discussing whether it was the right thing to let Hermione play her silly little game.

Terrible things in HPMOR

  • Lack of science: I have no wrong science to complain about, because these chapters have no science references at all, really.
  • The world/characters behave in silly ways as a foil for the protagonists: the authority figures don't do anything to prevent the escalating bullying/conflict, aside from Snape and Quirrell, who get actively involved. The bullying itself isn't an actual social dynamic, it's just generic "conflict" to throw at the characters.
  • Time turner/invisibility cloak solves all problems: in a slight twist, Snape gives a time turner to a random student and uses her to pass messages to SPHEW so they can find and attack bullies.
  • Superfluous retreads of previous chapters: the army battle that starts it off, and much of the bullying itself, is a retread. There are several separate bully-fights in this block of chapters.
  • Horrible pacing: this whole block of chapters is a B-plot roughly the length of an entire book.
  • Stilted language: everyone refers to the magic lightsabers as "the most ancient blade" every time they reference it.

Munchkinism

I've been meaning to make a post like this for several weeks, since yxoque reminded me of the idea of the munchkin. jadagul mentioned me in a post today that reminded me I had never made it. Anyway:

I grew up playing Dungeons and Dragons, which was always an extremely fun way to waste a middle school afternoon. The beauty of Dungeons and Dragons is that it provides structure for a group of kids to sit around and tell a shared story as a group. The rules of the game are flexible, and one of the players acts as a living rule-interpreter to guide the action and keep the story flowing.

Somehow, every Dungeons and Dragons community I've ever been part of (middle school, high school and college) had the same word for a particularly common failure mode of the game, and that word was munchkin, or munchkining (does anyone know if there was a gaming magazine that used this phrase?). The failure is simple - people get wrapped up in the letter of the rules, instead of the spirit, and start building the most powerful character possible instead of a character that makes sense as a role. Instead of story flow, the game gets bogged down in dice rolls and checks so that the munchkins can demonstrate how powerful they are. Particularly egregious munchkins have been known to cheat on their character creation rolls to boost all their abilities. With one particular group in high school, I witnessed one particularly hot-headed munchkin yell at everyone else playing the game when the dungeon master (the human rule interpreter) slightly modified a rule and ended up weakening the munchkin's character.

The frustrating thing about HPMOR is that Hariezer is designed, as yxoque pointed out, to be a munchkin- using science to exploit the rules of the magical world (which could be an interesting question), but because Yudkowsky is writing the rules of magic as he goes, Hariezer is essentially cheating at a game he is making up on the fly.

All of the cleverness isn't really cleverness- it's easy to find loopholes in the rules you yourself create as you go, especially if you created them to have giant loopholes.

In Azkaban, Hariezer uses science to escape by transfiguring himself a rocket. This only makes sense because for some unknown reason magic brooms aren’t as fast as rockets.

In one of his army games, Hariezer uses gloves with gecko setae to climb down a wall, because for some reason broomsticks aren’t allowed. For some reason, there is no ‘grip a wall’ spell.

Yudkowsky isn't bound by the handful of constraints in Rowling's world (where Dementors represent depression, not death) - hell, he doesn't even stick to his own constraints. In Hariezer's escape from Azkaban he violates literally the only constraint he had laid down (don't transfigure objects into something you plan to burn).

Every other problem in the story is solved by using the time turner as a deus ex machina. Even when plot constraints mean Hariezer’s time turner can’t be used, Yudkowsky just introduces another time turner rather than come up with a novel and clever solution for his characters.

Hariezer’s plans in HPMOR work only because the other characters become temporarily dumb to accommodate his “rationality” and because the magic is written around the idea of him succeeding.

"Genre savvy"

So a lot of people have asked me to take a look at the Yudkowsky writing guide, and I will eventually (first I have to finish HPMOR, which is taking forever because I'm incredibly bored with it, but I HAVE MADE A COMMITMENT - hopefully more HPMOR live blogging after Thanksgiving).

But I did hit something that also applies to HPMOR, and a lot of other stories. Yudkowsky advocated that characters "have read the books you've read" so they can solve those problems. One of my anonymous askers used the phrase "genre savvy" for this, and Google led me to the TV Tropes page. The problem with this idea is that as soon as you insert a genre savvy character, your themes shift, much like having a character break the fourth wall. Suddenly your story is about stories. Your story is now a commentary on the genre/genre conventions.

Now, there are places where this can work fairly well - those Scream movies, for instance, were supposed to be (at least in part) ABOUT horror movies as much as they WERE horror movies. Similarly, every fan-fiction is (on some level) a commentary on the original works, so "genre savvy" fan fiction self-inserts aren't nearly as bad an idea as they could be.

HOWEVER (and this is really important) - MOST STORIES SHOULD NOT BE ABOUT STORIES IN THE ABSTRACT/GENRE/GENRE CONVENTIONS, and this means it is a terrible idea to have characters that constantly approach things on a meta level ("this is like in this fiction book I read"). If you don't have anything interesting to say about the actual genre conventions, then adding a genre savvy character is almost certainly going to do you more harm than good. If you are bored with a genre convention, you'll almost certainly get more leverage out of subverting it (if you lead both the character AND the reader to expect a zig, and instead they get a zag, it can liven things up a bit) than by sticking in a genre-savvy character.

Sticking in a genre-savvy character just says "look at this silly convention!" and then, when that convention is used anyway, it just feels like the writer being a lazy hipster. Sure, your reader might get a brief burst of smugness ("he/she's right, all those genre books ARE stupid! Look how smart I am!"), but you aren't really moving your story forward. You are critiquing lazy conventions while also trying to use them.

If you don't like the conventions of a genre, don't write in that genre, or subvert them to make things more interesting. Or simply refuse to use those conventions altogether and go your own way.

HPMOR 78: action without consequences

A change of tactics - this chapter is part of another block of chapters, but I'm having trouble getting through it, so I'm going to write in installments, chapter by chapter, instead of doing one dump on a 12-chapter block again.

This chapter is another installment of Quirrell’s battle game. This time, the parents are in the stands, which becomes important when Hermione out-magics Draco.

Afterwards, Draco is upset because his father saw him getting out-magicked by a mudblood. This causes Draco, in an effort to save face or get revenge or something, to send a note to lure Hermione to meet him alone. Then, cut to the next morning - Hermione is arrested for the attempted murder of Draco. So that's it for the chapter summary.

But I want to use this chapter to touch on something that has bothered me about this story - most of the action is totally without stakes or consequences for the characters. As readers, we don't care what happens. In the case of the Quirrell battle game, the prize for victory was already handed out at the Christmas break, none of the characters have anything on the line, and the story doesn't really act like winning or losing has real consequences for anyone involved. A lot is happening, but it's ultimately boring.

The same thing happened in the anti-bullying chapters. Most of the characters being victimized lack names or personalities. Hermione and team aren’t defending characters we care about and like, they are fighting the abstract concept of bullying (and the same is largely true of Hariezer’s forays into fighting bullies.)

Part of this is because of the obvious homage to Ender's Game, without understanding that Ender's Game was doing something very different - the whole point of Ender's Game is that the series of games absolutely do feel low stakes. Even when Ender kills another kid, it's largely shrugged off as Ender continuing to win (which is the first sign something a bit deeper is happening). It's supposed to feel game-y so the reader rides along with Ender and doesn't viscerally notice the genocide happening. The contrast between the real world stakes and the games being played is the point of the story. Where Ender's Game failed for me is after the battles - we don't feel Ender's horror at learning what happened. Sure, Ender becomes speaker for the dead, but the book doesn't make us feel Ender's horror the same way we ride along with the game stuff. I think this is why so many people I know largely missed the point of the book and walked away with "War games are awesome!" (SCROLL DOWN FOR Fight Club FOOTNOTE THAT WAS MAKING THIS PARAGRAPH TOO LONG) But I digress - if your theme isn't something to do with the connection between war and games and the way people perceive violence vs games, etc., turning down the emotional stakes and the consequences for the characters makes your story feel like reading a video game play-by-play, which is horribly boring.

If you cut out all the Quirrell game chapters after chapter 35, no one would notice - there is nothing at stake.

ALSO - this chapter has an example of what I'll call "DM munchkining," i.e. it's easy to munchkin when you write the rules. Hariezer is looking for powerful magic to aid him in battle, and starts reading up on potion making. He needs a way to make potions in the woods without magical ingredients, so he deduces by reading books that you don't really need a magical ingredient - you get out of a potion ingredient what went into making it. So Hariezer makes a potion with acorns that gets back all the light that went into creating the acorn via photosynthesis. My point here is that this rule was created in this chapter entirely to be exploited by Hariezer in this battle. In a previous battle chapter, Hariezer exploits the fact that metal armor can block spells, a rule created specifically for that chapter to be exploited. It's not munchkining, it's Calvinball.

FOOTNOTE: This same problem happens with Fight Club. The tone of the movie builds up Tyler Durden as this awesome dude, and the tone doesn't shift when Ed Norton's narrator character starts to realize how fucked everything is. So you end up with this movie that's supposed to be satirical but no one notices. They rebel against a society they find dehumanizing BY CREATING A SOCIETY WHERE THEY LITERALLY HAVE NO NAMES, but the tone is strong enough that people are like "GO PROJECT MAYHEM! WE SHOULD START A FIGHT CLUB!"

HPMOR 79

This chapter continues on from 78. Hermione has been arrested for the attempted murder of Draco, but Hariezer now realizes in a sudden insight that she has been given a false memory.

Hariezer also realizes this is how the Weasley twins planted Rita Skeeter's false news story - they simply memory charmed Rita. Of course, this opens up more questions than it solves - if false memory charming can be done with such precision, wouldn't there be a rash of manipulations of this type? It's such an obvious manipulation technique that chapters 24-26, with the Fred and George "caper," were written in a weirdly non-linear style to try to make it seem more mysterious.

Anyway, Hariezer tells the adults, who start investigating who might have memory charmed Hermione (you'd think the wizard police would do some sort of investigation, but it's HPMOR, so the world needs to be maximally silly as a foil to Hariezer).

And then he has a discussion with the other kids, who are bad-mouthing Hermione:

Professor Quirrell isn’t here to explain to me how stupid people are, but I bet this time I can get it on my own. People do something dumb and get caught and are given Veritaserum. Not romantic master criminals, because they wouldn’t get caught, they would have learned Occlumency. Sad, pathetic, incompetent criminals get caught, and confess under Veritaserum, and they’re desperate to stay out of Azkaban so they say they were False-Memory-Charmed. Right? So your brain, by sheer Pavlovian association, links the idea of False Memory Charms to pathetic criminals with unbelievable excuses. You don’t have to consider the specific details, your brain just pattern-matches the hypothesis into a bucket of things you don’t believe, and you’re done. Just like my father thought that magical hypotheses could never be believed, because he’d heard so many stupid people talking about magic. Believing a hypothesis that involves False Memory Charms is low-status.

This sort of thing bothers the hell out of me. Not only is cloying elitism creeping in, but in HPMOR as in the real world, arguments regarding "status" are just thinly disguised ad hominems. True or not true, they aren't really attacking an argument, just the people making them.

After all, if we fall back on the "Bayesian conspiracy," confessing to a crime/having a memory of a crime is equally strong evidence for having done the crime and for having been false memory charmed, so all the action here is in the prior. CLAIMING a false memory charm is evidence of nothing at all.

So, if the base rate of false memory charms is so low that it's laughable and "low status," then the students are correctly using Bayesian reasoning.

Hariezer might point out that they aren't taking into account evidence about what sort of person Hermione is, but if the base rate of false memory charms is really so low, that is unlikely to matter much - after all, Hariezer doesn't have any specific positive evidence she was false memory charmed, and she has been behaving strangely toward Draco for a while (which Hariezer suggests is a symptom of the way the perpetrator went about the false memory charm, but it could just as easily be evidence she did it - the action is still in the prior).

Similarly, his father didn't believe in magic because it SHOULDN'T have been believed - until the story begins he has supposedly lived his whole life in our world, where magic is quite obviously not a real thing, regardless of "status."

OF COURSE- if the world were written as a non-silly place, the base rate for false memory charms would be through the roof and everyone would say “yea, she was probably false memory charmed! Who just blurts out a confession?” and the wizard cops would just do their job.
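To make the "all the action is in the prior" point concrete, here's a minimal sketch in Python (my own toy numbers, not anything from the fic): when a confession is about equally likely whether she's guilty or was charmed, the posterior just echoes whatever base rate you assume.

```python
# Toy illustration (hypothetical numbers): with a likelihood ratio near 1,
# Bayes' rule leaves the posterior sitting wherever the prior/base rate was.

def posterior_guilty(prior_guilty, p_confess_given_guilty, p_confess_given_charmed):
    prior_charmed = 1 - prior_guilty
    numerator = p_confess_given_guilty * prior_guilty
    denominator = numerator + p_confess_given_charmed * prior_charmed
    return numerator / denominator

# Silly world: false memory charms are vanishingly rare, so the confession "confirms" guilt.
print(posterior_guilty(prior_guilty=0.999,
                       p_confess_given_guilty=0.9,
                       p_confess_given_charmed=0.9))  # ~0.999

# Non-silly world: charms are common, so the same confession barely moves anything.
print(posterior_guilty(prior_guilty=0.2,
                       p_confess_given_guilty=0.9,
                       p_confess_given_charmed=0.9))  # ~0.2
```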

Remember when

Way back in chapter 20-something, Quirrell gave Hariezer Roger Bacon's magic diary, and it was going to jump-start his investigation of the rules of magic? And then it was literally never mentioned again? The aptly named Chekhov's Roger Bacon's Magi-science Diary probably applies here.

HPMOR 80

Apparently, in the wizarding world, the way a trial is conducted involves a bunch of politicians voting on whether someone is guilty or innocent, so in this chapter the elder Malfoy uses his influence to convict Hermione. Not much to this chapter, really.

BUT in some asides, we do get some flirting with neoreaction:

At that podium stands an old man, with care-lined face and a silver beard that stretches down below his waist; this is Albus Percival Wulfric Brian Dumbledore… Karen Dutton bequeathed the Line to Albus Dumbledore… each wizard passing it to their chosen successor, back and back in unbroken chain to the day Merlin laid down his life. That (if you were wondering) is how the country of magical Britain managed to elect Cornelius Fudge for its Minister, and yet end up with Albus Dumbledore for its Chief Warlock. Not by law (for written law can be rewritten) but by most ancient tradition, the Wizengamot does not choose who shall preside over its follies. Since the day of Merlin’s sacrifice, the most important duty of any Chief Warlock has been to exercise the highest caution in their choice of people who are both good and able to discern good successors.

And we get the PC/NPC distinction used by Hariezer to separate himself from the sheeple:

The wealthy elites of magical Britain have collective force, but not individual agency; their goals are too alien and trivial for them to have personal roles in the tale. As of now, this present time, the boy neither likes nor dislikes the plum-colored robes, because his brain does not assign them enough agenthood to be the subjects of moral judgment. He is a PC, and they are wallpaper.

Hermione is convicted and Hariezer is sad he couldn’t figure out something to do about it (he did try to threaten the elder Malfoy to no avail).

HPMOR 81

Our last chapter ended with Hermione in peril- she was found guilty of the attempted murder of Draco! How will Hariezer get around this one?

Luckily, the way the wizarding world justice system works is fucking insane - being found guilty puts Hermione in the Malfoys' "blood debt." So Hariezer tells Malfoy:

By the debt owed from House Malfoy to House Potter!…I’m surprised you’ve forgotten…surely it was a cruel and painful period of your life, laboring under the Imperius curse of He-Who-Must-Not-Be-Named, until you were freed of it by the efforts of House Potter. By my mother, Lily Potter, who died for it, and by my father, James Potter, who died for it, and by me, of course.

So Hariezer wants the blood debt transferred to him so he can decide Hermione’s fate (what a convenient and ridiculous way to handle a system of law and order).

But blood debts don't transfer in this stupid world; instead, you also have to pay money. So Malfoy demands something like twice the money in Hariezer's vault. Hariezer waffles a bit, but decides to pay. Because the demand is such a large sum, this will involve going into debt to the Malfoys.

And then things get really stupid- Dumbledore says, as guardian of Hariezer’s vault he won’t let the transaction happen.

I’m - sorry, Harry - but this choice is not yours - for I am still the guardian of your vault.”

“What? ” said Harry, too shocked to compose his reply.

"I cannot let you go into debt to Lucius Malfoy, Harry! I cannot! You do not know - you do not realize -"

So… here is a question - if Hariezer is going to go into a lot of debt to pay Malfoy, how does blocking his access to his money help avoid the debt? Wouldn't Hariezer just take out a bigger loan from Malfoy?

Anyway, despite super rationality, Hariezer doesn’t think through how stupid Dumbledore’s threat is. Hariezer instead threatens to destroy Azkaban if Dumbledore won’t let him pay Malfoy, so Dumbledore relents.

Malfoy tries to weasel out of this nebulous blood debt arrangement because the rules of wizard justice change on the fly, but Hermione swears allegiance to House Potter and that prevents Malfoy’s weasel attempt.

I acknowledge the debt, but the law does not strictly oblige me to accept it in cancellation,” said Lord Malfoy with a grim smile. “The girl is no part of House Potter; the debt I owe House Potter is no debt to her…

And Hermione, without waiting for any further instructions, said, the words spilling out of her in a rush, “I swear service to the House of Potter, to obey its Master or Mistress, and stand at their right hand, and fight at their command, and follow where they go, until the day I die.”

The implications here are obvious - if you saved all of magical Britain from a dark lord, and literally everyone owes you a "blood debt," you are totally above the law. Hariezer should just steal the money he owes Malfoy from some other magical families.

HPMOR 82

So the trial is wrapped up, but to finish off the section we get a long discussion between Dumbledore and Hariezer.

First, credit where credit is due: there is an atypical subversion here - now it's Dumbledore attempting to give a rationality lesson to Hariezer, and Hariezer agrees that he is right. It's an attempt to mix up the formula a bit, and I appreciate it even if the rest of this chapter is profoundly stupid.

So what is the rationality lesson here?

“Yes," Harry said, "I flinched away from the pain of losing all the money in my vault. But I did it! That’s what counts! And you -” The indignation that had faltered out of Harry’s voice returned. “You actually put a price on Hermione Granger’s life, and you put it below a hundred thousand Galleons!”

"Oh?" the old wizard said softly. "And what price do you put on her life, then? A million Galleons?"

"Are you familiar with the economic concept of ‘replacement value’?" The words were spilling from Harry’s lips almost faster than he could consider them. "Hermione’s replacement value is infinite! There’s nowhere I can go to buy another one!”

Now you’re just talking mathematical nonsense, said Slytherin. Ravenclaw, back me up here?

"Is Minerva’s life also of infinite worth?" the old wizard said harshly. "Would you sacrifice Minerva to save Hermione?"

"Yes and yes," Harry snapped. "That’s part of Professor McGonagall’s job and she knows it."

"Then Minerva’s value is not infinite," said the old wizard, "for all that she is loved. There can only be one king upon a chessboard, Harry Potter, only one piece that you will sacrifice any other piece to save. And Hermione Granger is not that piece. Make no mistake, Harry Potter, this day you may well have lost your war."

Basically, the lesson is this - you have to be willing to put a value on human life, even if it seems profane. It's actually a good lesson and very important to learn. If everyone were more familiar with this, the semi-frequent GOVERNMENT HEALTHCARE IS DEATH PANELS panic would never happen. Although I'd add a caveat - anyone who has worked in healthcare does this so often that we start to make a mistake the other way (forgetting that underneath the numbers are actual people).

Anyway, to justify the rationality lesson further, we get a reference to some of Tetlock's work (note: I'm unfamiliar with the work cited here, so I'm taking Yudkowsky at his word - if you've read the rest of my HPMOR stuff, you know this is dangerous).

You’d already read about Philip Tetlock’s experiments on people asked to trade off a sacred value against a secular one, like a hospital administrator who has to choose between spending a million dollars on a liver to save a five-year-old, and spending the million dollars to buy other hospital equipment or pay physician salaries. And the subjects in the experiment became indignant and wanted to punish the hospital administrator for even thinking about the choice. Do you remember reading about that, Harry Potter? Do you remember thinking how very stupid that was, since if hospital equipment and doctor salaries didn’t also save lives, there would be no point in having hospitals or doctors? Should the hospital administrator have paid a billion pounds for that liver, even if it meant the hospital going bankrupt the next day?

To bring it home, we find out that Voldemort captured Dumbledore's brother and demanded ransom, and Mad-Eye counseled thusly:

"You ransom Aberforth, you lose the war," the man said sharply. "That simple. One hundred thousand Galleons is nearly all we’ve got in the war-chest, and if you use it like this, it won’t be refilled. What’ll you do, try to convince the Potters to empty their vault like the Longbottoms already did? Voldie’s just going to kidnap someone else and make another demand. Alice, Minerva, anyone you care about, they’ll all be targets if you pay off the Death Eaters. That’s not the lesson you should be trying to teach them."

So instead of ransoming Aberforth, he burned Lucius Malfoy's wife alive (or at least convinced the death eaters that he did). That way they would think twice about targeting him.

I think the rationality lesson is fine and dandy, just one problem - this situation is not at all like the hospital administrator in the example given. The problem here is that the idea of putting a price on a human life is only a useful concept in day-to-day reality, where money has some real meaning. In an actual war, even one against a sort of weird guerrilla army of dark wizards, money only becomes useful if you can exchange it for more resources, and in the wizard war the resource that matters is wizards.

Ask yourself this - would a death eater target someone close to Dumbledore even if there were no possibility of ransom? OF COURSE THEY WOULD - the whole point is defeating Dumbledore, the person standing against them. Voldemort wouldn't ask for ransom, because it's a stupid thing to do - he would kill Aberforth and send pieces of him to Dumbledore by owl. This idea that ransoming makes targets of all of Dumbledore's allies is just idiotic - they are already targets.

Next, ask yourself this - does Voldemort have any use for money? Money is an abstraction, useful because we can exchange it for useful things. But it's pretty apparent that Voldemort doesn't really need money - he has no problem killing, taking, and stealing. The parts of magical Britain that are willing to stand up to him won't sell his death eaters goods at any price, and the rest are so scared they'll give him anything for free.

Meanwhile, Dumbledore is leading a dedicated resistance - basically a volunteer army. He doesn't need to buy people's time; they are giving it freely! Mad-Eye himself notes that he could ask the Longbottoms or the Potters to empty their vaults and they would. What the resistance needs isn't money, it's people willing to fight. So in the context of this sort of war, an able fighting man like Aberforth is worth basically infinite money - money is common and useless, and people willing to stand up to Voldemort are in extremely tight supply.

It would have made a lot more sense to have Voldemort ask for a prisoner exchange or something like that - Aberforth in exchange for Bellatrix Black. Then both sides would be trading value for value. But then the Tetlock reference wouldn't be nearly as on-the-nose.

At least this chapter makes clear the reason for the profoundly stupid wizard justice system and the utterly absurd blood-debt punishment system. The whole idea was to arrange things so Hariezer could be asked to pay a ransom to Lucius Malfoy, so the reader can learn about Tetlock's research, putting a price on lives, etc.

At least I only have like 20 chapters of this thing left.

Name of the Wind bitching

Whelp, Kvothe's "awesomeness" has totally overwhelmed the narrative. Kvothe now has several "awesome" skills - he plays music so well that he was spontaneously given 7 talents (which is 7 times the standard Rothfuss unit for "a lot of money"). He plays music for money in a low-end tavern.

He is a journeyman artificer, which means he can make magic lamps and whatnot and sell them for cash. He is brilliant enough that he could easily tutor students. He has two very wealthy friends he could borrow from.

AND YET he is constantly low on cash. To make this seem plausible, the book is weighed down by super long exposition in which Kvothe explains to the reader why all these obvious ways to make money aren’t working for him. When Kvothe isn’t explaining it to the reader directly, we cut to the framing story where Kvothe is explaining it to his two listeners. The book is chock-full of these paragraphs that are like “I know this is really stupid but here is why it actually makes sense.” Removing all this justification/exposition would probably cut the length of the book by at least 1/4.

I could look past all of this if we were meeting other interesting characters at wizard school, but that isn’t happening. Kvothe has two good friends among the students, Wil and Sim. I’ve read countless paragraphs of banter between them and Kvothe, but I don’t know what they study, or really anything about them other than one has an accent.

Another character, Auri, a homeless girl who is that fantasy version of "mentally ill" that just makes people extra quirky, became friends with Kvothe off-screen. Literally, we find out she exists after she has already listened to Kvothe playing music for days. She shows up for a scene, then vanishes again for a while.

And we get a love interest who mostly just banters wittily with Kvothe and then vanishes. After pages of witty banter, Kvothe will then remind the reader he is shy around women (despite, you know, having just wittily bantered for pages, because that’s how characterization works in this novel).

HPMOR bitching

Much like my previous Name of the Wind complaints, HPMOR is heavy with exposition - and for a similar reason. Hariezer is too "awesome," which leads to heavy-handed exposition (if for slightly different reasons than Name of the Wind).

The standard rule of show, don't tell implies that the best way to teach your audience something in a narrative is to have your characters learn from experience. Your characters need to make a mistake, or have something backfire. That way they can come out the other side stronger, having learned something. If you don't trust your audience to have gotten the lesson, you can cap it off with some exposition outlining exactly what you want them to learn, but the core of the lesson should be taught by the characters' experience.

But Yudkowsky inserted a Hariezer fully equipped with the "methods of rationality." So we get lots of episodes that set up a conflict, and then Hariezer has a huge dump of exposition that explains why it's not really a problem because rationality-judo, and the tension drains away. It would be far better to have Hariezer learn over time, so the audience can learn along with him.

So Hariezer isn’t going to grow, he is just going to exposition dump most of his problems away. We can at least watch him explore the world, right? After all, Yudkowsky has put a “real scientist” into Hogwarts so we can finally see what material you actually learn at wizard school! All that academic stuff missing from the original novels! NOPE- we haven’t had a single class in the last 60 chapters. Hariezer isn’t even learning magic in a systematic way.

I really, really don't see what people see in this. The handful of chapters I found amusing feel like an eternity ago; it ran off the rails dozens of chapters back! People sell the story as "using the scientific method in JK Rowling's universe," but a more accurate description would be "it starts as using the scientific method in JK Rowling's universe, but stops doing that around chapter 25 or so. Then mostly it's just about a war game, with some political maneuvering."

HPMOR 83-84

These are just rehashes of things we've already been presented with (so many words, so little content). The other students still think Hermione did it (although this is written in an awkward tell-rather-than-show style - Hariezer tells Hermione what is going on, rather than Hermione or the reader experiencing it). We get gems of cloying elitism like this:

Hermione, you’ve told me a lot of times that I look down too much on other people. But if I expected too much of them - if I expected people to get things right - I really would hate them, then. Idealism aside, Hogwarts students don’t actually know enough cognitive science to take responsibility for how their own minds work. It’s not their fault they’re crazy.

There is one bit of new info- as part of this investigation of the attempted murder of Draco, I guess Quirrell was investigated, and the aurors seem to think he is some missing wizard lord or something. This is totally superfluous, I assume we all know Quirrell is Voldemort. I’m hoping this doesn’t turn into a plot line.

And finally, Quirrell tries to convince Hermione to leave and go to a wizard school where people don’t think she tried to kill someone. This is fine, but in part of it, Quirrell gives us this gem on being a hero:

Long ago, long before your time or Harry Potter’s, there was a man who was hailed as a savior. The destined scion, such a one as anyone would recognize from tales, wielding justice and vengeance like twin wands against his dreadful nemesis…

"In all honesty…I still don’t understand it. They should have known that their lives depended on that man’s success. And yet it was as if they tried to do everything they could to make his life unpleasant. To throw every possible obstacle into his way. I was not naive, Miss Granger, I did not expect the power-holders to align themselves with me so quickly - not without something in it for themselves. But their power, too, was threatened; and so I was shocked how they seemed content to step back, and leave to that man all burdens of responsibility. They sneered at his performance, remarking among themselves how they would do better in his place, though they did not condescend to step forward.”

So… the people seem mostly to rally around Dumbledore. He has a position of power and influence because of his dark-wizard-vanquishing deeds. There aren't a lot of indications that people are actively attempting to make Dumbledore's life unpleasant; he has the position he wants, turned down the position of Minister of Magic, etc. People are mostly in awe of Dumbledore.

But there is some other hero, we are supposed to believe, who society mocked? I can’t help but draw parallels to Friendly AI research here…

HPMOR 85

A return to my blogging obsession of old (which has been a slog for at least 20 chapters now, but if there is one thing that is true of all PhDs - we finish what we fucking start, even if it's an awful idea).

This chapter is actually not so bad, mostly Hariezer just reflecting on the difficulty of weighing his friends' lives against "the cause," as Dumbledore suggested he failed to do with Hermione in her trial a few chapters ago.

There are some good bits. For instance, this interesting bit about bows and arrows in Australia:

A year ago, Dad had gone to the Australian National University in Canberra for a conference where he’d been an invited speaker, and he’d taken Mum and Harry along. And they’d all visited the National Museum of Australia, because, it had turned out, there was basically nothing else to do in Canberra. The glass display cases had shown rock-throwers crafted by the Australian aborigines - like giant wooden shoehorns, they’d looked, but smoothed and carved and ornamented with painstaking care. In the 40,000 years since anatomically modern humans had migrated to Australia from Asia, nobody had invented the bow-and-arrow. It really made you appreciate how non-obvious was the idea of Progress.

I always thought the fact that Australians (and a lot of small islanders) lost the bow and arrow (which is interesting! They had it and then they forgot about it!) was an interesting observation about the power of sharing ideas and the importance of large groups for creativity. Small, isolated populations seem to lose the ability to innovate. Granted, almost all of my knowledge about this comes from one anthropology course I only half remember.

And of course there are always some sections that fill me with rage -

Even though Muggle physics explicitly permitted possibilities like molecular nanotechnology or the Penrose process for extracting energy from black holes, most people filed that away in the same section of their brain that stored fairy tales and history books, well away from their personal realities:

Molecular nanotechnology is just the term that sci-fi authors (and Eric Drexler) use for magic. And the nearest black hole is probably something like 2000 light years away. The reason people treat this stuff as far from their personal reality is exactly the same reason Yudkowsky treats it as far from his personal reality - IT IS. Black holes are neat, and GR is a ton of fun, but we aren't going to be engineering with black holes in my lifetime.

No surprise, then, that the wizarding world lived in a conceptual universe bounded - not by fundamental laws of magic that nobody even knew - but just by the surface rules of known Charms and enchantments…Even if Harry's first guess had been mistaken, one way or another it was still inconceivable that the fundamental laws of the universe contained a special case for human lips shaping the phrase 'Wingardium Leviosa'. …What were the ultimate possibilities of invention, if the underlying laws of the universe permitted an eleven-year-old with a stick to violate almost every constraint in the Muggle version of physics?

You know what would be awesome? IF YOU GOT AROUND TO DOING SOME EXPERIMENTS AND EXPLORING THIS IDEA. The absolute essence of science is NOT asking these questions, it's deciding to try to find out the fucking answers! You can't be content to just wonder about things, you have to put the work in! Hariezer's wonderment never gets past the stoned-college-kid wondering aloud and into ACTUAL exploration, and it's getting really frustrating. YOU PROMISED ME YOU WERE GOING TO USE THE SCIENTIFIC METHOD TO LEARN THE SECRETS OF MAGIC. WAY BACK IN THE EARLY CHAPTERS.

Anyway, towards the end of the ruminations, Fawkes visits Hariezer and basically offers to take him to Azkaban to try to take out the evil place. Hariezer (probably wisely) decides not to go. And the chapter ends.

HPMOR 86

I just realized I have like 145 followers (HOLY SHIT!) and they probably came for the HPMOR thing. So I better keep the updates rolling!

Anyway, this chapter is basically Hariezer and friends (Dumbledore, Snape, McGonagall, Mad-Eye Moody) all trying to guess who might have been responsible for trying to frame Hermione. No real conclusions are drawn; not much to see here.

A few notable things here- magic apparently works by the letter of the law, rather than the spirit:

You say someone with the Dark Mark can’t reveal its secrets to anyone who doesn’t already know them. So to find out how the Dark Mark operates, write down every way you can imagine the Dark Mark might work, then watch Professor Snape try to tell each of those things to a confederate - maybe one who doesn’t know what the experiment is about - I’ll explain binary search later so that you can play Twenty Questions to narrow things down - and whatever he can’t say out loud is true. His silence would be something that behaves differently in the presence of true statements about the Mark, versus false statements, you see.

Luckily, Voldemort thought of the test, thus freeing Snape to tell how the mark actually works:

The Dark Lord was no fool, despite Potter’s delusions. The moment such a test is suspected, the Mark ceases to bind our tongues. Yet I could not hint at the possibility, but only wait for another to deduce it.

Why not just make sure the death eaters don’t actually know the secrets of the mark? Seems like memory spells are everywhere already, and it would be way easier than this silly logic puzzle.

Finding out the secrets of the dark mark prompts Hariezer to try a Bayesian estimate of whether Voldemort is actually dangerous. I repeat that for emphasis:

Harry Potter, a first-year at Hogwarts who has only really succeeded at one thing in his learn-the-science-of-magic plan (partial transfiguration), and who knows he is not the most dangerous wizard at Hogwarts (Quirrell, Dumbledore), wonders whether Voldemort could possibly be a threat.

Here are some of the things he considers:

Harry had been to a convocation of the Wizengamot. He’d seen the laughable ‘security precautions’, if you could call them that, guarding the deepest levels of the Ministry of Magic. They didn’t even have the Thief’s Downfall which goblins used to wash away Polyjuice and Imperius Curses on people entering Gringotts … [if it] took you more than ten years to fail to overthrow the government of magical Britain, it meant you were stupid. But might they have some other precautions? Maybe they use some sort of secret precautions Harry himself doesn’t yet know about yet? Or might the wizards of the Wizengamot be pretty powerful in their own right?

There were hypotheses where the Dark Lord was smart and the Order of the Phoenix didn’t just instantly die, but those hypotheses were more complicated and ought to get complexity penalties. After the complexity penalties of the further excuses were factored in, there would be a large likelihood ratio from the hypotheses ‘The Dark Lord is smart’ versus ‘The Dark Lord was stupid’ to the observation, ‘The Dark Lord did not instantly win the war’. That was probably worth a 10:1 likelihood ratio in favor of the Dark Lord being stupid… but maybe not 100:1. You couldn’t actually say that ‘The Dark Lord instantly wins’ had a probability of more than 99 percent, assuming the Dark Lord started out smart; the sum over all possible excuses would be more than .01.

Dude, do you even Bayesian? Probability the dark mark still works if Voldemort is dead: ~0 (everyone who knows magic thinks that the mark still existing is proof he is still out there). Given that Voldemort is alive, probability he successfully completed some sort of immortality ritual: ~1. Probability someone who completed an immortality ritual knows more magic than (and therefore is a threat to) Hariezer Yudotter: ~1.

So given that the dark mark is still around, Voldie is crazy dangerous, regardless of priors or base rates.

It’s helpful to look at where the information is, instead of trying to estimate the probability Voldemort could have instantly killed some of the most powerful wizards on the fucking planet.
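Here's what "looking at where the information is" buys you in odds form - a minimal sketch with my own toy numbers, purely illustrative: the observation "the dark mark still works" is nearly impossible if Voldemort is actually gone, so its likelihood ratio swamps whatever prior you had about him being smart or stupid.

```python
# Toy illustration (hypothetical numbers): posterior odds = prior odds * likelihood ratio.
# An observation that's nearly impossible under one hypothesis dominates the conclusion.

def posterior_odds_alive(prior_odds_alive, p_mark_works_if_alive, p_mark_works_if_dead):
    likelihood_ratio = p_mark_works_if_alive / p_mark_works_if_dead
    return prior_odds_alive * likelihood_ratio

# Even starting from skeptical 1:100 odds that Voldemort survived, a 0.99 vs 0.001
# split on "the mark still works" gives ~10:1 odds that he's still out there.
print(posterior_odds_alive(prior_odds_alive=1 / 100,
                           p_mark_works_if_alive=0.99,
                           p_mark_works_if_dead=0.001))  # ~9.9
```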

Anyway….

OH, another thing that happens - Hariezer challenges Mad-Eye to a little mini duel. Guess how he solves the problem of winning against Mad-Eye? Any ideas? What could he use? I'll give you a hint: it rhymes with time turner. This story really should be called Harry Potter and the Method of Time Turners. Seriously - time turners solve basically all the problems in this book. Anyway, he goes to Flitwick, learns a bending spell, and then time turners back into the room to pop Moody.

It’s not actually a bad scene, there is a bit of action and it moves pretty quickly. The problem is that the time turner solution is so damn boring at this point.

Also, we find out in this chapter that everyone believes Quirrell is really somebody named David Monroe whose family was killed by Voldemort and who was a leader during the war against Voldemort.

So we have some possibilities:

  1. Voldemort was impersonating/half-invented the personality of David Monroe in order to play both sides during the war. After all, all of Monroe’s family was killed but him. Maybe all of Monroe’s family was killed, including him, and Voldemort started impersonating the dead guy. This could be a neat dynamic I guess. Could “Voldemort” have been a Watchmen style plan to unite magical Britain against a common threat that went awry for Monroe/Riddle? Quirrell really did get body snatched in this scenario. We could imagine an ending here where Monroe/Riddle are training Potter to be the leader of magical Britain that Monroe/Riddle wanted to be.

  2. Monroe was a real dude, Voldemort body-snatched him, and now you’ve got Monroe brain fighting Voldemort brain inside. For some reason, they are impersonating Quirrell?

If it's not the first scenario, I'm going to be sort of annoyed, because scenario 2 doesn't provide us with much reason for the weird Monroe bit - you could just give Quirrell all of Monroe's backstory.

Anyway, 86 chapters down; I think this damn thing is going to clock in around 120 when all is said and done. ::sigh:: Time for a scotch.

HPMOR 87: Skinner box your friends

Hariezer is worried Hermione will be uncomfortable around him after the trial. So what is his solution?

"I was going to give you more space," said Harry Potter, "only I was reading up on Critch’s theories about hedonics and how to train your inner pigeon and how small immediate positive and negative feedbacks secretly control most of what we actually do, and it occurred to me that you might be avoiding me because seeing me made you think of things that felt like negative associations, and I really didn’t want to let that run any longer without doing something about it, so I got ahold of a bag of chocolates from the Weasley twins and I’m just going to give you one every time you see me as a positive reinforcement if that’s all right with you -"

Now, this idea of positive/negative reinforcement is an old one, and probably goes back to the psychologists associated with behaviorism (B.F. Skinner, Pavlov, etc.).

The weird thing is, there is no "Critch" I can find associated with the behaviorists, or really with any of the stuff attributed above. I also emailed my psych friend, who has also never heard of him (but "it's not really my field at all"). I'm thinking there is like a 90% chance that Yudkowsky just invented a scientist here? Why not just say B.F. Skinner, or Pavlov? WHAT IS GOING ON HERE?

Anyway, Hermione and Hariezer are brainstorming ways to make money when they get into an argument because Hariezer has been sciencing with Draco:

"You were doing SCIENCE with him? " "Well -" "You were doing SCIENCE with him? You were supposed to be doing science with ME! " Hermione, I get it. You wanted to figure out how magic works you’ve got some curiosity about the world. And now you think Hariezer kept his program going, but cut you out of the big discoveries, will leave you off the publications. But I’ve got news for you, girl, he hasn’t been doing science WITH ANYONE for like 60 chapters now. HE JUST FORGOT ABOUT IT.

Anyway, this argument blows up, and Hariezer explains puberty:

But even with all that weird magical stuff letting me be more adult than I should be, I haven’t gone through puberty yet and there’s no hormones in my bloodstream and my brain is physically incapable of falling in love with anyone. So I’m not in love with you! I couldn’t possibly be in love with you!

And then he drops some evopsych:

and besides I've been reading about evolutionary psychology, and, well, there are all these suggestions that one man and one woman living together happily ever afterward may be more the exception rather than the rule, and in hunter-gatherer tribes it was more often just staying together for two or three years to raise a child during its most vulnerable stages - and, I mean, considering how many people end up horribly unhappy in traditional marriages, it seems like it might be the sort of thing that needs some clever reworking - especially if we actually do solve immortality

To the story's credit, this works about as well as you'd expect and Hermione storms off.

I think the evopsych dropping could have been sort of funny if it were played more for laughs (Hariezer’s inept way of calming Hermione down), but here it just seems like a way to shoehorn this bit of evopsych into the story.

The final scene in the chapter is played for laughs, with another student coming over after seeing Hermione storm off and saying “Witches! go figure, huh?”

HPMOR 88: in which I complain about a lack of time turners

The problem with solving every problem in your story with time turners is that it becomes incredibly conspicuous when you don’t solve a problem with time turners.

In this chapter, the bit of canon from book 1 with the troll in the dungeon is introduced - someone comes running into the dining hall yelling "troll." Luckily, Quirrell has the students well prepared:

Without any hesitation, the Defense Professor swung smoothly on the Gryffindor table and clapped his hands with a sound like a floor cracking through. "Michelle Morgan of House Gryffindor, second in command of Pinnini’s Army," the Defense Professor said calmly into the resulting quiet. "Please advise your Head of House." Michelle Morgan climbed up onto her bench and spoke, the tiny witch sounding far more confident than Minerva remembered her being at the start of the year. “Students walking through the hallways would be spread out and impossible to defend. All students are to remain in the Great Hall and form a cluster in the center… not surrounded by tables, a troll would jump right over tables… with the perimeter defended by seventh-year students. From the armies only, no matter how good they are at duelling, so they don’t get in each other’s lines of fire.”

So everyone will be safe from the troll, but WAIT - Hariezer realizes Hermione is missing. What does he do? Does he commit himself to time turning himself a message telling him where Hermione is? (To be fair, the time is noon, and the earliest he can reach with a time turner is 3pm. However, he knows of another student who uses a time turner and is willing to send messages with it, from the post-Azkaban escape. He also knows other powerful wizards use time turners, so he could ask one of them to pass the message, etc.)

I suspect we are approaching an important plot moment that time turnering would somehow break. Maybe we finally get a Quirrell reveal? Anywho, it's jarring to not see Hariezer go immediately for the time turner. Instead he tries to enlist the aid of other students (without asking if anyone has a time turner).

Anyway, Hariezer decides they need to go look for her as fast as possible- but then

The witch he’d named turned from where she’d been staring steadily out at the perimeter, her expression aghast for the one second before her face closed up. "The Deputy Headmistress ordered us all to stay here, Mr. Potter." It took an effort for Harry to unclench his teeth. “Professor Quirrell didn’t say that and neither did you. Professor McGonagall isn’t a tactician, she didn’t think to check if we had missing students and she thought it was a good idea to start marching students through the hallways. But Professor McGonagall understands after her mistakes are pointed out to her, you saw how she listened to you and Professor Quirrell, and I’m certain that she wouldn’t want us to just ignore the fact that Hermione Granger is out there, alone -“

So Hariezer flags this as

Harry’s brain flagged this as I’m talking to NPCs again and he spun on his heel and dashed back for the broomstick.

Yes, Hariezer, in this world you are talking to NPCs- characters Yudkowsky wrote in, entirely to be stupid so that you can appear brilliant.

Anyway, he rushes off with the Weasley twins to go find Hermione, and just as he finds her the chapter ends. I look forward to tuning in next time for the thrilling conclusion.

HPMOR 89: grimdark

There will be spoilers ahead. Although if you cared about spoilers why are you reading this?

So I thought the plot moment we were leading up to was a Quirrell reveal, and I was dead wrong (a pun, because Hermione dies). By the time Hariezer arrives, Hermione has already been troll-smashed (should have used the time turner, bro).

A brief battle scene ensues in which the Weasleys fail to be very effective, and Hariezer kills the troll by floating his father's rock (which he has been wearing in a ring) into the troll's mouth and then letting it go back to its original size, which pops the troll's head.

Hermione then utters her final words, "not your fault," and dies. Hariezer is obviously upset by this.

Not a bad chapter really, even though it required a sort of "rationality failure" involving the time turners to get here. Normally I wouldn't care about this sort of thing, but the fact that basically every problem thus far was solved with time turners makes it very hard to suspend my disbelief here. It feels a touch too much like the characters are doing things just to make the plot happen (and not following their 'natural' actions).

I fear the next ten chapters will be just reflections on this (instead of things happening).

HPMOR 90: Hariezer's lack of self-reflection

Brief note - it's Mardi Gras, and I'm about as over-served as I ever have been. I LIKE HOW OVER-SERVED AS A PHRASE BLAMES THE BARTENDER AND NOT ME. THIS IS A THEME FOR THIS CHAPTER. Anyway, hopefully this will not lack my usual (non) eloquence.

This chapter begins what appears to be a nine-part section on Hariezer trying to cope with the death of his friend.

As the chapter opens, Hariezer cools Hermione’s body to try to preserve it. I guess that will slow decay, but probably not by enough to matter.

And then Hariezer gets understandably mopey. Everyone is concerned he is withdrawing from the world, so McGonagall goes to talk to him and we get this bit:

"Nothing I could have done? " Harry’s voice rose on the last word. "Nothing I could have…Or if I’d just approached the whole problem from a different angle - if I’d looked for a student with a Time-Turner to send a message back in time..

It's the bit about the Time-Turner that is especially troubling, because the time turner is seriously what Hariezer always turns to (TURNS TO! GET IT! IT'S AN AWFUL PUN). When your character is defined by his munchkining ability to solve problems via time turner, and the one time he doesn't go for the time turner a major plot point happens, it's jarring to the reader. Almost as if characters are behaving entirely to make the plot happen…

Anyway,

She was aware now that tears were sliding down her cheeks, again. “Harry - Harry, you have to believe that this isn’t your fault!” "Of course it’s my fault. There’s no one else here who could be responsible for anything." "No! You-Know-Who killed Hermione!" She was hardly aware of what she was saying, that she hadn’t screened the room against who might be listening. "Not you! No matter what else you could’ve done, it’s not you who killed her, it was Voldemort! If you can’t believe that you’ll go mad, Harry!" "That’s not how responsibility works, Professor." Harry’s voice was patient, like he was explaining things to a child who was certain not to understand. He wasn’t looking at her anymore, just staring off at the wall to her right side. "When you do a fault analysis, there’s no point in assigning fault to a part of the system you can’t change afterward

So keep this in mind - Hariezer says it's no use blaming anyone but himself, because he can't change their actions. This seems like a silly NPC/PC distinction - no one can change their past actions, but everyone can learn how they could have improved things.

"All right, then," Harry said in a monotone. "I tried to do the sensible thing, when I saw Hermione was missing and that none of the Professors knew. I asked for a seventh-year student to go with me on a broomstick and protect me while we looked for Hermione. I asked for help. I begged for help. And nobody helped me. Because you gave everyone an absolute order to stay in one place or they’d be expelled, no excuses…. So when something you didn’t foresee happened and it would’ve made perfect sense to send out a seventh-year student on a fast broom to look for Hermione Granger, the students knew you wouldn’t understand or forgive. They weren’t afraid of the troll, they were afraid of you. The discipline, the conformity, the cowardice that you instilled in them delayed me just long enough for Hermione to die. Not that I should’ve tried asking for help from normal people, of course, and I will change and be less stupid next time. But if I were dumb enough to allocate responsibility to someone who isn’t me, that’s what I’d say."

What exactly does Hariezer think she should have said here? If a fire had broken out in the meal hall, does Hariezer think that everyone would have stayed in the cafeteria and burned to death out of fear of McGonagall? Also, it certainly sounds as if Hariezer has plenty of blame for people other than himself. "I only blame me, but also you suck in the following ways…"

"But normal people don't choose on the basis of consequences, they just play roles. There's a picture in your head of a stern disciplinarian and you do whatever that picture would do, whether or not it makes any sense….People like you aren't responsible for anything, people like me are, and when we fail there's no one else to blame."

I AM THE ONLY PC, YOU ARE ALL NPCs. I AM THE ONLY FULL HUMAN. TREMBLE BEFORE MY AGENTYNESS. I get that Hariezer is mourning, but is there any more condescending way to mourn? "Everything is my fault because you aren't all even fully human?" You are a fucking twerp, Hariezer, even when you mourn.

His hand darted beneath his robes, brought forth the golden sphere that was the Ministry-issued protective shell of his Time Turner. He spoke in a dead, level voice without any emphasis. "This could've saved Hermione, if I'd been able to use it. But you thought it was your role to shut me down and get in my way."

No, Hariezer, you were told THERE WERE RULES and you violated them. You yourself have said that time travel can be dangerous, and you were using it because Snape asked questions you didn't know the answer to, and really to solve any trivial problem. You broke the rules, and that got your time turner locked down when you might have really wanted it. Total boy-who-cried-wolf situation, and yet it's conspicuously absent from your discussion above - you blame yourself in lots of ways, but not in this way.

Unable to speak, she brought forth her wand and did so, releasing the time-keyed enchantment she’d laced into the shell’s lock.

The only lesson learned from this is other characters "updating towards" the idea that Hariezer Yudotter is always right, and that he only fails when other people have prevented his natural PC-based awesomeness.

Anyway, McGonagall sends in the big guns (Quirrell) to try to talk to Hariezer, which leads Hariezer to say to him:

The boy’s voice was razor-sharp. “I’d forgotten there was someone else in Hogwarts who could be responsible for things.”

And later in the conversation:

"You did want to save her. You wanted it so strongly that you made some sort of actual effort. I suppose your mind, if not theirs, would be capable of that."

So you see - it's clearly not about assigning himself all the blame (because he can only change his own actions), it's about separating the world into 'real people' and 'NPCs.' Only real people can get any blame for anything; everyone else is just window dressing. Maybe it's a pet peeve, but I react with abhorrence to this "you aren't even human enough to share some blame" schtick.

8 more chapters in this fucking section.

HPMOR 91

Total retread of the last chapter. Hariezer is still blaming himself, Snape tries to talk to him. They bring his parents in to try to talk to him. Nothing here really.

HPMOR 92

Really, still nothing here. Quirrell is also concerned about Hariezer, but as before his concern seems less than totally genuine. I fear this arc is basically just a lot of retreads.

HPMOR 93

Still very little going on in these chapters…

So McGonagall completes the transformation she began two chapters ago, and realizes rules are for suckers and Hariezer is always right:

"I am ashamed," said Minerva McGonagall, "of the events of this day. I am ashamed that there were only two of you. Ashamed of what I have done to Gryffindor. Of all the Houses, it should have been Gryffindor to help when Hermione Granger was in need, when Harry Potter called for the brave to aid him. It was true, a seventh-year could have held back a mountain troll while searching for Miss Granger. And you should have believed that the Head of House Gryffindor," her voice broke, "would have believed in you. If you disobeyed her to do what was right, in events she had not foreseen. And the reason you did not believe this, is that I have never shown it to you. I did not believe in you. I did not believe in the virtues of Gryffindor itself. I tried to stamp out your defiance, instead of training your courage to wisdom.

Maybe I'm projecting too much of canon McGonagall onto my reading of the one in this fanfic, but has she really been stamping out all defiance and been overly stern? Would any student really have believed they would have been expelled for trying to help find a missing student in a dire situation?

Hariezer certainly wasn’t expelled (or punished in any way) for his experimenting with transfiguration/discovering partial transfiguration. He was punished for flaunting his time turner despite explicit instructions not to… But in a school for magic users, that is probably a necessity.

Also, Hermione’s body has gone missing. I suspect Hariezer is cryonicsing it.

HPMOR 94

This is the best chapter of this “reflect on what just happened” giant block of chapters, but that’s not saying much.

Hariezer might not have taken Hermione’s body, but seems unconcerned that it’s missing (maybe he took it to freeze the brain, maybe Voldie took it to resurrect Hermione or brain upload her or something). That’s the only real thing of merit that happens in this chapter (a conversation between Dumbledore and Hariezer, a conversation between Neville and Hariezer).

Hariezer has finally convinced himself that Voldemort is smart, which leads to this rumination

Okay, serious question. If the enemy is that smart, why the heck am I still alive? Is it seriously that hard to poison someone, are there Charms and Potions and bezoars which can cure me of literally anything that could be slipped into my breakfast? Would the wards record it, trace the magic of the murderer? Could my scar contain the fragment of soul that’s keeping the Dark Lord anchored to the world, so he doesn’t want to kill me? Instead he’s trying to drive off all my friends to weaken my spirit so he can take over my body? It’d explain the Parselmouth thing. The Sorting Hat might not be able to detect a lich-phylactery-thingy. Obvious problem 1, the Dark Lord is supposed to have made his lich-phylactery-thingy in 1943 by killing whatshername and framing Mr. Hagrid. Obvious problem 2, there’s no such thing as souls.

So, all the readers are already on board this train, because they’ve read the canon novels, so I guess it’s nice that the “super rationalist” is considering it (although “Voldemort is smart, therefore I have a Voldemort fragment trying to possess me” is a huge leap. You didn’t even Bayes that shit, bro).

But seriously, “there’s no such thing as souls?” SO DON’T CALL IT A SOUL, CALL IT A MAGIC RESURRECTION FRAGMENT. Are we really getting hung up on semantics?

These chapters are intensely frustrating because any “rising action” in this story (we are nearing the conclusion after all) is blunted: after anything happens, we need 10 chapters for everyone to talk about everything and digest the events. The ratio of words/plot is ridiculously huge.

We do maybe get a bit of self-reflection when Neville tries to blame himself for Hermione’s death:

"Wow," the empty air finally said. "Wow. That puts a pretty different perspective on things, I have to say. I’m going to remember this the next time I feel an impulse to blame myself for something. Neville, the term in the literature for this is ‘egocentric bias’, it means that you experience everything about your own life but you don’t get to experience everything else that happens in the world. There was way, way more going on than you running in front of me. You’re going to spend weeks remembering that thing you did there for six seconds, I can tell, but nobody else is going to bother thinking about it. Other people spend a lot less time thinking about your past mistakes than you do, just because you’re not the center of their worlds. I guarantee to you that nobody except you has even consideredblaming Neville Longbottom for what happened to Hermione. Not for a fraction of a second. You are being, if you will pardon the phrase, a silly-dilly. Now shut up and say goodbye."

It would be nice for Hariezer to more explicitly use this to come to terms with his own grieving (instead of insisting on “heroic responsibility” for himself a few sections back, and also insisting it’s McGonagall’s fault for trying to enforce rules, and now insisting that blaming yourself is egocentric bias). I hope this is Hariezer realizing that he shouldn’t blame himself, and growing a bit, but fear this is Hariezer suggesting that Neville isn’t important enough to blame.

Anyway, Hariezer insists that Neville leave for a while to help keep him safe.

HPMOR 95

So the chapter opens with more incuriousness, which is the rest of the chapter in miniature:

Harry had set the alarm upon his mechanical watch to tell him when it was lunchtime, since he couldn’t actually look at his wrist, being invisible and all that. It raised the question of how his eyeglasses worked while he was wearing the Cloak. For that matter the Law of the Excluded Middle seemed to imply that either the rhodopsin complexes in his retina were absorbing photons and transducing them to neural spikes, or alternatively, those photons were going straight through his body and out the other side, but not both. It really did seem increasingly likely that invisibility cloaks let you see outward while being invisible yourself because, on some fundamental level, that was how the caster had - not wanted - but implicitly believed - that invisibility should work.

This would be an excellent fucking question to explore, maybe via some experiments. But no. I’ve totally given up on this story exploring the magic world in any detail at all. Anyway, Hariezer skips straight from “I wonder how this works” to “it must work this way, how could we exploit it”

Whereupon you had to wonder whether anyone had tried Confunding or Legilimizing someone into implicitly and matter-of-factly believing that Fixus Everythingus ought to be an easy first-year Charm, and then trying to invent it. Or maybe find a worthy Muggleborn in a country that didn’t identify Muggleborn children, and tell them some extensive lies, fake up a surrounding story and corresponding evidence, so that, from the very beginning, they’d have a different idea of what magic could do.

This skips all the interesting hard work of science.

The majority of the chapter is a long discussion between Quirrell and Hariezer where Quirrell tries to convince Hariezer not to try to raise the dead. It’s too dangerous, may end the universe, etc.

Lots of discussion about how special Quirrell and Hariezer are because only they would even think to fight death, etc. It’s all a boring retread of ideas already explored in earlier chapters.

It reads a lot like any discussion of cryonics with a cryonics true believer:

The Defense Professor’s voice was also rising. “The Transfiguration Professor is reading from a script, Mr. Potter! That script calls for her to mourn and grieve, that all may know how much she cared. Ordinary people react poorly if you suggest that they go off-script. As you already knew!”

Also, it’s sloppy world building- do we really think no wizards in the HPMOR universe have spent time investigating death/spells to reverse aging/spells to deal with head injuries, etc.?

THERE IS A RESURRECTION STONE AND A LITERAL GATEWAY TO THE AFTERLIFE IN THE BASEMENT OF THE MINISTRY OF MAGIC. Maybe Hariezer’s FIRST FUCKING STOP if he wanted to investigate bringing back the dead SHOULD BE THAT GATE. Maybe some scientific experiments?

It’s like the above incuriousness with the invisibility cloak (and the typical transhumanist approach to science)- assume all the problems are solved and imagine what the world would be like, how dangerous that power might be. This is no way to explore a question. It’s not even producing a very interesting story.

Quirrell assumes Hariezer might end the world even though he has shown 0 aptitude with any magic even approaching dangerous…

HPMOR 96: more of the same

Remus takes Hariezer to Godric’s Hollow to try to cheer him up or whatever.

Hariezer discovers the Potter family motto is apparently the passage from Corinthians:

The Last Enemy to Be Destroyed is Death

Hariezer is super glad that his family has a long history of trying to end death, and (at least) realizes that other wizards have tried. Of course, the idea of actually looking at their research doesn’t fucking occur to him because this story is very silly.

We get this rumination from Hariezer on the Peverells’ ‘deathly hallows’ from the books:

Hiding from Death’s shadow is not defeating Death itself. The Resurrection Stone couldn’t really bring anyone back. The Elder Wand couldn’t protect you from old age.

HOW THE FUCK DO YOU KNOW THE RESURRECTION STONE CAN’T BRING ANYONE BACK? HAVE YOU EVEN SEEN IT?

Step 1- assume that the resurrection stone doesn’t work because you can’t magically bring back the dead

Step 2- decide you want to magically resurrect the dead

Step 3- never revisit step 1.

SCIENCE!

GO INVESTIGATE THE DOORWAY TO THE AFTERLIFE! GO TALK TO PEOPLE ABOUT THE RESURRECTION STONE! DO SOME FUCKING RESEARCH! ”I’m going to resurrect the dead by thinking really hard about how much death sucks and doing nothing else.”

HPMOR 97: plot points resolved arbitrarily

Next on the list to talk with Hariezer regarding Hermione’s death? The Malfoys, who call Hariezer to Gringotts under the pretense of talking about Hariezer’s debt.

On the way in he passes a goblin, which prompts this

If I understand human nature correctly - and if I’m right that all the humanoid magical species are genetically human plus a heritable magical effect -

How did you come to that conclusion Hariezer? What did you do to study it? Did you just make it up with no justification whatsoever? This story confuses science jargon for science.

Anyway, Lucius is worried that he’ll be blamed for Hermione’s death (although given that it has already been established that the wizard court votes exactly as he wants it to, I’m not sure why he is worried about it), so he agrees to cancel Hariezer’s debt and return all his money if Hariezer swears Lucius didn’t have anything to do with the troll that killed Hermione.

This makes very little sense- why would anyone listen to Hariezer on this? Hariezer doesn’t actually know that the Malfoys weren’t involved. If he is asked “how do you know?” he’ll have to say “I don’t.” If he Bayesed that shit, the Malfoys should be near the fucking top of the suspect list…

Anyway, the Malfoys try to convince Hariezer that Dumbledore killed Hermione as some sort of multi-level plot.

I’m so bored.

HPMOR 98: this block is nearly over!

The agreement put in place in the previous chapter is enacted.

Malfoy announces to Hogwarts that Hermione was innocent. Hariezer says there is no ill will between the Potters and the Malfoys. Why did we even need this fucking scene?

Through backstage maneuvering by Hariezer and Malfoy, the Hogwarts board of governors enacts some rules for safety of students (travel in packs, work together, etc.). Why they needed the maneuvering I don’t know (just ask McGonagall to implement whatever rules you want. No effort required).

Also, Neville was sent away from Hogwarts like… three chapters ago. But now he is in Hogwarts and stands up to read some of the rules? And Draco, who was closer to Hariezer, returns to Hogwarts? This makes no sense given Hariezer’s fear for his friends. “No one is safe! Wait, I changed my mind even though nothing has happened.”

There was also a surreal moment where the second worst thing I’ve ever read referenced the first:

"Remind me to buy you a copy of the Muggle novel Atlas Shrugged,"

HPMOR 99

This chapter is literally one sentence long. Unicorn died at Hogwarts. Why not just slap it into the previous chapter?

HPMOR 100

Remember that mysterious bit about the unicorns dying? That merited a whole one-sentence chapter? Luckily, it’s totally resolved in this chapter.

Borrowing a scene from canon, we have Draco and some Slytherin pals (working to fix the school) investigating the forest with Hagrid as part of a detention. This leads to a variant of an old game theory/CS joke:

Meself,” Hagrid continued, “I think we might ‘ave a Parisian hydra on our ‘ands. They’re no threat to a wizard, yeh’ve just got to keep holdin’ ‘em off long enough, and there’s no way yeh can lose. I mean literally no way yeh can lose so long’s yeh keep fightin’. Trouble is, against a Parisian hydra, most creatures give up long before. Takes a while to cut down all the heads, yeh see.” "Bah," said the foreign boy. "In Durmstrang we learn to fight Buchholz hydra. Unimaginably more tedious to fight! I mean literally, cannot imagine. First-years not believe us when we tell them winning is possible! Instructor must give second order, iterate until they comprehend."

This time, it’s just Draco and friends in detention, no Hariezer.

When Draco encounters the unicorn killer, all of a sudden Hariezer and aurors come riding in to save the day:

After Captain Brodski had learned that Draco Malfoy was in the Forbidden Forest, seemingly in the company of Rubeus Hagrid, Brodski had begun inquiring to find out who had authorized this, and had still been unable to find out when Draco Malfoy had missed check-in. Despite Harry’s protests, the Auror Captain, who was authorized to know about Time-Turners, had refused to allow deployment to before the time of the missed check-in; there were standard procedures involving Time. But Brodski had given Harry written orders allowing him to go back and deploy an Auror trio to arrive one second after the missed check-in time.

So… why does Hariezer come with the aurors? For what purpose? He is always talking about avoiding danger, etc., so why ride into danger when the battle wizards will probably be enough?

Anyway, we all know it’s Quirrell killing unicorns, so I’ll skip to the Hariezer/Quirrell interaction:

The use of unicorn’s blood is too well-known.” "I don’t know it," Harry said. "I know you do not," the Defense Professor said sharply. "Or you would not be pestering me about it. The power of unicorn’s blood is to preserve your life for a time, even if you are on the very verge of death."

And then

"And why -" Harry’s breath hitched again. "Why isn’t unicorn’s blood standard in healer’s kits, then? To keep someone alive, even if they’re on the very verge of dying from their legs being eaten?" "Because there are permanent side effects," Professor Quirrell said quietly. "Side effects? Side effects? What kind of side effect is medically worse than DEATH? " Harry’s voice rose on the last word until he was shouting. "Not everyone thinks the same way we do, Mr. Potter. Though, to be fair, the blood must come from a live unicorn and the unicorn must die in the drinking. Would I be here otherwise?" Harry turned, stared at the surrounding trees. “Have a herd of unicorns at St. Mungos. Floo the patients there, or use portkeys.” "Yes, that would work."

So do you remember a few chapters back when Hariezer was worried about eating plants or animals that might be conscious (after he learned snake speech)?

He knows literally nothing about unicorns here, nothing about what the side effects are, etc. I know lots of doctors who have living wills because they aren’t ok with the side effects of certain life-preserving treatments.

This feels again like canon is fighting the transhumanist message the author wants to insert.

HPMOR 101

Still in the woods, Hariezer encounters a centaur who tries to kill him, because he divines that Hariezer is going to make all the stars die.

There are some standard anti-astrology arguments, which again seem to be fighting the actual situation, because the centaurs successfully use astrology to divine things.

We get this:

"Cometary orbits are also set thousands of years in advance so they shouldn’t correlate much to current events. And the light of the stars takes years to travel from the stars to Earth, and the stars don’t move much at all, not visibly. So the obvious hypothesis is that centaurs have a native magical talent for Divination which you just, well, project onto the night sky."

There are so, so many other hypotheses, Hariezer. Maybe starlight has a magical component that waxes and wanes as stars align into different magical symbols, or some such. The HPMOR scientific method:

observation -> generate 1 hypothesis -> assume you are right -> it turns out that you are right.

Quirrell saves Hariezer, and I guess in the aftermath Filch and Hagrid both get sacked (we aren’t actually shown this; instead Dumbledore and Hariezer have a discussion about it, because why show when you can have characters talk about it! So much more interesting!)

Anyway, Dumbledore is a bit sad about the loss of Filch and especially Hagrid, but Hariezer says

"Your mistake," Harry said, looking down at his knees, feeling at least ten percent as exhausted as he’d ever been, "is a cognitive bias we would call, in the trade, scope insensitivity. Failure to multiply. You’re thinking about how happy Mr. Hagrid would be when he heard the news. Consider the next ten years and a thousand students taking Magical Creatures and ten percent of them being scalded by Ashwinders. No one student would be hurt as much as Mr. Hagrid would be happy, but there’d be a hundred students being hurt and only one happy teacher."

First, “in the trade”? Really?

Anyway, Hariezer isn’t multiplying in the obvious tangible benefits of an enthusiastic teacher who really knows his shit regarding magical creatures. Yes, more students will be scalded, but it’s because there will be SUPER AWESOME LESSONS WHERE KIDS COULD BE SCALDED!

In the balance, I think Hariezer was right about Filch and Dumbledore was right about Hagrid.

Anyway, that’s it for this chapter; it’s a standard “chapter where people do nothing but talk.”

Harry Potter and the Methods of Expository Dialogue.

HPMOR 102: open borders and death spells

Quirrell is still dying; Hariezer brings him a unicorn he turned into a stone.

We learn how horcruxes work in this world:

Only one who doess not believe in common liess will reasson further, ssee beneath obsscuration, realisse how to casst sspell. Required murder iss not ssacrificial ritual at all. Ssudden death ssometimes makess ghosst, if magic burssts and imprintss on nearby thing. Horcrux sspell channelss death-bursst through casster, createss your own ghosst insstead of victim’ss, imprintss ghosst in sspecial device. Ssecond victim pickss up horcrux device, device imprintss your memoriess into them. But only memoriess from time horcrux device wass made. You ssee flaw?”

Wait? A ghost has all the memories of the person who died? Why isn’t Hariezer reading everything he can about how these imprints work? If the Horcrux can transfer ghost-like stuff into a person, could you return any ghost to a new body? I feel like Hariezer just says “I’m going to end death! Humanity should end death! I can’t believe no one is trying to end death!” But he isn’t actually doing anything about it himself.

Also, if that is how a horcrux works, WHY THE FUCK WOULD VOLDEMORT PUT ONE ON A PIONEER PROBE? The odds of that encountering people again are pretty much nil. At least we’ve learned horcruxes aren’t conscious- I had assumed Voldemort had condemned one of his copies to an eternity of isolation.

We also learn that in HPMOR world

There is a second level to the Killing Curse. Harry’s brain had solved the riddle instantly, in the moment of first hearing it; as though the knowledge had always been inside him, waiting to make itself known. Harry had read once, somewhere, that the opposite of happiness wasn’t sadness, but boredom; and the author had gone on to say that to find happiness in life you asked yourself not what would make you happy, but what would excite you. And by the same reasoning, hatred wasn’t the true opposite of love. Even hatred was a kind of respect that you could give to someone’s existence. If you cared about someone enough to prefer their dying to their living, it meant you were thinking about them. It had come up much earlier, before the Trial, in conversation with Hermione; when she’d said something about magical Britain being Prejudiced, with considerable and recent justification. And Harry had thought - but not said - that at least she’d been let into Hogwarts to be spat upon. Not like certain people living in certain countries, who were, it was said, as human as anyone else; who were said to be sapient beings, worth more than any mere unicorn. But who nonetheless wouldn’t be allowed to live in Muggle Britain. On that score, at least, no Muggle had the right to look a wizard in the eye. Magical Britain might discriminate against Muggleborns, but at least it allowed them inside so they could be spat upon in person. What is deadlier than hate, and flows without limit? "Indifference," Harry whispered aloud, the secret of a spell he would never be able to cast; and kept striding toward the library to read anything he could find, anything at all, about the Philosopher’s Stone.

So standard open borders stuff, not worth spending time with.

But I want to talk about the magic here- apparently you can cast the killing curse only at people you hate or, at the second level, at people you are indifferent toward. So you can’t kill your loved ones! Big limitation!

Also, Hariezer “99% of the fucking planet is NPCs” Yudotter isn’t indifferent to anyone? I call BS.

HPMOR 103: very punny

Credit where credit is due, this whole chapter sets up a pretty clever pun.

The students take an exam, and then receive their final “battle magic” grades. Hermione is failed because she made the mistake of dying. Hariezer gets an Exceeds Expectations; Quirrell informs Hariezer, “It is the same grade… that I received in my own first year.”

Get it? He marked him as an equal.

HPMOR 104: plot threads hastily tied up/also some nonsense

So this chapter opens with a quidditch game, in an attempt to wrap up an earlier plot thread- Quirrell’s reward for his battle game (a reward given out back in chapter 34 or so, and literally never mentioned again until this chapter) was that Slytherin and Ravenclaw would tie for the house cup and Hogwarts would stop playing quidditch with the snitch.

Going into this game, Hufflepuff is in the lead for house cup “by something like five hundred points.” Quirrell is out of commission with his sickness, but the students have taken matters into their own hands- it appears the plan is just to not catch the snitch?

It was at eight pm and six minutes, according to Harry’s watch, when Slytherin had just scored another 10 points bringing the score to 170-140, when Cedric Diggory leapt out of his seat and shouted “Those bastards!” "Yeah!" cried a young boy beside him, leaping to his own feet. "Who do they think they are, scoring points?" "Not that!" cried Cedric Diggory. "They’re - they’re trying to steal the Cup from us! " "But we’re not in the running any more for -" "Not the Quidditch Cup! The House Cup!"

What? It’s totally unclear to me how this is supposed to work. In the books, as I remember it, points were awarded for winning quidditch games NOT for simply scoring points within a quidditch game? Winning 500 to 500 will just result in some fixed amount of points going to the winner.

Also, there appears to be a misunderstanding of quidditch:

“The game had started at six o’ clock in the afternoon. A typical game would have gone until seven or so, at which point it would have been time for dinner.” No, as I recall, games go on for days, not one hour. I think the books mention a game lasting longer than a month. No one would be upset at a game where the snitch hasn’t been caught in a few hours.

Basically, this whole thing feels really ill-conceived.

Luckily, the chapter pivots away from the quidditch game pretty quickly: Hariezer gets a letter from himself.

Beware the constellation, and help the watcher of stars and by the wise and the well-meaning. in the place that is prohibited and bloody stupid. Pass unseen by the life-eaters’ confederates, Six, and seven in a square,

I note that Hariezer established way back when somewhere that he has a system in place to communicate with himself, with secret codes for his notes to make sure they really are for him. I’m too lazy to dig this back up, but I definitely remember reading it. Probably in chapter 13 with the time travel game?

Anyway, apparently Hariezer has forgotten this (I hope this comes up and it’s not just a weird problem introduced for no reason?) because this turns out to be a decoy note from Quirrell to lure him to the forbidden corridor. After a whole bunch of people all show up at the forbidden corridor at the same time, and some chaos breaks out, Hariezer and Quirrell are the last men standing, which leads to this:

An awful intuition had come over Harry, something separate from all the reasoning he’d done so far, an intuition that Harry couldn’t put into words; except that he and the Defense Professor were very much alike in certain ways, and faking a Time-Turned message was just the sort of creative method that Harry himself might have tried to bypass all of a target’s protections - … And Professor Quirrell had known a password that Bellatrix Black had thought identified the Dark Lord and his presence gave the Boy-Who-Lived a sense of doom and his magic interacted destructively with Harry’s and his favorite spell was Avada Kedavra and and and … Harry’s mouth was dry, even his lips were trembling with adrenaline, but he managed to speak. “Hello, Lord Voldemort.” Professor Quirrell inclined his head in acknowledgement, and said, “Hello, Tom Riddle.”

We also indirectly find out that Quirrell killed Hermione (but we already knew that), although he did it by controlling professor Sprout (I guess to throw off the scent if he got caught?)

Anyway, this pivotal plot moment seems to rely entirely on the fact that Hariezer forgot his own coded note system?

HPMOR 105

So Quirrell gets Hariezer to cooperate with him, by threatening students and offering to resurrect Hermione if he gets the philosopher’s stone:

And know this, I have taken hostages. I have already set in motion a spell that will kill hundreds of Hogwarts students, including many you called friends. I can stop that spell using the Stone, if I obtain it successfully. If I am interrupted before then, or if I choose not to stop the spell, hundreds of students will die.

Hariezer does manage to extract a concession:

Agreed,” hissed Professor Quirrell. “Help me, and you sshall have ansswerss to your quesstions, sso long ass they are about passt eventss, and not my planss for the future. I do not intend to raisse my hand or magic againsst you in future, sso long ass you do not raisse your hand or magic againsst me. Sshall kill none within sschool groundss for a week, unlesss I musst. Now promisse that you will not attempt to warn againsst me or esscape. Promisse to put forth your own besst efforts toward helping me to obtain the Sstone. And your girl-child friend sshall be revived by me, to true life and health; nor sshall me or mine ever sseek to harm her.” A twisted smile. “Promisse, boy, and the bargain will be sstruck.”

So coming up we’ll get one of those chapters where the villain explains everything. Always a good sign when the villain does apparently nothing for 90 or so out of 100 chapters, and then explains the significance of everything at the very end.

HPMOR 106

Not much happens here: Quirrell kills the three-headed Cerberus to get past the first puzzle. When Hariezer points out that might have alerted someone, Quirrell is all “eh, I fucked all the wards up.”

So I guess more time to go before we get the villain monologue chapter.

HPMOR 107

Still no villain monologue. Quirrell and Hariezer encounter the other puzzles from the book, and Quirrell blasts them to death with magic fire rather than actually solve them.

However, Quirrell has some random reasons to not blast apart the potion room (he respects Snape or something, blah, blah). Anyway, apparently this means he’ll have to make a long and complicated potion, which will give Quirrell and Hariezer some time to talk.

Side note: credit where credit is due, I again notice these chapters flow much better, and have a much smoother writing style. There is some wit in the way that Quirrell just hulk-smashes all the puzzles (although stopping at Snape’s puzzle seems like a contrived way to drop the monologue we know is coming next chapter or so into the story). When things are happening, HPMOR can be decently written.

HPMOR 108: monologue

So we get the big “explain everything” monologue, and it’s kind of a let down?

The first secret we get- Hariezer is indeed a copy of Voldemort (which was just resolving some dramatic irony, we all knew this because we read the books). In a slight twist, we find out that he was intentionally a horcrux-

It occurred to me how I might fulfill the Prophecy my own way, to my own benefit. I would mark the baby as my equal by casting the old horcrux spell in such fashion as to imprint my own spirit onto the baby’s blank slate… I would arrange with the other Tom Riddle that he should appear to vanquish me, and he would rule over the Britain he had saved. We would play the game against each other forever, keeping our lives interesting amid a world of fools.

But apparently creating the horcrux created some sort of magic resonance and killed his body. But he had somehow made true-immortal horcruxes. Unfortunately, he had put them in stupid places like on the Pioneer probe or in volcanoes where people would never touch them, so he never managed to find a host (remember when I complained about that a few chapters back?)

Hariezer does point out that Voldemort should have tested the new horcrux spell. He suggests Voldie failed to do so because he doesn’t think about doing nice things, but Voldie could have just horcruxed someone, killed them to test it, then destroyed the horcrux, then killed them for real. Not nice, pretty straightforward. It feels like this is going to be Voldemort’s weakness that gets exploited.

We find out that the philosopher’s stone makes transfigurations permanent, which I guess is a minor twist on the traditional legend? Really, just a specific way of making it work- in the legends it can transmute metals, heal sickness, bring dead plants back to life, let you make homunculi, etc.

In HPMOR, powerful magical artifacts can’t have been produced recently, because lore is lost or whatever, so we get a grimdark history of the stone, involving a Hogwarts student seducing Professor Baba Yaga to trick her into taking her virginity so she could steal the stone. Really incidental to anything. Anyway, Flamel, who stole the stone, is both man and woman and uses the stone to transmute back and forth, and apparently gave Dumbledore power to fight Grindelwald.

Quirrell killed Hermione (duh) because

I killed Miss Granger to improve your position relative to that of Lucius Malfoy, since my plans did not call for him to have so much leverage over you.

I don’t think this actually makes much sense at all? It’s pretty clear Voldie plans to kill Hariezer as soon as this is over, so why should he care about Malfoy at all in this? I had admittedly assumed he killed Hermione to help his dark side take over Hariezer or something.

Apparently they raided Azkaban to find out where Black had hidden Quirrell’s wand.

Also, as expected, Voldemort was both Monroe and Voldemort and was playing both sides in order to gain political power. He wanted to get political power because he was afraid muggles would destroy the world.

Basically, every single reveal is what you’d expect from the books. Harry Potter and The Obvious Villain Monologue.

The only open question is why the Hariezer-crux, given how that spell is supposed to work, didn’t have any of Voldemort’s memories up until that time? I expect we are supposed to chalk it up to “the spell didn’t quite work because of the resonance that blew everything up” or whatever.

HPMOR 109: supreme author wank

We get to the final mirror, and we get this bit of author wank:

Upon a wall of metal in a place where no one had come for centuries, I found written the claim that some Atlanteans foresaw their world’s end, and sought to forge a device of great power to avert the inevitable catastrophe. If that device had been completed, the story claimed, it would have become an absolutely stable existence that could withstand the channeling of unlimited magic in order to grant wishes. And also - this was said to be the vastly harder task - the device would somehow avert the inevitable catastrophes any sane person would expect to follow from that premise. The aspect I found interesting was that, according to the tale writ upon those metal plates, the rest of Atlantis ignored this project and went upon their ways. It was sometimes praised as a noble public endeavor, but nearly all other Atlanteans found more important things to do on any given day than help. Even the Atlantean nobles ignored the prospect of somebody other than themselves obtaining unchallengeable power, which a less experienced cynic might expect to catch their attention. With relatively little support, the tiny handful of would-be makers of this device labored under working conditions that were not so much dramatically arduous, as pointlessly annoying. Eventually time ran out and Atlantis was destroyed with the device still far from complete. I recognise certain echoes of my own experience that one does not usually see invented in mere tales.”

Get it? It’s friendly AI and we are all living in Atlantis! And Yud is bravely toiling away in obscurity to save us all! (Note: toiling in obscurity in this context means soliciting donations to continue running one of the least productive research organizations in existence.)

Anyway, after this bit of wankery, Voldie and Hariezer return to the problem of how to get the stone. The answer turns out to be Confunding himself into thinking that he is Dumbledore wanting the stone back after Voldemort has been defeated.

I point out that the book’s original condition where the way to get the stone was to not want the stone was vastly more clever. Don’t think of elephants, and all that.

Anyway, after Voldemort gets the stone, Dumbledore shows up.

HPMOR 110

Apparently Dumbledore was going to use the mirror as a trap to banish Voldemort. But when he saw Hariezer was with him, Dumbledore sacrificed himself to save Hariezer. So now Dumbledore is banished somewhere.

So I read chapter 114

And I kind of don’t get how Hariezer’s solution worked? He turned the ground near his wand into spider silk, but how did he get it around everyone’s neck? Why did he make it spider silk before turning it into nanowire?

HPMOR 111

So in this chapter, Hariezer is stripped of wand, pouch and time turner, and Voldemort has the philosopher’s stone. It’s looking pretty bad for our hero.

Voldemort walks Hariezer to an altar he has prepared, does some magic stuff, and a new shiny body appears for him. So now he has fully resurrected.

Voldemort then brings back Hermione (Hariezer HAD stolen the remains). I note that I don’t think Hariezer has actually done anything in this entire story- his first agenda- study magic using science- was a complete bust, and then his second agenda was a big resolution to bring back Hermione, but Voldemort did it for him. Voldemort also crossed Hermione with a troll and a unicorn (creating what I guess we can call a Trermionecorn), so she is now basically indestructible. Why didn’t Voldemort do this to his own body? No idea. Why would someone obsessed with fighting death and pissed as hell about how long it took him to reclaim his old body think to give Hermione wolverine-level indestructibility but not himself? Much like the letter Hariezer got to set this up, it’s always bad when characters have to behave out of character in order to set the plot up.

Anyway, to bring Hermione back Voldie gives Hariezer his wand back so he can hit her with his super-patronus. So he now has a wand.

Voldemort then asks Hariezer for Roger Bacon’s diary (which he turns into a Hermione horcrux), which prompts Hariezer to say

I tried translating a little at the beginning, but it was going slowly -” Actually, it had been excruciatingly slow and Harry had found other priorities.

Yep, doing experiments to discover magic, despite being the premise of the first 20ish chapters, immediately stopped being any priority at all. Luckily it shows up here as an excuse for Hariezer to get his pouch back (he retrieved it to get the diary).

While Voldemort is distracted making the horcrux, Hariezer whips a gun out of the pouch and shoots Voldemort. Something tells me it didn’t work.

HPMOR 112

Not surprisingly, the shots did nothing to Voldemort who apparently can create a wall of dirt faster than a bullet travels. Apparently it was a trick, because Voldemort had created some system where no Tom Riddle could kill any other Tom Riddle unless the first had attacked him. Somehow, Hariezer hadn’t been bound to it, and now neither of them are.

I don’t know why Yudkowsky introduced this curse and then had it broken immediately? It would have been more interesting to force Hariezer to deal with Voldemort without taking away his immortality. Actually, given all the focus on memory charms in the story, it’s pretty clear that when he achieves ultimate victory Hariezer will do it with a memory charm- turning Tom Riddle into an immortal amnesiac or implanting other memories in him (so he is an immortal guy who works in a restaurant or something).

In retaliation, Voldemort hits him with a spell that takes everything but his glasses and wand, so he is naked in a graveyard, and a bunch of death eaters teleport in (37 of them).

HPMOR 113

This is a short chapter, Hariezer is still in peril. As Voldemort’s death eaters pop in, one of them tries to turn against him but Voldemort kills him dead.

And then he forces Hariezer to swear an oath to not try to destroy the world- Voldemort’s plan is apparently to resurrect Hermione, force Hariezer to not destroy the world, and then to kill him. (I note that if he did things in a slightly different order, he’d have already won…)

So this chapter ends with Hariezer surrounded by death eaters, all with wands pointed at him, ready to kill him, and we get this (sorry, it’s a long quote).

This is your final exam. Your solution must at least allow Harry to evade immediate death, despite being naked, holding only his wand, facing 36 Death Eaters plus the fully resurrected Lord Voldemort. 12:01AM Pacific Time (8:01AM UTC) on Tuesday, March 3rd, 2015, the story will continue to Ch. 121. Everyone who might want to help Harry thinks he is at a Quidditch game. he cannot develop wordless wandless Legilimency in the next 60 seconds. the Dark Lord’s utility function cannot be changed by talking to him. the Death Eaters will fire on him immediately. if Harry cannot reach his Time-Turner without Time-Turned help - then the Time-Turner will not come into play. Harry is allowed to attain his full potential as a rationalist, now in this moment or never, regardless of his previous flaws. if you are using the word ‘rational’ correctly, is just a needlessly fancy way of saying ‘the best solution’ or ‘the solution I like’ or ‘the solution I think we should use’, and you should usually say one of the latter instead. (We only need the word ‘rational’ to talk about ways of thinking, considered apart from any particular solutions.) if you know exactly what a smart mind would do, you must be at least that smart yourself….

The issue here is that literally any solution, no matter how outlandish, will work if you can design the rules to make the munchkin work.

Hariezer could use partial transfiguration to make an atomic lattice under the earth with a few atoms of anti-matter under each death eater to blow up each of them.

He could use the fact that Voldemort apparently cast broomstick spells on his legs to levitate Voldemort. And after levitating Voldemort away convince the death eaters he is even more powerful than Voldie.

He could use the fact that Tom Riddle controls the dark mark to burn all the death eaters where they stand- Voldemort can’t kill Hariezer with magic because of the resonance, and a simple shield should stop bullets.

He could partially transfigure a dead man’s switch of sorts, so that if he dies fireworks go up and the quidditch game gets disrupted- he knows it didn’t, so he knows he won’t die.

In my own solution I posted earlier, he could talk his way out using logic-ratio-judo. Lots of options other than my original posted solution here- convince them that killing Hariezer makes the prophecy come true because of science reasons,etc.

He could partially transfigure poison gas or knock out gas.

Try to come up with your own outlandish solution. The thing is, all of these work because we don’t really know the rules of HPMOR magic- we have a rough outline, but there is still a lot of author flexibility. So there is that.

HPMOR 114: solutions to the exam

Hariezer transfigures part of his wand into carbon nanotubes that he makes into a net around each death eater’s head and around Voldemort’s hands.

He uses it to decapitate all the death eaters (killing Malfoy’s father almost certainly) and to take Voldemort’s hands off.

Voldemort then charges at him but Hariezer hits him with a stunning spell.

I note that this solution is as outlandish as any.

It WAS foreshadowed in the very first chapter, but that doesn’t make it less outlandish.

HPMOR 115-116-117

Throughout the story, whenever there is action we then have 5-10 chapters where everyone digests what happened. These chapters all fit in that vein.

Hariezer rearranges the scene of his victory to put Voldemort’s hands around Hermione’s neck and then rigs a big explosion. He then time turners back to the quidditch game (why the hell did he have time left on his time turner?)

Anyway, when he gets back to the game he does a whole “I CAN FEEL THE DARK LORD COMING” thing, and says that Hermione followed him back. I guess the reason for this plot is just to keep him out of it/explain how Hermione came back? You’d think he could have just hung around the scene of the crime, waited ‘till he was discovered, and explained what happened?

Then in chapter 117, McGonagall explains to the school what was found- Malfoy’s father is dead, Hermione is alive, etc.

So after the big HPMOR reveal

It sort of feels like HPMOR is just the overly wordy gritty Potter reboot with some science stuff slapped on to the first 20 chapters or so.

Like, Potter is still a horcrux, Voldemort still wanted to take over the wizard world and kill the muggles, etc.

Even the anti-death themes have fallen flat because of “show, don’t tell”- Dumbledore was a “deathist” but he was on the side of not killing all the muggles, Voldemort actually defeated death but that position rides alongside the kill-everyone ethos, and Hariezer’s resolution to end death apparently was going to blow up the entire world. So the characters might argue the positions, but for reasons I don’t comprehend, actually following through is shown as a terrible idea.

I read the rest of HPMOR/113 puzzle

I read HPMOR, and will put chapter updates when I have time, but I wanted to put down my version of how Hariezer will get out of the predicament at the end of 113. I fear if I put this down after the next chapter is released, and if it’s correct, people will say I looked ahead.

Anyway, the way that this would normally be solved in HPMOR is simply the time turner- Hariezer would resolve to go and find McGonagall or Bones or whoever when this was all over and tell them to time turner into this location and bring the heat. Or whatever. But that has been disallowed by the rules.

But I think Yud is setting this up as a sort of “AI box” experiment, because he has his obsessions and they show up time and time again. So the solution is simply to convince Voldemort to let him go. How? In a version of Roko’s basilisk he needs to convince Voldemort that they might be in a simulation- i.e. maybe they are still looking in the mirror. Hasn’t everything gone a little too well since they first looked in? Dumbledore was vanquished, bringing back Hermione was practically easy, every little thing has been going perfectly. Maybe the mirror is just simulating what he wants to see?

So two ways to go from here- Hariezer is also looking in the mirror (and he has also gotten what he wanted, Hermione being brought back) so he might be able to test this just by wishing for something.

Or, Hariezer can convince Voldemort that the only way to know for sure is for Voldemort to fail to get something he wants, and the last thing he wants is for Hariezer to die.

HPMOR 118

The story is still in resolution mode, and I want to point out one thing that this story does right that the original books failed at- which is an actual resolution. The one big failure of the original Harry Potter books, in my mind, was that after Voldemort was defeated, we flashed immediately to the epilogue. No funeral for the departed (no chance to say goodbye to Fred Weasley, etc.).

Of course, in the HPMOR style, there is a huge resolution after literally every major event in the story, so it’s at least in part a stopped clock situation.

This chapter is Quirrell’s funeral, which is mostly a student giving a long eulogy (remember, Hariezer dressed things up to make it look like Quirrell died fighting Voldemort, which is sort of true, but not the Quirrell anyone knew.)

HPMOR 119

Still in resolution mode.

Hariezer comes clean with (essentially) the order of the Phoenix members, and tells them about how Dumbledore is trapped in the mirror. This leads to him receiving some letters Dumbledore left.

We find out that Dumbledore has been acting to fulfill a certain prophecy that Hariezer plays a role in-

Yet in your case, Harry, and in your case alone, the prophecies of your apocalypse have loopholes, though those loopholes be ever so slight. Always ‘he will end the world’, not ‘he will end life’.

So I guess he’ll bring in the transhumanist future.

Hariezer has also been given Dumbledore’s place, which Amelia Bones is annoyed at, so he makes her regent until he is old enough.

We also get one last weird pointless rearrangement of the canon books- apparently Peter Pettigrew was one of those shape shifting wizards and somehow got tricked into looking like Sirius Black. So the wrong person has been locked up in Azkaban. I don’t really “get” this whole Sirius/Peter were lovers/Sirius was evil/the plot of book 3 was a schizophrenic delusion running thread (also, Hariezer deduces, with only the evidence that there is a Black in prison and a dead Black among the death eaters, that Peter Pettigrew was a shapeshifter, that Peter imitated Black, and that Peter is the one in Azkaban.)

And Hariezer puts a plan in place to open a hospital using the philosopher’s stone, so BAM, death is defeated, at least in the wizarding world. Unless it turns out the stone has limits or something.

HPMOR 120

More resolution.

Hariezer comes clean to Draco about how the death eaters died. (why did he go to the effort of the subterfuge, if he was going to come clean to everyone afterwards? It just added a weird layer to all this resolution).

Draco is sad his parents are dead. BUT, surprise- as I predicted way back when, Dumbledore only faked Narcissa Malfoy’s death and put her in magic witness protection.

I think one of the things I strongly dislike about HPMOR is that there doesn’t seem to be any joy...

I think one of the things I strongly dislike about HPMOR is that there doesn’t seem to be any joy purely in the discovery. People have fun playing the battle games, or fighting bullies with time turners, or generally being powerful, but no one seems to have fun just trying to figure things out.

For some reason (the reason is that I have a fair amount of scotch in me actually), my brain keeps trying to put together an imprecise metaphor to old SNES RPGs- a friend of mine in grade school loved FF2, but he always went out of his way to find all the powerups and do all the side quests, etc. This meant he was always powerful enough to smash boss fights in one or two punches. And I always hated that- what is the fun in that? What is the challenge? When things got too easy, I started running from all the random encounters and stopped buying equipment so that the boss battles were more fun.

And HPMOR feels like playing the game the first way- instead of working hard at the fun part (discovery), you get to just use Aristotle’s method (Harry Potter and Methods of Aristotelian Science) and slap an answer down. And that answer makes you more powerful- you can time turner all your problems away like shooing a mosquito with a flamethrower, and when a dementor shows up you get to destroy it just by thinking hard- no discovery required. The story repeatedly skips the fun part- the struggle, the learning, the discovery.

HPMOR 121

Snape leaves Hogwarts, thus completing an arc I don’t think I ever cared about.

HPMOR 122: the end of the beginning

So unlike the canon books, the end of HPMOR sets it up more as an origin story than a finished adventure. After the canon books, we get the impression Harry, Ron and Hermione settled into peaceful wizard lives. After HPMOR, Hariezer has set up a magical think tank to solve the problem of friendly magic, with Hermione as his super-powered, indestructible lab assistant (tell me again how Hariezer isn't a self insert?), and we get the impression the real work is just starting. He also has the idea to found CFAR:

It would help if Muggles had classes for this sort of thing, but they didn't. Maybe Harry could recruit Daniel Kahneman, fake his death, rejuvenate him with the Stone, and put him in charge of inventing better training methods...

We also learn that a more open society of idea sharing is an idea so destructive that Hariezer's vow to end the world wouldn't let him do it:

Harry could look back now on the Unbreakable Vow that he'd made, and guess that if not for that Vow, disaster might have already been set in motion yesterday when Harry had wanted to tear down the International Statute of Secrecy.

So secretive magiscience lead by Hariezer (with Hermione as super-powered "Sparkling Unicorn Princess" side kick) will save the day, sometime in the future.

su3su2u1 physics tumblr archive

2016-03-01 08:00:00

These are archived from the now defunct su3su2u1 tumblr.

A Roundabout Approach to Quantum Mechanics

This will be the first post in what I hope will be a series that outlines some ideas from quantum mechanics. I will try to keep it light, and not overly math filled- which means I’m not really teaching you physics. I’m teaching you some flavor of the physics. I originally wrote here “you can’t expect to make ice cream just having tasted it,” but I think a better description might be “you can’t expect to make ice cream just having heard someone describe what it tastes like.” AND PLEASE, PLEASE PLEASE ask questions. I’m used to instant feedback on my (attempts at) teaching, so if readers aren’t getting anything out of this, I want to stop or change or something.

Now, unfortunately I can’t start with quantum mechanics without talking about classical physics first. Most people think they know classical mechanics, having learned it on their mother’s knee, but there are so, so many ways to formulate classical physics, and most physics majors don’t see some really important ones (in particular Hamiltonian and Lagrangian mechanics) until after quantum mechanics. This is silly, but at the same time university is only 4 years. I can’t possibly teach you all of these huge topics, but I will need to rely on a few properties of particles and of light. And unlike intro Newtonian mechanics, I want to focus on paths. Instead of asking something like “a particle starts here with some velocity, where does it go?” I want to focus on “a particle starts here, and ends there. What path did it take?”

So today we start with light, and a topic I rather love. Back in the day, before “nerd-sniping” several generations of mathematicians, Fermat was laying down a beautiful formulation of optics-

Light always takes the path of least time

I hear an objection: “isn’t that just straight lines?” We have to combine this insight with the notion that light travels at different speeds in different materials. For instance, we know light slows down in water by a factor of about 1.3.

So let’s look at a practical problem: you see a fish swimming in water (I apologize in advance for these diagrams):

I drew the (hard to see) dotted straight line between your eye and the fish.

But that isn’t what the light does- there is a path that saves the light some time. The light travels faster in air than in water, so it can travel further in the air, and take a shorter route in the water to the fish.

This is a more realistic path for the light- it bends when it hits the water- it does this in order to take paths of least time between points in the water and points in the air. Exercise for the mathematical reader- you can work this out quantitatively and derive Snell’s law (the law of refraction) just from the principle of least time.
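If you’d rather poke at this numerically than with calculus, here is a minimal Python sketch (mine, not from the original post; the eye/fish geometry and the 1.33 index are made-up example numbers). It scans crossing points along the water’s surface, keeps the least-time one, and checks that the resulting bend satisfies Snell’s law:

# Brute-force the least-time path for light going from an eye in air to a
# fish underwater, then verify Snell's law: sin(t1)/sin(t2) = n_water/n_air.
import math

n_air, n_water = 1.0, 1.33     # light is ~1.33x slower in water
eye = (0.0, 1.0)               # 1 m above the surface (the surface is y = 0)
fish = (2.0, -1.0)             # 2 m over and 1 m below the surface

def travel_time(x):
    """Time (up to a factor of 1/c) for the path eye -> (x, 0) -> fish."""
    air_leg = math.hypot(x - eye[0], eye[1])
    water_leg = math.hypot(fish[0] - x, fish[1])
    return n_air * air_leg + n_water * water_leg

# Crude minimization: just scan candidate crossing points on the surface.
best_x = min((i * 1e-4 for i in range(20001)), key=travel_time)

sin_in = (best_x - eye[0]) / math.hypot(best_x - eye[0], eye[1])
sin_out = (fish[0] - best_x) / math.hypot(fish[0] - best_x, fish[1])
print(sin_in / sin_out, n_water / n_air)   # both come out ~1.33

The only physics that went in was “minimize the travel time”; the law of refraction falls out of the minimization.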

And one more realistic example: Lenses. How do they work?

So that bit in the middle is the lens and we are looking at light paths that leave 1 and travel to 2 (or vice versa, I guess).

The lens is thickest in the middle, so the dotted line gets slowed the most. Path b is longer, but it spends less time in the lens- that means with careful attention to the shape of the lens we can make the time of path b equal to the time of the dotted path.

Path a is the longest path, and just barely touches the lens, so is barely slowed at all, so it too can be made to take the same time as the dotted path (and path b).

So if we design our lens carefully, all of the shortest-time paths that touch the lens end up focused back to one spot.

So that’s the principle of least time for light. When I get around to posting on this again we’ll talk about particles.

Now, these sort of posts take some effort, so PLEASE PLEASE PLEASE tell me if you got something out of this.

Edit: And if you didn’t get anything out of this, because it’s confusing, ask questions. Lots of questions, any questions you like.

More classical physics of paths

So after some thought, these posts will probably be structured by first discussing light, and then turning to matter, topic by topic. It might not be the best structure, but it’s at least giving me something to organize my thoughts around.

As in all physics posts, please ask questions. I don’t know my audience very well here, so any feedback is appreciated. Also, there is something of an uncertainty principle between clarity and accuracy. I can be really clear or really accurate, but never both. I’m hoping to walk the middle line here.

Last time, I mentioned that geometric optics can be formulated by the simple principle that light takes the path of least time. This is a bit different than many of the physics theories you are used to- generally questions are phrased along the lines of “Alice throws a football from position x, with velocity v, where does Bob need to be to catch the football.” i.e. we start with an initial position and velocity. Path based questions are usually “a particle starts at position x_i,t_i and ends at position x_f,t_f, what path did it take?”

For classical non-relativistic mechanics, the path based formulation is fairly simple, we construct a quantity called the “Lagrangian” which is defined by subtracting potential energy from kinetic energy (KE - PE). Recall that kinetic energy is 1/2 mv^2, where m is the mass of the particle and v is the velocity, and potential energy depends on the problem. If we add up the Lagrangian at every instant along a path we get a quantity called the action (S is the usual symbol for action, for some reason) and particles take the path of least action. If you know calculus, we can put this as

[ S = \int (KE - PE)\, dt ]

The action has units of energy*time, which will be important in a later post.

Believe it or not, all of Newton’s laws are contained in this minimization principle. For instance, consider a particle moving with no outside influences (no potential energy). Such a particle has to minimize its v^2 over the path it takes.

Any movement away from the straight line will cause an increase in the length of the path, so the particle will have to travel faster, on average, to arrive at its destination. We want to minimize v^2, so we can deduce right away the particle will take a straight line path.

But what about its speed? Should a particle move very slowly to decrease v^2 as it travels, and then “step on the gas” near the end? Or travel at a constant speed? It’s easy to show that the minimum-action path is the constant-speed path (give it a try!). This gives us back Newton’s first law.
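Here’s a quick numerical sanity check of that claim, a toy example of my own rather than anything from the original post: compare the action of the constant-speed path with a “dawdle then sprint” path between the same endpoints.

# For a free particle going from x=0 to x=1 in 1 second, the constant-speed
# path has a smaller action S = sum of (1/2) m v^2 dt than an uneven path.
m, dt, steps = 1.0, 0.001, 1000

def action(x_of_t):
    xs = [x_of_t(i * dt) for i in range(steps + 1)]
    vs = [(xs[i + 1] - xs[i]) / dt for i in range(steps)]
    return sum(0.5 * m * v * v * dt for v in vs)   # kinetic energy only

constant_speed = lambda t: t           # x(t) = t
dawdle_then_sprint = lambda t: t ** 3  # same endpoints, lopsided speed

print(action(constant_speed))          # ~0.5
print(action(dawdle_then_sprint))      # ~0.9, bigger, as promised

Both paths cover the same ground in the same time; the lopsided one pays for its late sprint with a much larger average v^2, and so a larger action.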

You can also consider the case of a ball thrown straight up into the air. What path should it take? Now we have potential energy mgh (where h is the height, and g is a gravitational constant). But remember, we subtract the potential energy in the action- so the particle can lower its action by climbing higher.

Along the path of least action in a gravitational field, the particle will move slowly at high h to spend more time at low-action, and will speed up as h decreases (it needs to have an average velocity large enough to get to its destination on time). If you know calculus of variations, you can calculate the required relationship, and you’ll find you get back exactly the Newtonian relationship (acceleration of the particle = g).
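Here’s a rough numerical sketch of this claim (my own illustration, with made-up numbers, not anything rigorous): discretize the action for a ball thrown straight up and compare the Newtonian parabola against a couple of distorted paths with the same endpoints.

```python
# Discretized action S = sum over time of (KE - PE) * dt for a thrown ball.
# The Newtonian parabola should have the smallest action of the candidates.
import numpy as np

m, g = 1.0, 9.8
t = np.linspace(0, 2.0, 2001)
dt = t[1] - t[0]
v0 = g * 1.0                                  # chosen so the ball returns to h = 0 at t = 2

def action(h):
    v = np.gradient(h, dt)                    # velocity along the path
    return np.sum((0.5 * m * v**2 - m * g * h) * dt)

newton = v0 * t - 0.5 * g * t**2              # the actual trajectory
bump = newton + 0.5 * np.sin(np.pi * t / 2)   # same endpoints, distorted shape
triangle = np.where(t < 1, newton.max() * t, newton.max() * (2 - t))  # straight up, straight down

for name, h in [("newton", newton), ("bump", bump), ("triangle", triangle)]:
    print(f"{name:>8}: S = {action(h):7.2f}")
# newton: about -32.0, bump: about -31.7, triangle: about -24.0;
# the Newtonian path has the least action of these candidates
```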

Why bother with this formulation? It makes a lot of problems easier. Sometimes specifying all the forces is tricky (imagine a bead sliding on a metal hoop. The hoop constrains the bead to move along the circular hoop, so the forces are just whatever happens to be required to keep the bead from leaving the hoop. But the energy can be written very easily if we use the right coordinates). And with certain symmetries its a lot more elegant (a topic I’ll leave for another post).

So to wrap up both posts- light takes the path of least time, particles take the path of least action. (One way to think about this is that light has a Lagrangian that is constant. This means that the only way to lower the action is to find the path that takes the least time). This is the take away points I need for later- in classical physics particles take the path of least action.

I feel like this is a lot more confusing than previous posts because its hard to calculate concrete examples. Please ask questions if you have them.

Semi classical light

As always math will not render properly on tumblr dash, but will on the blog. This post contains the crux of this whole series of posts, so it’s really important to try to understand this argument.

Recall from the first post I wrote that one particularly elegant formulation of geometric optics is Fermat’s principle:

light takes the path of least time

But, says a young experimentalist (pun very much intended!), look what happens when I shine light through two slits, I get a pattern like this:

Light must be a wave.

"Wait, wait, wait!" I can hear you saying. Why does this two slit thing mean that light is a wave?

Let us talk about the key feature of waves- when waves come together they can combine in different ways:


So when physicists want to represent waves, we need to take into account not just the height of the wave, but also the phase of the wave. The wave can be at “full hump” or “full trough” or anywhere in between.

The technique we use is called “phasors” (not to be confused with phasers). We represent waves as little arrows, spinning around in a circle:


The length of the arrow, A, is called the amplitude and represents the height of the wave. The angle, (\theta), represents the phase of the wave. (The mathematical sophisticates among us will recognize these as complex numbers of the form (Ae^{i\theta}).) With these arrows, we can capture all the add/subtract/partially-add features of waves:


So how do we use this to explain the double slit experiment? First, we assume all the light that leaves the same source has the same amplitude. And the light has a characteristic period, T. It takes T seconds for the light to go from “full trough” back to “full trough” again.

In our phasor diagram, this means we can represent the phase of our light after t seconds as:

[\theta = \frac{2\pi t}{T} ]

Note, we are taking the angle here in radians. 2 pi is a full circle. That way when t = T, we’ve gone a full circle.

We also know that light travels at speed c (c being the “speed of light,” after all). So as light travels a path of length L, the time it traveled is easily calculated as (\frac{L}{c}).

Now, let’s look at some possible paths:

The light moves from the dot on the left, through the two slits, and arrives at the point X. Now, for the point X at the center of the screen, both paths will have equal lengths. This means the waves arrive with no difference in phase, and they add together. We expect a bright spot at the center of the screen (and we do get one).

Now, let’s look at points further up the screen:

As we move away from the center, the paths have different lengths, and we get a phase difference in the arriving light:

[\theta_1 - \theta_2= \frac{2\pi }{cT} \left(L_1 - L_2\right) ]

So what happens? As we move up the wall, the path-length difference gets bigger and the phase difference increases. Every time the phase difference is an odd multiple of pi we get cancellation, and a dark spot. Every time it’s a multiple of 2 pi, we get a bright spot. This is exactly Young’s results.
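Here’s a small phasor sketch of that argument (my own illustration, with made-up numbers for the slits and screen): for each point on the screen, add the two phasors and square the total amplitude.

```python
# Two-slit intensity from nothing but phasors and path lengths.
import numpy as np

wavelength = 500e-9                 # c*T for green-ish light
slit_separation = 50e-6
screen_distance = 1.0
y = np.linspace(-0.025, 0.025, 11)  # points going up the screen

# path lengths from each slit to each screen point
L1 = np.hypot(screen_distance, y - slit_separation / 2)
L2 = np.hypot(screen_distance, y + slit_separation / 2)

# theta = 2*pi*t/T with t = L/c and c*T = wavelength
phase1 = 2 * np.pi * L1 / wavelength
phase2 = 2 * np.pi * L2 / wavelength
intensity = np.abs(np.exp(1j * phase1) + np.exp(1j * phase2))**2

for yi, inten in zip(y, intensity):
    print(f"y = {yi*1000:6.1f} mm   intensity = {inten:.2f}")
# the intensity swings between ~4 (phasors aligned) and ~0 (phasors cancel)
# as you move up the screen
```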

But wait a minute, I can hear a bright student piping up (we’ll call him Feynman, but it would be more appropriate to call him Huygens in this case). Feynman says “What if there were 3 slits?”

Well, then we’d have to add up the phasors for 3 different slits. It’s more algebra, but when they all line up, it’s a bright spot, when they all cancel it’s a dark spot, etc. We could even have places where two cancel out, and one doesn’t.

"But, what if I made a 4th hole?" We add up four phasors. "A 5th? "We add up 5 phasors.

"What if I drilled infinite holes? Then the screen wouldn’t exist anymore! Shouldn’t we recover geometric optics then?"

Ah! Very clever! But we DO recover geometric optics. Think about what happens if we add up infinitely many paths. We are essentially adding up infinitely many random phasors of the same amplitude:

So we expect all these random paths to cancel out.

But there is a huge exception.

Those random angles are because when we grab an arbitrary path, the time light takes on that path is random.

But what happens near a minimum? If we parameterize our random paths, near the minimum the graph of time-of-travel vs parameter looks like this:

The graph gets flat near the minimum, so all those little Xs have roughly the same phase, which means all those phasors will add together. So the minimum path gets strongly reinforced, and all the other paths cancel out.

So now we have one rule for light:

To calculate how light moves forward in time, we add up the associated phasors for light traveling every possible path.

BUT, when we have many, many paths we can make an approximation. With many, many paths the only one that doesn’t cancel out, the only one that matters, is the path of minimum time.
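If you want to see the cancellation argument numerically, here’s a quick sketch (my own illustration): add up a pile of unit phasors with random phases, then a pile with nearly equal phases, which is what you get near the minimum.

```python
# Random phases mostly cancel; nearly equal phases (paths near the minimum) reinforce.
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

random_phases = rng.uniform(0, 2 * np.pi, N)   # arbitrary paths: essentially random times
nearby_phases = rng.normal(0.0, 0.05, N)       # paths near the minimum: almost equal times

print(abs(np.exp(1j * random_phases).sum()))   # ~a few hundred (~sqrt(N)): mostly cancels
print(abs(np.exp(1j * nearby_phases).sum()))   # ~100,000 (~N): adds up
```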

Semi-classical particles

Recall from the previous post that we had improved our understanding of light. Light, we suggested, is a wave, which means

Light takes all possible paths between two points, and the phase of the light depends on the time along the path light takes.

Further, this means:

In situations where there are many, many paths the contributions of almost all the paths cancel out. Only the path of least time contributes to the result.

The astute reader can see where we are going. We already learned that classical particles take the path of least action, so we might guess at a new rule:

Particles take all possible paths between two points, and the phase of the particle depends on the action along the path the particle takes.

Recall from the previous post that the way we formalized this is that the phase of light could be calculated with the formula

[\theta = \frac{2\pi}{T} t]

We would like to make a similar formula for particles, but instead of time it must depend on the action. But what will we do for the particle equivalent of the “period?” The simplest guess we might take is a constant. Let’s call the constant h, Planck’s constant (because that’s what it is). It has to have the same units as the action, which are energy*time.

[\theta = \frac{2\pi}{h} S]

It’s pretty common in physics to use a slightly different constant (\hbar = \frac{h}{2\pi} ) because it shows up so often.

[\theta = \frac{S}{\hbar}]

So we have this theory- maybe particles are really waves! We’ll just run a particle through a double slit and we’ll see a pattern just like the light!

So we set up our double slit experiment, throw a particle at the screen, and blip. We pick up one point on the other side. Huh? I thought we’d get a wave. So we do the experiment over and over again, and this is what results:

So we do get the pattern we expected, but only built up over time. What do we make of this?

Well, one thing seems obvious- the outcome of a large number of experiments fits our prediction very well. So we can interpret the result of our rule as a probability instead of a traditional fully determined prediction. But probabilities have to be positive, so we’ll say the probability is proportional to the square of our amplitude.

So let’s rephrase our rule:

To predict the probability that a particle will arrive at a point x at time t, we take a phasor for every possible path the particle can take, with a phase depending on the action along the path, and we add them all up. Squaring the amplitude gives us the probability.

Now, believe it or not, this rule is exactly equivalent to the Schroedinger equation that some of us know and love, and pretty much everything you’ll find in an intro quantum book. It’s just a different formulation. But you’ll note that I called it “semi-classical” in the title- that’s because undergraduate quantum doesn’t really cover fully quantum systems, but that’s a discussion for a later post.

If you are familiar with Yudkowsky’s sequence on quantum mechanics or with an intro textbook, you might be used to thinking of quantum mechanics as blobs of amplitude in configuration space changing with time. In this formulation, our amplitudes are associated with paths through spacetime.

When next I feel like writing again, we’ll talk a bit about how weird this path rule really is, and maybe some advantages to thinking in paths.

Basic special relativity

No calculus or light required, special relativity using only algebra. Note- I’m basically typing up some lecture notes here, so this is mostly a sketch.

This derivation is based on a key principle that I believe Galileo first formulated-

The laws of physics are the same in any inertial frame OR there is no way to detect absolute motion. Like all relativity derivations, this is going to involve a thought experiment. In our experiment we have a train that moves from one end of a train platform to the other. At the same time a toy airplane also flies from one end of the platform to the other (originally, I had made a Planes, Trains and Automobiles joke here, but kids these days didn’t get the reference… ::sigh::)

There are two events, event 1- everything starts at the left side of the platform. Event 2- everything arrives at the right side of the platform. The entire time the train is moving with a constant velocity v from the platform’s perspective (symmetry tells us this also means that the platform is moving with velocity v from the train’s perspective.)

We’ll look at these two events from two different perspectives- the perspective of the platform and the perspective of the train. The goal is to figure out a set of equations that let us relate quantities between the different perspectives.

HERE COMES A SHITTY DIAGRAM

The dot is the toy plane, the box is the train. L is the length of the platform from its own perspective. l is the length of the train from its own perspective. T is the time it takes the train to cross the platform from the platform’s perspective. And t is the time the platform takes to cross the train from the train’s perspective.

From the platform’s perspective, it’s easy to see the train has length l’ = L - vT. And the toy plane has speed w = L/T.

From the train’s perspective, the platform has length L’ = l + vt and the toy plane has speed u = l/t

So to summarize

Observer | Time passed between events | Length of the moving object | Speed of plane

Platform | T | train: l’ = L - vT | w = L/T

Train | t | platform: L’ = l + vt | u = l/t

Now, we again exploit symmetry and our Galilean principle. By symmetry,

l’/l = L’/L = R

Now, by the Galilean principle, R as a function can only depend on v. If it depended on anything else, we could detect absolute motion. We might want to just assume R is 1, but we wouldn’t be very careful if we did.

So what we do is this- we want to write a formula for w in terms of u and v and R (which depends only on v). This will tell us how to relate a velocity in the train’s frame to a velocity in the platform’s frame.

I’ll skip the algebra, but you can use the relations above to work this out for yourself

w = (u+v)/(1+(1-R^2)u/v) = f(u,v)

Here I just used f to name the function there.
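If you don’t feel like doing the algebra by hand, here’s a quick sympy sketch (my own check, using the post’s notation) that grinds through the relations above:

```python
# Solve the relations for w in terms of u, v, and R.
import sympy as sp

L, l, v, R, u = sp.symbols("L l v R u", positive=True)

# From l' = R*l = L - v*T and L' = R*L = l + v*t:
T = (L - R * l) / v          # time between events, platform frame
t = (R * L - l) / v          # time between events, train frame

w = L / T                    # plane speed, platform frame
u_expr = l / t               # plane speed, train frame

# eliminate L/l: solve u = l/t for L, then substitute into w
L_of_u = sp.solve(sp.Eq(u, u_expr), L)[0]
w_of_uv = sp.simplify(w.subs(L, L_of_u))

print(w_of_uv)                          # v*(u + v)/(u + v - R**2*u), i.e. (u+v)/(1+(1-R^2)u/v)
print(sp.simplify(w_of_uv.subs(R, 1)))  # R = 1 gives the Galilean answer, u + v
```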

I WILL EDIT MORE IN, POSTING NOW SO I DON’T LOSE THIS TYPED UP STUFF.

More Special Relativity and Paths

This won’t make much sense if you haven’t read my last post on relativity. Math won’t render on tumblr dash, instead go to the blog.

Last time, we worked out formulas for length contraction (and I asked you to work out a formula for time dilation). But what would be more generally useful is a formula relating events between the different frames of reference. Our thought experiment had two events-

event 1, the back end of the train, the back end of the platform, and the toy are all at the same place.

event 2- the front of the train, the front end of the platform, and the toy are all at the same place.

From the toy’s frame of reference, these events occur at the same place, so the only difference between the two events is in time. We’ll call that difference (\Delta\tau) . We’ll always use this to mean “the time between events that occur at the same position” (only in one frame will events occur in the same place), and it’s called proper time.

Now, the toy plane sees the platform move with speed -w, and the length of the platform is RL. So this relationship is just time = distance/speed.

[\Delta\tau^2 = R^2L^2/w^2 = (1-\frac{w^2}{c^2})L^2/w^2 ]

Now, we can manipulate the right hand side by noting that from the platform’s perspective, L^2/w^2 is the time between the two events, and those two events are separated by a distance L. We’ll call the time between events in the platform’s frame of reference (\Delta t), and the distance between the events, L, we’ll call (\Delta x) in general.

[\Delta\tau^2 = (1-\frac{w^2}{c^2})L^2/w^2 = (\Delta t^2 - \Delta x^2/c^2) ]

Note that the speed w has dropped out of the final version of the equation- this would be true for any frame. Since proper time is unique (every frame has a different time measurement, but only one measures the proper time), we have a frame independent measurement.
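Here’s a quick numerical check of that frame independence (my own sketch, using the standard Lorentz transformation rather than anything derived in these posts):

```python
# dt^2 - dx^2/c^2 comes out the same in every boosted frame.
import numpy as np

c = 3.0e8

def boost(dt, dx, v):
    gamma = 1 / np.sqrt(1 - v**2 / c**2)
    return gamma * (dt - v * dx / c**2), gamma * (dx - v * dt)

dt, dx = 2.0, 1.0e8            # separation between two events (seconds, meters)
for v in (0.0, 0.3 * c, 0.9 * c):
    dt2, dx2 = boost(dt, dx, v)
    print(dt2**2 - dx2**2 / c**2)   # ~3.889 s^2 in every frame
```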

Now, let’s relate this back to the idea of paths that I’ve discussed previously. One advantage of the path approach to mechanics is that if we can create a special relativity invariant action then the mechanics we get is also invariant. So one way we might consider to do this is by looking at proper time- (remember S is the action). Note the negative sign- without it there is no minimum, only a maximum.

[ S \propto -\int d\tau \quad\Rightarrow\quad S = -C\int d\tau ]

Now C has to have units of energy for the action to have the right units.

Now, some sketchy physics math

[ S = -C\int \sqrt{dt^2 - dx^2/c^2} = -C\int dt\, \sqrt{1-\frac{1}{c^2}\frac{dx^2}{dt^2}} ]

[S = -C\int dt\, \sqrt{1-v^2/c^2} ]

So one last step is to note the approximation we can make for (\sqrt{1-v^2/c^2}) when v is much smaller than c, which is (1-\frac{1}{2}v^2/c^2)

So all together, for small v

[S = C\int dt (\frac{v^2}{2c^2} - 1)]

So if we pick the constant C to be mc^2, then we get

[S = \int dt (1/2 mv^2 - mc^2)]

We recognize the first term as just the kinetic energy we had before! The second term is just a constant and so won’t affect where the minimum is. This gives us a new understanding of our path rule for particles- particles take the path of maximum proper time (it’s this understanding of mechanics that translates most easily to general relativity).
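A tiny numerical illustration of “maximum proper time” (my own sketch, in units where c = 1): compare the straight worldline between two events with a worldline that detours out and back at 0.8c.

```python
# Proper time along two worldlines with the same endpoints.
import numpy as np

c = 1.0
T = 10.0                                   # both paths run from (t=0, x=0) to (t=10, x=0)

def proper_time(ts, xs):
    dts, dxs = np.diff(ts), np.diff(xs)
    return np.sum(np.sqrt(dts**2 - dxs**2 / c**2))

ts = np.linspace(0, T, 1001)
stay_home = np.zeros_like(ts)                                    # the inertial path
detour = np.where(ts < T / 2, 0.8 * c * ts, 0.8 * c * (T - ts))  # out at 0.8c, back at 0.8c

print(proper_time(ts, stay_home))   # 10.0
print(proper_time(ts, detour))      # 6.0: less proper time, so a larger action --
                                    # the inertial path is the one the particle takes
```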

Special relativity and free will

Imagine right now, while you are debating whether or not to post something on tumblr, some aliens in the Andromeda galaxy are sitting around a conference table discussing Andromeda stuff.

So what is the “space time distance” between you right now (deciding what to tumblrize) and those aliens?

Well, the distance between Andromeda and us is something like 2.5 million light years. So that’s a “space time distance” tau (using our formula from last time) of 2.5 million years. So far, so good:

Now, imagine an alien, running late to the Andromeda meeting, is running in. He is running at maybe 1 meter/second. We know that for him lengths will contract and time will dilate. So for him, time on Earth is actually later- using

(\Delta \tau^2 = \Delta t^2 - \Delta x^2/ c^2)

and using our formula for length contraction, we can calculate that according to our runner in Andromeda the current time on Earth is about 9 days later than today.

So simultaneous to the committee sitting around on Andromeda, you are just now deciding what to tumblrize. According to the runner, it’s 9 days later and you’ve already posted whatever you are thinking about + dozens of other things.

So how much free will do you really have about what you post? (This argument is originally due to Rietdijk and Putnam).

We are doing Taylor series in calculus and it's really boring. What would you add from physics?

First, sorry I didn’t get to this for so long.

Anyway, there is a phenomenon in physics where almost everything is modeled as a spring (simple harmonic motion is everywhere!). You can see this in discussion of resonances. Wave motion can be understood as springs coupled together, etc., and lots of systems exhibit waves- when you speak, the tiny air perturbations travel out like waves, same as throwing a pebble in a pond, or wiggling a jump rope. These are all very different systems, so why the hell do we see such similar behavior?

Why would this be? Well, think of a system in equilibrium, and nudging it a tiny bit away from equilibrium. If the equilibrium is at some parameter a, and we nudge it a tiny bit away from equilibrium (so x-a = epsilon)

[E(x) = E(a + \epsilon)]

Now, we can Taylor expand- but we note that in equilibrium the energy is at a minimum, so the linear term in the Taylor expansion is 0

[E(a + \epsilon) = E(a) + \frac{1}{2}\frac{d^2E}{dx^2} \epsilon^2 + \ldots ]

Now, constants in potential energy don’t matter, and so the first important term is a squared potential energy, which is a spring.

So Taylor series-> everything is a spring.
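Here’s a one-liner sympy version of that argument (my own example- a pendulum, which the post doesn’t mention): expand a potential around its minimum and watch the spring term pop out.

```python
# Expand E = m*g*l*(1 - cos(theta)) around the minimum at theta = 0.
import sympy as sp

theta, m, g, l = sp.symbols("theta m g l", positive=True)
E = m * g * l * (1 - sp.cos(theta))

print(sp.series(E, theta, 0, 5))
# g*l*m*theta**2/2 - g*l*m*theta**4/24 + O(theta**5)
# no linear term (we're at a minimum); the leading term is a spring, (1/2)*k*theta**2 with k = m*g*l
```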

Why field theory?

So far, we’ve learned in earlier posts in my quantum category that

  1. Classical theories can be described in terms of paths with rules where particles take the path of “least action.”

  2. We can turn a classical theory into a quantum one by having the particle take every path, with the phase from each path given by the action along the path (divided by hbar).

We’ve also learned that we can make a classical theory comply with special relativity by picking a relativistic action (in particular, an action proportional to the “proper time.”)

So one obvious thing to try, to make a special relativistic quantum theory, would be to start with a special relativistic action and do the sum over paths, just as we did for the non-relativistic theory.

You can do this- and it almost works! If you do the mathematical transition from our original, non-relativistic paths to standard, textbook quantum you’d find that you get the Schroedinger equation (or if you were more sophisticated you could get something called the Pauli equation that no one talks about, but is basically the Schroedinger equation + the fact that electrons have spin).

If you try to do it from a relativistic action, you would get an equation called the Klein-Gordon equation (or if you were more sophisticated you could get the Dirac equation). Unfortunately, this runs into trouble- there can be weird negative probabilities, and general weirdness to the solutions.

So we have done something wrong- and the answer is that making the action special relativistic invariant isn’t enough.

Let’s look at some paths:

So the dotted line in this picture represents the light cone- the paths light traveling away from the starting point would take. All of the paths end up inside the light cone, but some of them go outside of it along the way. This leads to really strange situations; let’s look at one path that goes outside the light cone, from two frames of reference:

So what we see is that a normal path in the first frame (on the left) looks really strange in the second- because the order of events isn’t fixed for events outside the lightcone, some frames of reference see the path as moving back in time.

So immediately we see the problem. When we switched to the relativistic theory we weren’t including all the paths- to really include all the paths we need to include paths that also (apparently) move back in time. This is very strange! Notice that if we run time forward, the X’ observer sees, at some points along the path, two particles (one moving back in time, one moving forward).

Feynman’s genius was to demonstrate that we can think of these particles moving backward in time as anti-particles moving forward in time. So the x’ observer sees a particle/anti-particle pair appear at one time and annihilate at a later time.

So really our path set looks like

Notice that not only do we have paths connecting the two points, but we have totally unrelated loops that start and end at the same points- these paths are possible now!

So to calculate a probability, we can’t just look at the paths connecting points x_0 and x_1! There can be weird loopy paths that never touch x_0 and x_1 that still matter! From Feynman’s perspective, particle and anti particle pairs can form, travel awhile and annihilate later.

So as a book keeping device we introduce a field- at every point in space it has a value. To calculate the action of the field we can’t just look at the paths- instead we have to sum up the values of the fields (and some derivatives) at every point in space.

So our old action was a sum of the action over just times (S is the action, L is the lagrangian)

[S = \int dt L ]

Our new action has to be a sum over space and time.

[S = \int dt\, d^3x\, \ell ]

So now our Lagrangian is a Lagrangian density.

And we can’t just restrict ourselves to paths- we have to add up every possible configuration of the field.

So that’s why we need field theory to combine relativity with quantum mechanics. Next time, some implications.

Field theory implications

So the first thing is that if we take the Feynman interpretation, our field theory doesn’t have a fixed particle number- depending on the weird loops in a configuration it could have an almost arbitrary number of particles. So one way to phrase the problem with not including backwards paths is that we need to allow the particle number to fluctuate.

Also, I know some of you are thinking “what are these fields?” Well- that’s not so strange. Think of the electromagnetic fields. If you have no charges around, what are the solutions to the electromagnetic field? They are just light waves. Remember this post? Remember that certain special paths were the most important for the sum over all paths? Similarly, certain field configurations are the most important for the sum over configurations. Those are the solutions to the classical field theory.

So if we start with EM field theory, with no charges, then the most important solutions are photons (the light waves). So we can outline levels of approximation

Sum over all configurations -> (semi classical) photons that travel all paths -> (fully classical) particles that travel just the classical path.

Similarly, with any particle

Sum over all configurations -> (semi classical) particles that travel all paths -> (fully classical) particles that travel just the classical path.

This is why most quantum mechanics classes really only cover wave mechanics and don’t ever get fully quantum mechanical.

Planck length/time

Answering somervta’s question. What is the significance of Planck units?

Let’s start with an easier one where we have some intuition- let’s analyze the simple hydrogen atom (the go-to quantum mechanics problem). But instead of doing physics, let’s just do dimensional analysis- how big do we expect hydrogen energies to be?

Let’s start with something simpler- what sort of distances do we expect a hydrogen atom to have? How big should its radius be?

Well, first- what physics is involved? I model the hydrogen atom as an electron moving in an electric field, and I expect I’ll need quantum mechanics, so I’ll need hbar (Planck’s constant), e, the charge of the electron, Coulomb’s constant (call it k), and the mass of the electron. Can I turn these into a length?

Let’s give it a try- k*e^2 is an energy times a length. hbar is an energy * a time, so if we divide we can get hbar/(k*e^2) which has units of time/length. Multiply by another hbar, and we get hbar^2/(k*e^2), which has units of mass * length. So divide by the mass of the electron, and we get a quantity hbar^2/(m*k*e^2).

This has units of length, so we might guess that the important length scale for the hydrogen atom is our quantity (this has a value of about 53 picometers, which is about the right scale for atomic hydrogen).

We could also estimate the energy of the hydrogen atom by noting that

Energy ~ k*e^2/r and use our scale for r.

Energy ~ m*k^2*e^4/(hbar^2) ~27 eV.

This is about twice as large as the actual ground state, but its definitely the right order of magnitude.

Now what Planck noticed is that if you ask “what are the length scales of quantum gravity?” You end up with the constants G, c, and hbar. Turns out, you can make a length scale out of that (sqrt (hbar*G/c^3) ) So just like with hydrogen, we expect that gives us a characteristic length for where quantum effects might start to matter for gravity (or gravity effects might matter for quantum mechanics).

The planck energy and planck mass, then, are similarly characteristic mass and energy scales.

It’s sort of “how small do my lengths have to be before quantum gravity might matter?” But it’s just a guess, really. Planck energy is the energy you’d need to probe that sort of length scale (higher energies probe smaller lengths),etc.
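To put rough numbers on all of this, here’s a quick scipy sketch (my own check- the post only quotes approximate values):

```python
# Hydrogen scales and Planck scales from dimensional analysis.
from scipy.constants import hbar, m_e, e, epsilon_0, pi, G, c, eV

k = 1 / (4 * pi * epsilon_0)             # Coulomb's constant

bohr = hbar**2 / (m_e * k * e**2)        # hbar^2/(m*k*e^2)
hartree = m_e * k**2 * e**4 / hbar**2    # m*k^2*e^4/hbar^2

planck_length = (hbar * G / c**3) ** 0.5
planck_energy = (hbar * c / G) ** 0.5 * c**2

print(f"hydrogen length scale ~ {bohr * 1e12:.1f} pm")          # ~52.9 pm
print(f"hydrogen energy scale ~ {hartree / eV:.1f} eV")         # ~27.2 eV
print(f"Planck length ~ {planck_length:.2e} m")                 # ~1.6e-35 m
print(f"Planck energy ~ {planck_energy / (1e9 * eV):.2e} GeV")  # ~1.2e19 GeV
```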

Does that answer your question?

More Physics Answers

Answering bgaesop's question:

How is the whole dark matter/dark energy thing not just proof that the theory of universal gravitation is wrong?

So let’s start with dark energy- the first thing to note is that dark energy isn’t really new, as an idea it goes back to Einstein’s cosmological constant. When the cosmological implications of general relativity were first being understood, Einstein hated that it looked like the universe couldn’t be stable. BUT then he noticed that his field equations weren’t totally general- he could add a term, a constant. When Hubble first noticed that the universe was expanding Einstein dropped the constant, but in the fully general equation it was always there. There has never been a good argument why it should be zero (though some theories (like super symmetry) were introduced in part to force the constant to 0, back when everyone thought it was 0).

Dark energy really just means that constant has a non-zero value. Now, we don’t know why it should be non-zero. That’s a responsibility for a deeper theory- as far as GR goes it’s just some constant in the equation.

As for dark matter, that’s more complicated. The original observations were that you couldn’t make galactic rotation curves work out correctly with just the observable matter. So some people said “maybe there is a new type of non-interacting matter” and other people said “let’s modify gravity! Changing the theory a bit could fix the curves, and the scale is so big you might not notice the modifications to the theory.”

So we have two competing theories, and we need a good way to tell them apart. Some clever scientists got the idea to look at two galaxies that collided- the idea was the normal matter would smash together and get stuck at the center of the collision, but the dark matter would pass right through. So you would see two big blobs of dark matter moving away from each other (you can infer their presence from the way the heavy matter bends light, gravitational lensing), and a clump of visible matter in between. In the bullet cluster, we see exactly that.

Now, you can still try to modify gravitation to match the results, but the theories you get start to look pretty bizarre, and I don’t think any modified theory has worked successfully (though the dark matter interpretation is pretty natural).

In the standard model, what are the fundamental "beables" (things that exist) and what kinds of properties do they have (that is, not "how much mass do they have" but "they have mass")?

So this one is pretty tough, because I don’t think we know for sure exactly what the “beables” are (assuming you are using beable like Bell’s term).

The issue is that field theory is formulated in terms of potentials- the fields that enter into the action are the electromagnetic potential, not the electromagnetic field. In classical electromagnetic theory, we might say the electromagnetic field is a beable (Bell’s example), but the potential is not.

But in field theory we calculate everything in terms of potentials- and we consider certain states of the potential to be “photons.”

At the electron level, we have a field configuration that is more general than the wavefunction - different configurations represent different combinations of wavefunctions (one configuration might represent a certain 3 particle wavefunction, another might represent a single particle wavefunction,etc).

In Bohm type theories, the beables are the actual particle positions, and we could do something like that for field theory- assume the fields are just book keeping devices. This runs into problems though, because field configurations that don’t look much like particles are possible, and can have an impact on your theory. So you want to give some reality to the fields.

Another issue is that the field configurations themselves aren’t unique- symmetries relate different field configurations so that very different configurations imply the same physical state.

A lot of this goes back to the fact that we don’t have a realistic axiomatic field theory yet.

But for concreteness’ sake, assume the fields are “real.” Then you have fermion fields, which have a spin of 1/2, an electro-weak charge, a strong charge, and a coupling to the Higgs field. These represent right or left handed electrons, muons, neutrinos, etc.

You have gauge-fields (strong field, electro-weak field); these represent your force carrying bosons (photons, W, Z bosons, gluons).

And you have a Higgs field, which has a coupling to the electroweak field, and it has the property of being non-zero everywhere in space, and that constant value is called its vacuum expectation value.

What's the straight dope on dark matter candidates?

So, first off there are two types of potential dark matter. Hot dark matter, and cold dark matter. One obvious form of dark matter would be neutrinos- they only interact weakly and we know they exist! So this seems very obvious and promising until you work it out. Because neutrinos are so light (near massless), most of them will be traveling at very near the speed of light. This is “hot” dark matter and it doesn’t have the right properties.

So what we really want is cold dark matter. I think astronomers have some ideas for normal baryonic dark matter (brown dwarfs or something). I don’t know as much about those.

Particle physicists instead like to talk about what we call thermal relics. Way back in the early universe, when things were dense and hot, particles would be interconverting between various types (electron-positrons turning into quarks, turning into whatever). As the universe cooled, at some point the electro-weak force would split into the weak and electric force, and some of the weak particles would “freeze out.” We can calculate this and it turns out the density of hypothetical “weak force freeze out” particles would be really close to the density of dark matter. These are called thermal relics. So what we want are particles that interact via the weak force (so the thermal relics have the right density) and are heavier than neutrinos (so they aren’t too hot).

From SUSY

It turns out it’s basically way too easy to create these sorts of models. There are lots of different super-symmetry models but all of them produce heavy “super partners” for every existing particle. So one thing you can do is assume super symmetry and then add one additional symmetry (they usually pick R-parity); the goal of the additional symmetry is to keep the lightest super partner from decaying. So usually the lightest partner is related to the weak force (generally it’s a partner to some combination of the Higgs, the Z bosons, and the photons. Since these all have the same quantum numbers they mix into different mass states). These are called neutralinos. Because they are superpartners to weakly interacting particles they will be weakly interacting, and they were forced to be stable by R parity. So BAM, dark matter candidate.

Of course, we’ve never seen any super-partners, so…

From GUTs

Other dark matter candidates can come from grand unified theories. The standard model is a bit strange- the Higgs field ties together two different particles to make the fermions (left handed electron + right handed electron, etc). The exception to this rule is neutrinos. Only left handed neutrinos exist, and their mass is Majorana.

But some people have noticed that if you add a right handed neutrino, you can do some interesting things- the first is that with a right handed neutrino in every generation you can embed each generation very cleanly in SO(10). Without the extra neutrino, you can embed in SU(5) but it’s a bit uglier. This has the added advantage that SO groups generally don’t have gauge anomalies.

The other thing is that if this neutrino is heavy, then you can explain why the other fermion masses are so light via a see-saw mechanism.

Now, SO(10) predicts this right handed neutrino doesn’t interact via the standard model forces, but because the gauge group is larger we have a lot more forces/bosons from the broken GUT. These extra bosons almost always lead to trouble with proton decay, so you have to figure out some way to arrange things so that protons are stable, but you can still make enough sterile neutrinos in the early universe to account for dark matter. I think there is enough freedom to make this mostly work, although the newer LHC constraints probably make that a bit tougher.

Obviously we’ve not seen any of the additional bosons of the GUT, or proton decay,etc.

From Axions

(note: the method for axion production is a bit different than other thermal relics)

There is a genuine puzzle to the standard model QCD/SU(3) gauge theory. When the theory was first designed physicists used the most general lagrangian consistent with CP symmetry. But the weak force violates CP, so CP is clearly not a good symmetry. Why then don’t we need to include the CP violating term in QCD?

So Peccei and Quinn were like “huh, maybe the term should be there, but look, we can add a new field that couples to the CP violating term, and then add some symmetries to force the field to near 0.” That would be fine, but the symmetry would have an associated Goldstone boson, and we’d have spotted a massless particle.

So you promote the global Peccei-Quinn symmetry to a gauge symmetry, and then the Goldstone boson becomes massive, and you’ve saved the day. But you’ve got this leftover massive “axion” particle. So BAM, dark matter candidate.

Like all the other dark matter candidates, this has problems. There are instanton solutions to QCD, and those would break the Peccei-Quinn symmetry. Try to fix it and you ruin the gauge symmetry (and so you’re back to a global symmetry and a massless, ruled-out axion). So it’s not an exact symmetry, and things get a little strained.

So these are the large families I can think of off hand. You can combine the different ones (SUSY SU(5) GUT particles,etc).

I realize this will be very hard to follow without much background, so if other people are interested, ask specific questions and I can try to clean up the specifics.

Also, I have a gauge theory post for my quantum sequence that will be going up soon.

If your results are highly counterintuitive...

They are almost certainly wrong.

Once, when I was a young, naive data scientist, I embarked on a project to look at individual claims handlers and how effective they were. How many claims did they manage to settle below the expected cost? How many claims were properly reserved? Basically, how well was risk managed?

And I discovered something amazing! Several of the most junior people in the department were fantastic, nearly perfect on all metrics. Several of the most senior people had performance all over the map. They were significantly below average on most metrics! Most of the claims money was spent on these underperformers! Big data had proven that a whole department in a company was nonsense lunacy!

Not so fast. Anyone with any insurance experience (or half a brain, or less of an arrogant physics-is-the-best mentality) would have realized something right away- the kinds of claims handled by junior people are going to be different. Everything that a manager thought could be handled easily by someone fresh to the business went to the new guys. Simple cases, no headaches, assess the cost, pay the cost, done.

Cases with lots of complications (maybe uncertain liability, weird accidents, etc) went to the senior people. Of course outcomes looked worse, more variance per claim makes the risk much harder to manage. I was the idiot, and misinterpreting my own results!

A second example occurred with a health insurance company where an employee I supervised thought he’d upended medicine when he discovered a standard-of-care chemo regimen led to worse outcomes than a much less common/“lighter” alternative. Having learned from my first experience, I dug into the data with him and we found out that the only cases where the less common alternative was used were cases where the cancer had been caught early and surgically removed while it was localized.

Since this experience, I’ve talked to startups looking to hire me, and startups looking for investment (and sometimes big-data companies looking to be hired by companies I work for), and I see this mistake over and over. “Look at this amazing counterintuitive big data result!”

The latest was in a trade magazine where some new company claimed that a strip-mall lawyer with 22 wins against some judge was necessarily better than a white-shoe law firm that won less often against the same judge. (Although in most companies I have worked for, if the case even got to trial something has gone wrong- everyone pushes for settlement. So judging by trial win record is silly for a second reason).

Locality, fields and the crown jewel of modern physics

Apologies, this post is not finished. I will edit to replace the to be continued section soon.

Last time, we talked about the need for a field theory associating a mathematical field with any point in space. Today, we are going to talk about what our fields might look like. And we’ll find something surprising!

I also want to emphasize locality, so in order to do that let’s consider our space time as a lattice, instead of the usual continuous space.

So that is a lattice. Now imagine that it’s 4 dimensional instead of 2 dimensional.

Now, a field configuration involves putting one of our phasors at every point in space.

So here is a field configuration:

To make our action local (and thus consistent with special relativity) we insist that the action at one lattice point only depends on the field at that point, and on the fields of the neighboring points.

We also need to make sure we keep the symmetry we know from earlier posts- we know that the amplitude of the phasor is what matters, and we have the symmetry to change the phase angle.

Neighbors of the central point, indicated by dotted lines.

We can compare neighboring points by subtracting (taking a derivative).

Sorry that is blurry. Middle phasor - left phasor = some other phasor.

And the last thing we need to capture is the symmetry- remember that the angle of our phasor didn’t matter for predictions- the probabilities are all related to amplitudes (the length of the phasor). The simplest way to keep this symmetry is to insist that when we adjust the phase angle, we adjust the angle of all the phasors in the field, everywhere:

Sorry for the shadow of my hand

Anyway, this image shows a transformation of all the phasors. This works, but it seems weird- consider a configuration like this:

This is two separate localized field configurations- we might interpret this as two particles. But should we really have to adjust the phase angle of all the field values over by the right particle if we are doing experiments only on the left particle?

Maybe what we really want is a local symmetry. A symmetry where we can rotate the phase angle of a phasor at any point individually (and all of them differently, if we like).
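Here’s a toy numpy version of this discussion (my own sketch, not anything from a real lattice field theory code): a complex phasor at every site, a local action built from amplitudes and neighbor differences, and a check that a global phase rotation is a symmetry while a site-by-site rotation is not.

```python
# Global vs. local phase rotations on a small lattice field.
import numpy as np

rng = np.random.default_rng(0)
N = 16
phi = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))   # a field configuration

def action(phi):
    # each term only involves a site and its nearest neighbors, so this is local
    dx = np.roll(phi, -1, axis=0) - phi
    dy = np.roll(phi, -1, axis=1) - phi
    return np.sum(np.abs(dx)**2 + np.abs(dy)**2 + np.abs(phi)**2)

S0 = action(phi)
S_global = action(phi * np.exp(1j * 0.7))                               # same angle everywhere
S_local = action(phi * np.exp(1j * rng.uniform(0, 2 * np.pi, (N, N))))  # different angle per site

print(S_global - S0)   # ~0: rotating every phasor by the same angle is a symmetry
print(S_local - S0)    # not ~0: the naive neighbor-difference action breaks local rotations
```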

To Be Continued

Sampling v. tracing

2016-01-24 08:00:00

Perf is probably the most widely used general purpose performance debugging tool on Linux. There are multiple contenders for the #2 spot, and, like perf, they're sampling profilers. Sampling profilers are great. They tend to be easy-to-use and low-overhead compared to most alternatives. However, there are large classes of performance problems sampling profilers can't debug effectively, and those problems are becoming more important.

For example, consider a Google search query. Below, we have a diagram of how a query is carried out. Each of the black boxes is a rack of machines and each line shows a remote procedure call (RPC) from one machine to another.

The diagram shows a single search query coming in, which issues RPCs to over a hundred machines (shown in green), each of which delivers another set of requests to the next, lower level (shown in blue). Each request at that lower level also issues a set of RPCs, which aren't shown because there's too much going on to effectively visualize. At that last leaf level, the machines do 1ms-2ms of work, and respond with the result, which gets propagated and merged on the way back, until the search result is assembled. While that's happening, on any leaf machine, 20-100 other search queries will touch the same machine. A single query might touch a couple thousand machines to get its results. If we look at the latency distribution for RPCs, we'd expect that with that many RPCs, any particular query will see a 99%-ile worst case (tail) latency; and much worse than mere 99%-ile, actually.

That latency translates directly into money. It's now well established that adding user latency reduces ad clicks, reduces the odds that a user will complete a transaction and buy something, reduces the odds that a user will come back later and become a repeat customer, etc. Over the past ten to fifteen years, the understanding that tail latency is an important factor in determining user latency, and that user latency translates directly to money, has trickled out from large companies like Google into the general consciousness. But debugging tools haven't kept up.

Sampling profilers, the most common performance debugging tool, are notoriously bad at debugging problems caused by tail latency because they aggregate events into averages. But tail latency is, by definition, not average.

For more on this, let's look at this wide ranging Dick Sites talk1 which covers, among other things, the performance tracing framework that Dick and others have created at Google. By capturing “every” event that happens, it lets us easily debug performance oddities that would otherwise be difficult to track down. We'll take a look at three different bugs to get an idea about the kinds of problems Google's tracing framework is useful for.

First, we can look at another view of the search query we just saw above: given a top-level query that issues some number of RPCs, how long does it take to get responses?

Time goes from left to right. Each row is one RPC, with the blue bar showing when the RPC was issued and when it finished. We can see that the first RPC is issued and returns before 93 other RPCs go out. When the last of those 93 RPCs is done, the search result is returned. We can see that two of the RPCs take substantially longer than the rest; the slowest RPC gates the result of the search query.

To debug this problem, we want a couple things. Because the vast majority of RPCs in a slow query are normal, and only a couple are slow, we need something that does more than just show aggregates, like a sampling profiler would. We need something that will show us specifically what's going on in the slow RPCs. Furthermore, because weird performance events may be hard to reproduce, we want something that's cheap enough that we can run it all the time, allowing us to look at any particular case of bad performance in retrospect. In the talk, Dick Sites mentions having a budget of about 1% of CPU for the tracing framework they have.

In addition, we want a tool that has time-granularity that's much shorter than the granularity of the thing we're debugging. Sampling profilers typically run at something like 1 kHz (1 ms between samples), which gives little insight into what happens in a one-time event, like a slow RPC that still executes in under 1ms. There are tools that will display what looks like a trace from the output of a sampling profiler, but the resolution is so poor that these tools provide no insight into most performance problems. While it's possible to crank up the sampling rate on something like perf, you can't get as much resolution as we need for the problems we're going to look at.

Getting back to the framework, to debug something like this, we might want to look at a much more zoomed in view. Here's an example with not much going on (just tcpdump and some packet processing with recvmsg), just to illustrate what we can see when we zoom in.

The horizontal axis is time, and each row shows what a CPU is executing. The different colors indicate that different things are running. The really tall slices are kernel mode execution, the thin black line is the idle process, and the medium height slices are user mode execution. We can see that CPU0 is mostly handling incoming network traffic in a user mode process, with 18 switches into kernel mode. CPU1 is maybe half idle, with a lot of jumps into kernel mode, doing interrupt processing for tcpdump. CPU2 is almost totally idle, except for a brief chunk when a timer interrupt fires.

What's happening is that every time a packet comes in, an interrupt is triggered to notify tcpdump about the packet. The packet is then delivered to the process that called recvmsg on CPU0. Note that running tcpdump isn't cheap, and it actually consumes 7% of a server if you turn it on when the server is running at full load. This only dumps network traffic, and it's already at 7x the budget we have for tracing everything! If we were to look at this in detail, we'd see that Linux's TCP/IP stack has a large instruction footprint, and workloads like tcpdump will consistently come in and wipe that out of the l1i and l2 caches.

Anyway, now that we've seen a simple example of what it looks like when we zoom in on a trace, let's look at how we can debug the slow RPC we were looking at before.

We have two views of a trace of one machine here. At the top, there's one row per CPU, and at the bottom there's one row per RPC. Looking at the top set, we can see that there are some bits where individual CPUs are idle, but that the CPUs are mostly quite busy. Looking at the bottom set, we can see parts of 40 different searches, most of which take around 50us, with the exception of a few that take much longer, like the one pinned between the red arrows.

We can also look at a trace of the same timeframe showing which locks are being held and which threads are executing. The arcs between the threads and the locks show when a particular thread is blocked, waiting on a particular lock. If we look at this, we can see that the time spent waiting for locks is sometimes much longer than the time spent actually executing anything. The thread pinned between the arrows is the same thread that's executing that slow RPC. It's a little hard to see what's going on here, so let's focus on that single slow RPC.

We can see that this RPC spends very little time executing and a lot of time waiting. We can also see that we'd have a pretty hard time trying to find the cause of the waiting with traditional performance measurement tools. According to stackoverflow, you should use a sampling profiler! But tools like OProfile are useless since they'll only tell us what's going on when our RPC is actively executing. What we really care about is what our thread is blocked on and why.

Instead of following the advice from stackoverflow, let's look at the second view of this again.

We can see that, not only is this RPC spending most of its time waiting for locks, it's actually spending most of its time waiting for the same lock, with only a short chunk of execution time between the waiting. With this, we can look at the cause of the long wait for a lock. Additionally, if we zoom in on the period between waiting for the two locks, we can see something curious.

It takes 50us for the thread to start executing after it gets scheduled. Note that the wait time is substantially longer than the execution time. The waiting is because an affinity policy was set which will cause the scheduler to try to schedule the thread back to the same core so that any data that's in the core's cache will still be there, giving you the best possible cache locality, which means that the thread will have to wait until the previously scheduled thread finishes. That makes intuitive sense, but if we consider, for example, a 2.2GHz Skylake, the l2 and l3 cache latencies are 6.4ns and 21.2ns, respectively. Is it worth changing the affinity policy to speed this kind of thing up? You can't tell from this single trace, but with the tracing framework used to generate this data, you could do the math to figure out if you should change the policy.

In the talk, Dick notes that, given the actual working set size, it would be worth waiting up to 10us to schedule on another CPU sharing the same l2 cache, and 100us to schedule on another CPU sharing the same l3 cache2.

Something else you can observe from this trace is that, if you care about a workload that resembles Google search, basically every standard benchmark out there is bad, and the standard technique of running N copies of spec is terrible. That's not a straw man. People still do that in academic papers today, and some chip companies use SPEC to benchmark their mobile devices!

Anyway, that was one performance issue where we were able to see what was going on because of the ability to see a number of different things at the same time (CPU scheduling, thread scheduling, and locks). Let's look at a simpler single-threaded example on a single machine where a tracing framework is still beneficial:

This is a trace from gmail, circa 2004. Each row shows the processing that it takes to handle one email. Well, except for the last 5 rows; the last email shown takes so long to process that displaying all of the processing takes 5 rows of space. If we look at each of the normal emails, they all look approximately the same in terms of what colors (i.e., what functions) are called and how much time they take. The last one is different. It starts the same as all the others, but then all this other junk appears that only happens in the slow email.

The email itself isn't the problem -- all of that extra junk is the processing that's done to reindex the words from the emails that had just come in, which was batched up across multiple emails. This picture caused the Gmail devs to move that batch work to another thread, reducing tail latency from 1800ms to 100ms. This is another performance bug that it would be very difficult to track down with standard profiling tools. I've often wondered why email almost always appears quickly when I send to gmail from gmail, and it sometimes takes minutes when I send work email from outlook to outlook. My guess is that a major cause is that it's much harder for the outlook devs to track down tail latency bugs like this than it is for the gmail devs to do the same thing.

Let's look at one last performance bug before moving on to discussing what kind of visibility we need to track these down. This is a bit of a spoiler, but with this bug, it's going to be critical to see what the entire machine is doing at any given time.

This is a histogram of disk latencies on storage machines for a 64kB read, in ms. There are two sets of peaks in this graph. The ones that make sense, on the left in blue, and the ones that don't, on the right in red.

Going from left to right on the peaks that make sense, first there's the peak at 0ms for things that are cached in RAM. Next, there's a peak at 3ms. That's way too fast for the 7200rpm disks we have to transfer 64kB; the time to get a random point under the head is already (1/(7200/60)) / 2 s = 4ms. That must be the time it takes to transfer something from the disk's cache over PCIe. The next peak, at near 25ms, is the time it takes to seek to a point and then read 64kB off the disk.

Those numbers don't look so bad, but the 99%-ile latency is a whopping 696ms, and there are peaks at 250ms, 500ms, 750ms, 1000ms, etc. And these are all unreproducible -- if you go back and read a slow block again, or even replay the same sequence of reads, the slow reads are (usually) fast. That's weird! What could possibly cause delays that long? In the talk, Dick Sites says “each of you think of a guess, and you'll find you're all wrong”.

That's a trace of thirteen disks in a machine. The blue blocks are reads, and the red blocks are writes. The black lines show the time from the initiation of a transaction by the CPU until the transaction is completed. There are some black lines without blocks because some of the transactions hit in a cache and don't require actual disk activity. If we wait for a period where we can see tail latency and zoom in a bit, we'll see this:

We can see that there's a period where things are normal, and then some kind of phase transition into a period where there are 250ms gaps (4) between periods of disk activity (5) on the machine for all disks. This goes on for nine minutes. And then there's a phase transition and disk latencies go back to normal. That it's machine wide and not disk specific is a huge clue.

Using that information, Dick pinged various folks about what could possibly cause periodic delays that are a multiple of 250ms on an entire machine, and found out that the cause was kernel throttling of the CPU for processes that went beyond their usage quota. To enforce the quota, the kernel puts all of the relevant threads to sleep until the next multiple of a quarter second. When the quarter-second hand of the clock rolls around, it wakes up all the threads, and if those threads are still using too much CPU, the threads get put back to sleep for another quarter second. The phase change out of this mode happens when, by happenstance, there aren't too many requests in a quarter second interval and the kernel stops throttling the threads.

After finding the cause, an engineer found that this was happening on 25% of disk servers at Google, for an average of half an hour a day, with periods of high latency as long as 23 hours. This had been happening for three years3. Dick Sites says that fixing this bug paid for his salary for a decade. This is another bug where traditional sampling profilers would have had a hard time. The key insight was that the slowdowns were correlated and machine wide, which isn't something you can see in a profile.

One question you might have is, is this because of some flaw in existing profilers, or can profilers provide enough information that you don't need to use tracing tools to track down rare, long-tail, performance bugs? I've been talking to Xi Yang about this, who had an ISCA 2015 paper and talk describing some of his work. He and his collaborators have done a lot more since publishing the paper, but the paper still contains great information on how far a profiling tool can be pushed. As Xi explains in his talk, one of the fundamental limits of a sampling profiler is how often you can sample.

This is a graph of the number of executed instructions per clock (IPC) over time in Lucene, which is the core of Elasticsearch.

At 1kHz, which is the default sampling rate for perf, you basically can't see that anything changes over time at all. At 100kHz, which is as fast as perf runs, you can tell something is going on, but not what. The 10MHz graph is labeled SHIM because that's the name of the tool presented in the paper. At 10MHz, you get a much better picture of what's going on (although it's worth noting that 10MHz is substantially lower resolution than you can get out of some tracing frameworks).

If we look at the IPC in different methods, we can see that we're losing a lot of information at the slower sampling rates:

These are the top 10 hottest methods in Lucene, ranked by execution time; these 10 methods account for 74% of the total execution time. With perf, it's hard to tell which methods have low IPC, i.e., which methods are spending time stalled. But with SHIM, we can clearly see that there's one method that spends a lot of time waiting, #4.

In retrospect, there's nothing surprising about these graphs. We know from the Nyquist theorem that, to observe a signal with some frequency X, we have to sample at a rate of at least 2X. There are a lot of factors affecting performance that have a frequency higher than 1kHz (e.g., CPU p-state changes), so we should expect to be unable to directly observe a lot of things that affect performance with perf or other traditional sampling profilers. If we care about microbenchmarks, we can get around this by repeatedly sampling the same thing over and over again, but for rare or one-off events, it may be hard or impossible to do that.
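As a rough illustration of what that sampling-rate limit means in practice, here's the arithmetic for how many samples land inside a single short event, using an invented 50μs stall as the example:

    # How many samples land inside one 50 microsecond event at each rate?
    # (The 50us figure is made up purely for illustration.)
    event_us = 50
    for rate_hz, name in [(1_000, "perf default"), (100_000, "perf max"), (10_000_000, "SHIM")]:
        samples = event_us * 1e-6 * rate_hz
        print(f"{name:>12} @ {rate_hz:>10} Hz: {samples:g} samples per event")
    # perf default: 0.05 -- most such events are never sampled at all
    # perf max:     5    -- you can tell something happened, not its shape
    # SHIM:         500  -- enough resolution to see what's going on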

This raises a few questions:

  1. Why does perf sample so infrequently?
  2. How does SHIM get around the limitations of perf?
  3. Why are sampling profilers dominant?

1. Why does perf sample so infrequently?

This comment from events/core.c in the Linux kernel explains the limit:

perf samples are done in some very critical code paths (NMIs). If they get too much CPU time, the system can lock up and not get any real work done.

As we saw from the tcpdump trace in the Dick Sites talk, interrupts take a significant amount of time to get processed, which limits the rate at which you can sample with an interrupt based sampling mechanism.

2. How does SHIM get around the limitations of perf?

Instead of having an interrupt come in periodically, like perf, SHIM instruments the runtime so that it periodically runs a code snippet that can squirrel away relevant information. In particular, the authors instrumented the Jikes RVM, which injects yield points into every method prologue, method epilogue, and loop back edge. At a high level, injecting a code snippet into every function prologue and epilogue sounds similar to what Dick Sites describes in his talk.
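As a very rough sketch of the shape of that approach (not SHIM itself, which instruments the Jikes RVM's yield points and reads hardware performance counters, neither of which plain Python can do), the idea is "run a tiny probe at every method entry and exit and append to a buffer" rather than "interrupt the program from outside":

    import time
    from functools import wraps

    TRACE = []  # stand-in for SHIM's per-thread buffers

    def probed(fn):
        """Record a timestamp at every entry and exit of fn -- the probe runs
        inline in the normal execution stream instead of via an interrupt."""
        @wraps(fn)
        def wrapper(*args, **kwargs):
            TRACE.append(("enter", fn.__name__, time.perf_counter_ns()))
            try:
                return fn(*args, **kwargs)
            finally:
                TRACE.append(("exit", fn.__name__, time.perf_counter_ns()))
        return wrapper

    @probed
    def handle_request():
        return sum(range(10_000))  # stand-in for real work

    handle_request()

Again, this is just the shape of the idea; the real implementations record hardware counter values and are engineered to keep the probe overhead tiny.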

The details are different, and I recommend both watching the Dick Sites talk and reading the Yang et al. paper if you're interested in performance measurement, but the fundamental similarity is that both of them decided that it's too expensive to have another thread break in and sample periodically, so they both ended up injecting some kind of tracing code into the normal execution stream.

It's worth noting that sampling, at any frequency, is going to miss time spent waiting on (for example) software locks. Dick Sites's recommendation for this is to timestamp based on wall clock (not CPU clock), and then try to find the underlying causes of unusually long waits.

3. Why are sampling profilers dominant?

We've seen that Google's tracing framework allows us to debug performance problems that we'd never be able to catch with traditional sampling profilers, while also collecting the data that sampling profilers collect. From the outside, SHIM looks like a high-frequency sampling profiler, but it achieves that by acting like a tracing tool. Even perf is getting support for low-overhead tracing. Intel added hardware support for certain types of tracing in Broadwell and Skylake, along with kernel support in 4.1 (with user mode support for perf coming in 4.3). If you're wondering how much overhead these tools have, Andi Kleen claims that the Intel tracing support in Linux has about a 5% overhead, and Dick Sites mentions in the talk that they have a budget of about 1% overhead.

It's clear that state-of-the-art profilers are going to look a lot like tracing tools in the future, but if we look at the state of things today, the easiest options are all classical profilers. You can fire up a profiler like perf and it will tell you approximately how much time various methods are taking. With other basic tooling, you can tell what's consuming memory. Between those two numbers, you can solve the majority of performance issues. Building out something like Google's performance tracing framework is non-trivial, and cobbling together existing publicly available tools to trace performance problems is a rough experience. You can see one example of this when Marek Majkowski debugged a tail latency issue using SystemTap.

In Brendan Gregg's page on Linux tracers, he says “[perf_events] can do many things, but if I had to recommend you learn just one [tool], it would be CPU profiling”. Tracing tools are cumbersome enough that his top recommendation on his page about tracing tools is to learn a profiling tool!

Now what?

If you want to use a tracing tool like the one we looked at today, your options are:

  1. Get a job at Google
  2. Build it yourself
  3. Cobble together what you need out of existing tools

1. Get a job at Google

I hear Steve Yegge has good advice on how to do this. If you go this route, try to attend orientation in Mountain View. They have the best orientation.

2. Build it yourself

If you look at the SHIM paper, there's a lot of cleverness built in to get really fine-grained information while minimizing overhead. I think their approach is really neat, but considering the current state of things, you can get a pretty substantial improvement without much cleverness. Fundamentally, all you really need is some way to inject your tracing code at the appropriate points, some number of bits for a timestamp, plus a handful of bits to store the event.

Say you want to trace transitions between user mode and kernel mode. The transitions between waiting and running will tell you what the thread was waiting on (e.g., disk, timer, IPI, etc.). There are maybe 200k transitions per second per core on a busy node. 200k events with a 1% overhead is 50ns per event per core. A cache miss is well over 100 cycles, so our budget is less than one cache miss per event, meaning that each record must fit within a fraction of a cache line. If we have 20 bits of timestamp (RDTSC >> 8 bits, giving ~100ns resolution and 100ms range) and 12 bits of event, that's 4 bytes, or 16 events per cache line. Each core has to have its own buffer to avoid cache contention. To map RDTSC times back to wall clock times, calling gettimeofday along with RDTSC at least every 100ms is sufficient.
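A sketch of what packing such a record could look like, with Python standing in for what would really be a few instructions writing into a per-core buffer (the field widths are the ones from the paragraph above):

    TS_BITS, EVENT_BITS = 20, 12   # 20-bit timestamp + 12-bit event = 32 bits

    def pack_event(rdtsc, event_code):
        ts = (rdtsc >> 8) & ((1 << TS_BITS) - 1)    # ~100ns resolution, ~100ms range
        assert 0 <= event_code < (1 << EVENT_BITS)  # 4096 distinct event codes
        return (ts << EVENT_BITS) | event_code      # fits in 4 bytes

    def unpack_event(record):
        return record >> EVENT_BITS, record & ((1 << EVENT_BITS) - 1)

    rec = pack_event(rdtsc=123_456_789, event_code=0x2A)
    assert rec < 2**32
    assert unpack_event(rec) == ((123_456_789 >> 8) & 0xFFFFF, 0x2A)

Sixteen of these records fit in a 64-byte cache line, which is what keeps the per-event cost to a fraction of a cache miss.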

Now, say the machine is serving 2000 QPS. That's 20 99%-ile tail events per second and 2 99.9% tail events per second. Since those events are, by definition, unusually long, Dick Sites recommends a window of 30s to 120s to catch those events. If we have 4 bytes per event * 200k events per second * 40 cores, that's about 32MB/s of data. Writing to disk while we're logging is hopeless, so you'll want to store the entire log while tracing, which will be in the range of 1GB to 4GB. That's probably fine for a typical machine in a datacenter, which will have between 128GB and 256GB of RAM.
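The sizing arithmetic above, spelled out:

    bytes_per_event = 4
    events_per_sec_per_core = 200_000
    cores = 40

    rate_mb_s = bytes_per_event * events_per_sec_per_core * cores / 1e6
    print(f"logging rate: {rate_mb_s:.0f} MB/s")    # ~32 MB/s across the machine

    for window_s in (30, 120):
        gb = rate_mb_s * window_s / 1e3
        print(f"{window_s:>3}s window: ~{gb:.1f} GB of buffer")
    # ~1 GB for 30s, ~3.8 GB for 120s -- small next to 128-256 GB of RAM,
    # which is why you keep the whole log in memory rather than writing it out.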

My not-so-secret secret hope for this post is that someone will take this idea and implement it. That's already happened with at least one blog post idea I've thrown out there, and this seems at least as valuable.

3. Cobble together what you need out of existing tools

If you don't have a magical framework that solves all your problems, the tool you want is going to depend on the problem you're trying to solve.

For figuring out why things are waiting, Brendan Gregg's write-up on off-CPU flame graphs is a pretty good start if you don't have access to internal Google tools. For that matter, his entire site is great if you're doing any kind of Linux performance analysis. There's info on Dtrace, ftrace, SystemTap, etc. Most tools you might use are covered, although PMCTrack is missing.

The problem with all of these is that they're all much higher overhead than the things we've looked at today, so they can't be run in the background to catch and effectively replay any bug that comes along if you operate at scale. Yes, that includes dtrace, which I'm calling out in particular because any time you have one of these discussions, a dtrace troll will come along to say that dtrace has supported that for years. It's like the common lisp of trace tools, in terms of community trolling.

Anyway, if you're on Windows, Bruce Dawson's site seems to be the closest analogue to Brendan Gregg's site. If that doesn't have enough detail, there's always the Windows Internals books.

This is a bit far afield, but for problems where you want an easy way to get CPU performance counters, likwid is nice. It has a much nicer interface than perf stat, lets you easily get stats for only selected functions, etc.

Thanks to Nathan Kurz, Xi Yang, Leah Hanson, John Gossman, Dick Sites, Hari Angepat, and Dan Puttick for comments/corrections/discussion.

P.S. Xi Yang, one of the authors of SHIM, is finishing up his PhD soon and is going to be looking for work. If you want to hire a performance wizard, he has a CV and resume here.


  1. The talk is amazing and I recommend watching the talk instead of reading this post. I'm writing this up because I know if someone told me I should watch a talk instead of reading the summary, I wouldn't do it. Ok, fine. If you're like me, maybe you'd consider reading a couple of his papers instead of reading this post. I once heard someone say that it's impossible to disagree with Dick's reasoning. You can disagree with his premises, but if you accept his premises and follow his argument, you have to agree with his conclusions. His presentation is impeccable and his logic is implacable. [return]
  2. This oversimplifies things a bit since, if some level of cache is bandwidth limited, spending bandwidth to move data between cores could slow down other operations more than this operation is sped up by not having to wait. But even that's oversimplified since it doesn't take into account the extra power it takes to move data from a higher level cache as opposed to accessing the local cache. But that's also oversimplified, as is everything in this post. Reality is really complicated, and the more detail we want the less effective sampling profilers are. [return]
  3. This sounds like a long time, but if you ask around you'll hear other versions of this story at every company that creates systems complex beyond human understanding. I know of one chip project at Sun that was delayed for multiple years because they couldn't track down some persistent bugs. At Microsoft, they famously spent two years tracking down a scrolling smoothness bug on Vista. The bug was hard enough to reproduce that they set up screens in the hallways so that they could casually see when the bug struck their test boxes. One clue was that the bug only struck high-end boxes with video cards, not low-end boxes with integrated graphics, but that clue wasn't sufficient to find the bug.

    After quite a while, they called the Xbox team in to use their profiling expertise to set up a system that could capture the bug, and once they had the profiler set up it immediately became apparent what the cause was. This was back in the AGP days, where upstream bandwidth was something like 1/10th downstream bandwidth. When memory would fill up, textures would get ejected, and while doing so, the driver would lock the bus and prevent any other traffic from going through. That took long enough that the video card became unresponsive, resulting in janky scrolling.

    It's really common to hear stories of bugs that can take an unbounded amount of time to debug if the proper tools aren't available.

    [return]

We saw some really bad Intel CPU bugs in 2015 and we should expect to see more in the future

2016-01-10 08:00:00

2015 was a pretty good year for Intel. Their quarterly earnings reports exceeded expectations every quarter. They continue to be the only game in town for the serious server market, which continues to grow exponentially; from the earnings reports of the two largest cloud vendors, we can see that AWS and Azure grew by 80% and 100%, respectively. That growth has effectively offset the damage Intel has seen from the continued decline of the desktop market. For a while, it looked like cloud vendors might be able to avoid the Intel tax by moving their computation onto FPGAs, but Intel bought one of the two serious FPGA vendors and, combined with their fab advantage, they look well positioned to dominate the high-end FPGA market the same way they've been dominating the high-end server CPU market. Also, their fine for anti-competitive practices turned out to be $1.45B, much less than the benefit they gained from their anti-competitive practices1.

Things haven't looked so great on the engineering/bugs side of things, though. We've seen a number of fairly serious CPU bugs and it looks like we should expect more in the future. I don't keep track of Intel bugs unless they're so serious that people I know are scrambling to get a patch in because of the potential impact, and I still heard about two severe bugs this year in the last quarter of the year alone. First, there was the bug found by Ben Serebrin and Jan Beulich, which allowed a guest VM to fault in a way that would cause the CPU to hang in a microcode infinite loop, allowing any VM to DoS its host.

Major cloud vendors were quite lucky that this bug was found by a Google engineer, and that Google decided to share its knowledge of the bug with its competitors before publicly disclosing. Black hats spend a lot of time trying to take down major services. I'm actually really impressed by both the persistence and the cleverness of the people who spend their time attacking the companies I work for. If, buried deep in our infrastructure, we have a bit of code running at DPC that's vulnerable to slowdown because of some kind of hash collision, someone will find and exploit that, even if it takes a long and obscure sequence of events to make it happen. If this CPU microcode hang had been found by one of these black hats, there would have been major carnage for most cloud hosted services at the most inconvenient possible time2.

Shortly after the Serebrin/Beulich bug was found, a group of people found that running prime95, a commonly used tool for benchmarking and burn-in, caused their entire system to lock up. Intel's response to this was:

Intel has identified an issue that potentially affects the 6th Gen Intel® Core™ family of products. This issue only occurs under certain complex workload conditions, like those that may be encountered when running applications like Prime95. In those cases, the processor may hang or cause unpredictable system behavior.

which reveals almost nothing about what's actually going on. If you look at their errata list, you'll find that this is typical, except that they normally won't even name the application that was used to trigger the bug. For example, one of the current errata lists has entries like

  • Certain Combinations of AVX Instructions May Cause Unpredictable System Behavior
  • AVX Gather Instruction That Should Result in #DF May Cause Unexpected System Behavior
  • Processor May Experience a Spurious LLC-Related Machine Check During Periods of High Activity
  • Page Fault May Report Incorrect Fault Information

As we've seen, “unexpected system behavior” can mean that we're completely screwed. Machine checks aren't great either -- they cause Windows to blue screen and Linux to kernel panic. An incorrect address on a page fault is potentially even worse than a mere crash, and if you dig through the list you can find a lot of other scary sounding bugs.

And keep in mind that the Intel errata list has the following disclaimer:

Errata remain in the specification update throughout the product's lifecycle, or until a particular stepping is no longer commercially available. Under these circumstances, errata removed from the specification update are archived and available upon request.

Once they stop manufacturing a stepping (the hardware equivalent of a point release), they reserve the right to remove the errata and you won't be able to find out what errata your older stepping has unless you're important enough to Intel.

Anyway, back to 2015. We've seen at least two serious bugs in Intel CPUs in the last quarter3, and it's almost certain there are more bugs lurking. Back when I worked at a company that produced Intel-compatible CPUs, we did a fair amount of testing and characterization of Intel CPUs; as someone fresh out of school who'd previously assumed that CPUs basically worked, I was surprised by how many bugs we were able to find. Even though I never worked on the characterization and competitive analysis side of things, I still personally found multiple Intel CPU bugs just in the normal course of doing my job, poking around to verify things that seemed non-obvious to me. Turns out things that seem non-obvious to me are sometimes also non-obvious to Intel engineers. As more services move to the cloud and the impact of system hang and reset vulnerabilities increases, we'll see more black hats investing time in finding CPU bugs. We should expect to see a lot more of these when people realize that it's much easier than it seems to find these bugs. There was a time when a CPU family might only have one bug per year, with serious bugs happening once every few years, or even once a decade, but we've moved past that. In part, that's because "unpredictable system behavior" has moved from being an annoying class of bugs that forces you to restart your computation to an attack vector that lets anyone with an AWS account attack random cloud-hosted services, but it's mostly because CPUs have gotten more complex, making them more difficult to test and audit effectively, while Intel appears to be cutting back on validation effort. Ironically, we have hardware virtualization that's supposed to help us with security, but the virtualization is so complicated4 that the hardware virtualization implementation is likely to expose "unpredictable system behavior" bugs that wouldn't otherwise have existed. This isn't to say it's hopeless -- it's possible, in principle, to design CPUs such that a hang bug on one core doesn't crash the entire system. It's just that it's a fair amount of work to do that at every level (cache directories, the uncore, etc., would have to be modified to operate when a core is hung, as well as OS schedulers). No one's done the work because it hasn't previously seemed important.

You'll often hear software folks say that these things don't matter because they can (sometimes) be patched. But, many devices will never get patched, which means that hardware security bugs will leave some devices vulnerable for their entire lifetime. And even if you don't care about consumers, serious bugs are very bad for CPU vendors. At a company I worked for, we once had a bug escape validation and get found after we shipped. One OEM wouldn't talk to us for something like five years after that, and other OEMs that continued working with us had to re-qualify their parts with our microcode patch and they made sure to let us know how expensive that was. Intel has enough weight that OEMs can't just walk away from them after a bug, but they don't have unlimited political capital and every serious bug uses up political capital, even if it can be patched.

This isn't to say that we should try to get to zero bugs. There's always going to be a tradeoff between development speed and bug rate, and the optimal point probably isn't zero bugs. But we're now regularly seeing severe bugs with security implications, which changes the tradeoff a lot. With something like the FDIV bug, you can argue that it's statistically unlikely that any particular user who doesn't run numerical analysis code will be impacted, but security bugs are different. Attackers don't run random code, so you can't just say that it's unlikely that some condition will occur.

Update

After writing this, a person claiming to be an ex-Intel employee said "even with your privileged access, you have no idea" and a pseudo-anonymous commenter on reddit made this comment:

As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently.

Why?

Let me set the scene: It's late in 2013. Intel is frantic about losing the mobile CPU wars to ARM. Meetings with all the validation groups. Head honcho in charge of Validation says something to the effect of: "We need to move faster. Validation at Intel is taking much longer than it does for our competition. We need to do whatever we can to reduce those times... we can't live forever in the shadow of the early 90's FDIV bug, we need to move on. Our competition is moving much faster than we are" - I'm paraphrasing. Many of the engineers in the room could remember the FDIV bug and the ensuing problems caused for Intel 20 years prior. Many of us were aghast that someone highly placed would suggest we needed to cut corners in validation - that wasn't explicitly said, of course, but that was the implicit message. That meeting there in late 2013 signaled a sea change at Intel to many of us who were there. And it didn't seem like it was going to be a good kind of sea change. Some of us chose to get out while the getting was good. As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently.

I haven't been able to confirm this story from another source I personally know, although another anonymous commenter said "I left INTC in mid 2013. From validation. This ... is accurate compared with my experience." Another anonymous person, someone I know, didn't hear that speech, but found that at around that time, "velocity" became a buzzword and management spent a lot of time talking about how Intel needs more "velocity" to compete with ARM, which appears to confirm the sentiment, if not the actual speech.

I've also heard from formal methods people that, around the timeframe mentioned in the first comment, there was an exodus of formal verification folks. One story I've heard is that people left because they were worried about being made redundant. I'm told that, at the time, early retirement packages were being floated around and people strongly suspected layoffs. Another story I've heard is that things got really strange due to Intel's focus on the mobile battle with ARM, and people wanted to leave before things got even worse. But it's hard to say if this means anything, since Intel has been losing a lot of people to Apple because Apple offers better compensation packages and the promise of being less dysfunctional.

I also got anonymous stories about bugs. One person who works in HPC told me that when they were shopping for Haswell parts, a little bird told them that they'd see drastically reduced performance on variants with greater than 12 cores. When they tried building out both 12-core and 16-core systems, they found that they got noticeably better performance on their 12-core systems across a wide variety of workloads. That's not better per-core performance -- that's better absolute performance. Adding 4 more cores reduced the performance on parallel workloads! That was true both in single-socket and two-socket benchmarks.

There's also a mysterious hang during idle/low-activity bug that Intel doesn't seem to have figured out yet.

And then there's this Broadwell bug that hangs Linux if you don't disable low-power states.

And of course Intel isn't the only company with bugs -- this AMD bug found by Robert Swiecki not only allows a VM to crash its host, it also allows a VM to take over the host.

I doubt I've even heard of all the recent bugs and stories about verification/validation. Feel free to send other reports my way.

More updates

A number of folks have noticed unusual failure rates in storage devices and switches. This appears to be related to an Intel Atom bug. I find this interesting because the Atom is a relatively simple chip, and therefore a relatively simple chip to verify. When the first-gen Atom was released, folks at Intel seemed proud of how few internal spins of the chip were needed to ship a working production chip, something made possible by the simplicity of the chip. Modern Atoms are more complicated, but not that much more complicated.

Intel Skylake and Kaby Lake have a hyperthreading bug that's so serious that Debian recommends that users disable hyperthreading to avoid the bug, which can "cause spurious errors, such as application and system misbehavior, data corruption, and data loss".

On the AMD side, there might be a bug that's as serious as any recent Intel CPU bug. If you read that linked thread, you'll see an AMD representative asking people to disable SMT, disable OPCache Control, and change LLC settings to possibly mitigate or narrow down a serious crashing bug. On another thread, you can find someone reporting an #MC exception with "u-op cache crc mismatch".

Although AMD's response in the forum was that these were isolated issues, Phoronix was able to reproduce crashes by running a stress test that consists of compiling a number of open source programs. They report that they were able to get 53 segfaults in one hour of attempted compilation.

Some FreeBSD folks have also noticed seemingly unrelated crashes and have been able to get a reproduction by running code at a high address and then firing an interrupt. This can result in a hang or a crash. The reason this appears to be unrelated to the first reported Ryzen issues is that this is easily reproducible with SMT disabled.

Matt Dillon found an AMD bug triggered by DragonflyBSD, and committed a tiny patch to fix it:

There is a bug in Ryzen related to the kernel iretq'ing into a high user %rip address near the end of the user address space (top of user stack). This is a temporary workaround for the issue.

The original %rip for sigtramp was 0x00007fffffffffe0. Moving it down to fa0 wasn't sufficient. Moving it down to f00 moved the bug from nearly instant to taking a few hours to reproduce. Moving it down to be0 it took a day to reproduce. Moving it down to 0x00007ffffffffba0 (this commit) survived the overnight test.

Meltdown / spectre update

This is an interesting class of attack that takes advantage of speculative execution plus side channel attacks to leak privileged information into user processes. It seems that at least some of these attacks can be done from javascript in the browser.

Regarding the comments in the first couple updates on Intel's attitude towards validation recently, another person claiming to be ex-Intel backs up the statements above:

As a former Intel employee this aligns closely with my experience. I didn't work in validation (actually joined as part of Altera) but velocity is an absolute buzzword and the senior management's approach to complex challenges is sheer panic. Slips in schedules are not tolerated at all - so problems in validation are an existential threat, your project can easily just be canned. Also, because of the size of the company the ways in which quality and completeness are 'acheived' is hugely bureaucratic and rarely reflect true engineering fundamentals.

2024 update

We're approaching a decade since I wrote this post and the serious CPU bugs keep coming. For example, this recent one was found by RAD tools:

Intel Processor Instability Causing Oodle Decompression Failures

We believe that this is a hardware problem which affects primarily Intel 13900K and 14900K processors, less likely 13700, 14700 and other related processors as well. Only a small fraction of those processors will exhibit this behavior. The problem seems to be caused by a combination of BIOS settings and the high clock rates and power usage of these processors, leading to system instability and unpredictable behavior under heavy load ... Any programs which heavily use the processor on many threads may cause crashes or unpredictable behavior. There have been crashes seen in RealBench, CineBench, Prime95, Handbrake, Visual Studio, and more. This problem can also show up as a GPU error message, such as spurious "out of video memory" errors, even though it is caused by the CPU.

One can argue that this is a configuration bug, but from the standpoint of a typical user, all they observe is that their CPU is causing crashes. And, realistically, Intel knows that their CPUs are shipping into systems with these settings. The mitigation for this involves changing settings like "SVID behavior" → "Intel fail safe"; "Long duration power limit" → reduce to 125W if set higher ("Processor Base Power" on ARK); "Short duration power limit" → reduce to 253W if set higher (for 13900/14900 CPUs, other CPUs have other limits! "Maximum Turbo Power" on ARK); etc.

If they wanted their CPUs to not crash due to this issue, they could have and should have enforced these settings as well as some others. Instead, they left this up to the BIOS settings, and here we are.

Historically, Intel was much more serious about verification, validation, and testing than AMD and we saw this in their output. At one point, when a lot of enthusiast sites were excited about AMD (in the K7 days), Google stopped using AMD and basically banned purchases of AMD CPUs because they were so buggy and had caused so many hard-to-debug problems. But, over time, the relative level of verification/validation/test effort Intel allocates has gone down and Intel seems to have nearly caught or maybe caught AMD in their rate of really serious bugs. Considering Intel's current market position, with very heavy pressure from AMD, ARM, and Nvidia, it seems unlikely that Intel will turn this around in the foreseeable future. Nvidia, historically, has been significantly buggier than AMD or Intel, so Intel still has quite a bit of room to run to become the most buggy major chip manufacturer. Considering that Nvidia is one of the biggest threats to Intel and how Intel responded to threats from other, then-buggier, manufacturers, it seems like we should expect an even higher rate of bad bugs in the coming decade.

On the specific bug, there's tremendous pressure to operate more like a "move fast and break things" software company than a traditional, conservative CPU manufacturer for multiple reasons. When you manufacture a CPU, how fast it will run ends up being somewhat random, and there's no reliable way to tell how fast it will run other than testing it, so CPU companies run a set of tests on the CPU to see how fast it will go. This test time is actually fairly expensive, so there's a lot of work done to try to find the smallest set of tests possible that will correctly determine how fast the CPU can operate. One easy way to cut costs here is to just run fewer tests, even if the smaller set of tests doesn't fully guarantee that the CPU can operate at the speed it's sold at.

Another factor influencing this is that CPUs that are sold as nominally faster can sell for more, so there's also pressure to push the CPUs as close to their limits as possible. One way we can see that the margin here has, in general, decreased is by looking at how overclockable CPUs are. People are often happy with their overclocked CPU if they run a few tests, like prime95, stresstest, etc., and their part doesn't crash, but this isn't nearly enough to determine whether the CPU can really run everything a user could throw at it. If you seriously test CPUs (working at an Intel competitor, we would do this regularly), you find that Intel and other CPU companies have really pushed the limit of how fast they claim their CPUs are relative to how fast they actually are, which sometimes results in CPUs being sold that have been pushed beyond their capabilities.

On overclocking, as Fabian Giesen of RAD notes,

This stuff is not sanctioned and will count as overclocking if you try to RMA it but it's sold as a major feature of the platform and review sites test with it on.

Daniel Gibson replied with

hmm on my mainboard (ASUS ROG Strix B550-A Gaming -clearly gaming hardware, but middle price range) I had to explicitly enable the XMP/EXPO profile for the DDR4-RAM to run at full speed - which is DDR4-3200, officially supported by the CPU (Ryzen 5950X). Otherwise it ran at DDR4-2400 speed, I think? Or was it 2133? I forgot, at least significantly lower

To which Fabian noted

Correct. Fun fact: turning on EXPO technically voids your warranty ... it's great; both the CPU and the RAM list it as supported but it's officially not.

One might call it a racket, if one were inclined to such incisive language.

Intel didn't use to officially unofficially support this kind of thing. And, more generally, historically, CPU manufacturers were very hesitant to ship parts that had a non-negligible risk of crashes and data corruption when used as intended if they could avoid it, but more and more of these bugs keep happening. Some end up becoming quite public, like this one, because someone publishes a report about them, like the RAD report above. And some get quietly reported to the CPU manufacturer by a huge company, often under some kind of NDA agreement, where the big company gets replacement CPUs and Intel or another manufacturer quietly ships firmware fixes for the issue. And it surely must be the case that some of these aren't really caught at all, unless you count the occasional data corruption or crash as being caught.

CPU internals series

Thanks to Leah Hanson, Jeff Ligouri, Derek Slager, Ralph Corderoy, Joe Wilder, Nate Martin, Hari Angepat, JonLuca De Caro, Jeff Fowler, and a number of anonymous tipsters for comments/corrections/discussion.


  1. As with the Apple, Google, Adobe, etc., wage-fixing agreement, legal systems are sending the clear message that businesses should engage in illegal and unethical behavior since they'll end up getting fined a small fraction of what they gain. This is the opposite of the Becker-ian policy that's applied to individuals, where sentences have gotten jacked up on the theory that, since many criminals aren't caught, the criminals that are caught should have severe punishments applied as a deterrence mechanism. The theory is that the criminals will rationally calculate the expected sentence from a crime, and weigh that against the expected value of a crime. If, for example, the odds of being caught are 1% and we increase the expected sentence from 6 months to 50 years, criminals will calculate that the expected sentence has changed from 2 days to 6 months, thereby reducing the effective value of the crime and causing a reduction in crime. We now have decades of evidence that the theory that long sentences will deter crime is either empirically false or that the effect is very small; turns out that people who impulse commit crimes don't deeply study sentencing guidelines before they commit crimes. Ironically, for white-collar corporate crimes where Becker's theory might more plausibly hold, Becker's theory isn't applied. [return]
  2. Something I find curious is how non-linear the level of effort of the attacks is. Google, Microsoft, and Amazon face regular, persistent, attacks, and if they couldn't trivially mitigate the kind of unsophisticated attack that's been severely affecting Linode availability for weeks, they wouldn't be able to stay in business. If you talk to people at various bay area unicorns, you'll find that a lot of them have accidentally DoS'd themselves when they hit an external API too hard during testing. In the time that it takes a sophisticated attacker to find a hole in Azure that will cause an hour of disruption across 1% of VMs, that same attacker could probably completely take down ten unicorns for a much longer period of time. And yet, these attackers are hyper focused on the most hardened targets. Why is that? [return]
  3. The fault into microcode infinite loop also affects AMD processors, but basically no one runs a cloud on AMD chips. I'm pointing out Intel examples because Intel bugs have higher impact, not because Intel is buggier. Intel has a much better track record on bugs than AMD. IBM is the only major microprocessor company I know of that's been more serious about hardware verification than Intel, but if you have an IBM system running AIX, I could tell you some stories that will make your hair stand on end. Moreover, it's not clear how effective their verification groups can be since they've been losing experienced folks without being able to replace them for over a decade, but that's a topic for another post. [return]
  4. See this code for a simple example of how to use Intel's API for this. The example is simplified, so much so that it's not really useful except as a learning aid, and it still turns out to be around 1000 lines of low-level code. [return]

Normalization of deviance

2015-12-29 08:00:00

Have you ever mentioned something that seems totally normal to you only to be greeted by surprise? Happens to me all the time when I describe something everyone at work thinks is normal. For some reason, my conversation partner's face morphs from pleasant smile to rictus of horror. Here are a few representative examples.

There's the company that is perhaps the nicest place I've ever worked, combining the best parts of Valve and Netflix. The people are amazing and you're given near total freedom to do whatever you want. But as a side effect of the culture, they lose perhaps half of new hires in the first year, some voluntarily and some involuntarily. Totally normal, right? Here are a few more anecdotes that were considered totally normal by people in places I've worked. And often not just normal, but laudable.

There's the company that's incredibly secretive about infrastructure. For example, there's the team that was afraid that, if they reported bugs to their hardware vendor, the bugs would get fixed and their competitors would be able to use the fixes. Solution: request the firmware and fix bugs themselves! More recently, I know a group of folks outside the company who tried to reproduce the algorithm in the paper the company published earlier this year. The group found that they couldn't reproduce the result, and that the algorithm in the paper resulted in an unusual level of instability; when asked about this, one of the authors responded “well, we have some tweaks that didn't make it into the paper” and declined to share the tweaks, i.e., the company purposely published an unreproducible result to avoid giving away the details, as is normal. This company enforces secrecy by having a strict policy of firing leakers. This is introduced at orientation with examples of people who got fired for leaking (e.g., the guy who leaked that a concert was going to happen inside a particular office), and by announcing firings for leaks at the company all hands. The result of those policies is that I know multiple people who are afraid to forward emails about things like updated info on health insurance to a spouse for fear of forwarding the wrong email and getting fired; instead, they use another computer to retype the email and pass it along, or take photos of the email on their phone.

There's the office where I asked one day about the fact that I almost never saw two particular people in the same room together. I was told that they had a feud going back a decade, and that things had actually improved — for years, they literally couldn't be in the same room because one of the two would get too angry and do something regrettable, but things had now cooled to the point where the two could, occasionally, be found in the same wing of the office or even the same room. These weren't just random people, either. They were the two managers of the only two teams in the office.

There's the company whose culture is so odd that, when I sat down to write a post about it, I found that I'd not only written more than for any other single post, but more than all other posts combined (which is well over 100k words now, the length of a moderate book). This is the same company where someone recently explained to me how great it is that, instead of using data to make decisions, we use political connections, and that the idea of making decisions based on data is a myth anyway; no one does that. This is also the company where all four of the things they told me to get me to join were false, and the job ended up being the one thing I specifically said I didn't want to do. When I joined this company, my team didn't use version control for months and it was a real fight to get everyone to use version control. Although I won that fight, I lost the fight to get people to run a build, let alone run tests, before checking in, so the build is broken multiple times per day. When I mentioned that I thought this was a problem for our productivity, I was told that it's fine because it affects everyone equally; since the only thing that mattered was my stack-ranked productivity, I shouldn't care that it impacts the entire team, and the fact that it's normal for everyone means that there's no cause for concern.

There's the company that created multiple massive initiatives to recruit more women into engineering roles, where women still get rejected in recruiter screens for not being technical enough after being asked questions like "was your experience with algorithms or just coding?". I thought that my referral with a very strong recommendation would have prevented that, but it did not.

There's the company where I worked on a four person effort with a multi-hundred million dollar budget and a billion dollar a year impact, where requests for things that cost hundreds of dollars routinely took months or were denied.

You might wonder if I've just worked at places that are unusually screwed up. Sure, the companies are generally considered to be ok places to work and two of them are considered to be among the best places to work, but maybe I've just ended up at places that are overrated. But I have the same experience when I hear stories about how other companies work, even places with stellar engineering reputations, except that it's me that's shocked and my conversation partner who thinks their story is normal.

There's the companies that use @flaky, which includes the vast majority of Python-using SF Bay area unicorns. If you don't know what this is, this is a library that lets you add a Python annotation to those annoying flaky tests that sometimes pass and sometimes fail. When I asked multiple co-workers and former co-workers from three different companies what they thought this did, they all guessed that it re-runs the test multiple times and reports a failure if any of the runs fail. Close, but not quite. It's technically possible to use @flaky for that, but in practice it's used to re-run the test multiple times and report a pass if any of the runs pass. The company that created @flaky is effectively a storage infrastructure company, and the library is widely used at its biggest competitor.
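To make the distinction concrete, here are toy versions of the two retry semantics (illustrative decorators only, not the real flaky library or its API):

    from functools import wraps

    def fail_if_any_run_fails(runs=3):
        """What people guessed: run the test several times, fail if any run fails."""
        def deco(test):
            @wraps(test)
            def wrapper(*args, **kwargs):
                for _ in range(runs):
                    test(*args, **kwargs)   # any exception fails the test
            return wrapper
        return deco

    def pass_if_any_run_passes(runs=3):
        """How @flaky gets used in practice: swallow failures until a run passes."""
        def deco(test):
            @wraps(test)
            def wrapper(*args, **kwargs):
                last_exc = None
                for _ in range(runs):
                    try:
                        return test(*args, **kwargs)
                    except Exception as e:
                        last_exc = e
                raise last_exc
            return wrapper
        return deco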

There's the company with a reputation for having great engineering practices that had 2 9s of reliability last time I checked, for reasons that are entirely predictable from their engineering practices. This is the second thing in a row that can't be deanonymized because multiple companies fit the description. Here, I'm not talking about companies trying to be the next reddit or twitter where it's, apparently, totally fine to have 1 9. I'm talking about companies that sell platforms that other companies rely on, where an outage will cause dependent companies to pause operations for the duration of the outage. Multiple companies that build this kind of infrastructure have practices that lead to 2 9s of reliability.

As far as I can tell, what happens at a lot of these companies is that they started by concentrating almost totally on product growth. That's completely and totally reasonable, because companies are worth approximately zero when they're founded; they don't bother with things that protect them from losses, like good ops practices or actually having security, because there's nothing to lose (well, except for user data when the inevitable security breach happens, and if you talk to security folks at unicorns you'll know that these happen).

The result is a culture where people are hyper-focused on growth and ignore risk. That culture tends to stick even after the company has grown to be worth well over a billion dollars and has something to lose. Anyone who comes into one of these companies from Google, Amazon, or another place with solid ops practices is shocked. Often, they try to fix things, and then leave when they can't make a dent.

Google probably has the best ops and security practices of any tech company today. It's easy to say that you should take these things as seriously as Google does, but it's instructive to see how they got there. If you look at the codebase, you'll see that various services have names ending in z, as do a curiously large number of variables. I'm told that's because, once upon a time, someone wanted to add monitoring. It wouldn't really be secure to have google.com/somename expose monitoring data, so they added a z. google.com/somenamez. For security. At the company that is now the best in the world at security. They're now so good at security that multiple people I've talked to (all of whom joined after this happened) vehemently deny that this ever happened, even though the reasons they give don't really make sense (e.g., to avoid name collisions) and I have this from sources who were there at the time this happened.

Google didn't go from adding z to the end of names to having the world's best security because someone gave a rousing speech or wrote a convincing essay. They did it after getting embarrassed a few times, which gave people who wanted to do things “right” the leverage to fix fundamental process issues. It's the same story at almost every company I know of that has good practices. Microsoft was a joke in the security world for years, until multiple disastrously bad exploits forced them to get serious about security. This makes it sound simple, but if you talk to people who were there at the time, the change was brutal. Despite a mandate from the top, there was vicious political pushback from people whose position was that the company got to where it was in 2003 without wasting time on practices like security. Why change what's worked?

You can see this kind of thing in every industry. A classic example that tech folks often bring up is hand-washing by doctors and nurses. It's well known that germs exist, and that washing hands properly very strongly reduces the odds of transmitting germs and thereby significantly reduces hospital mortality rates. Despite that, trained doctors and nurses still often don't do it. Interventions are required. Signs reminding people to wash their hands save lives. But when people stand at hand-washing stations to require others walking by to wash their hands, even more lives are saved. People can ignore signs, but they can't ignore being forced to wash their hands.

This mirrors a number of attempts at tech companies to introduce better practices. If you tell people they should do it, that helps a bit. If you enforce better practices via code review, that helps a lot.

The data are clear that humans are really bad at taking the time to do things that are well understood to incontrovertibly reduce the risk of rare but catastrophic events. We will rationalize that taking shortcuts is the right, reasonable thing to do. There's a term for this: the normalization of deviance. It's well studied in a number of other contexts including healthcare, aviation, mechanical engineering, aerospace engineering, and civil engineering, but we don't see it discussed in the context of software. In fact, I've never seen the term used in the context of software.

Is it possible to learn from others' mistakes instead of making every mistake ourselves? The state of the industry makes this sound unlikely, but let's give it a shot. John Banja has a nice summary paper on the normalization of deviance in healthcare, with lessons we can attempt to apply to software development. One thing to note is that, because Banja is concerned with patient outcomes, there's a close analogy to devops failure modes, but normalization of deviance also occurs in cultural contexts that are less directly analogous.

The first section of the paper details a number of disasters, both in healthcare and elsewhere. Here's one typical example:

A catastrophic negligence case that the author participated in as an expert witness involved an anesthesiologist's turning off a ventilator at the request of a surgeon who wanted to take an x-ray of the patient's abdomen (Banja, 2005, pp. 87-101). The ventilator was to be off for only a few seconds, but the anesthesiologist forgot to turn it back on, or thought he turned it back on but had not. The patient was without oxygen for a long enough time to cause her to experience global anoxia, which plunged her into a vegetative state. She never recovered, was disconnected from artificial ventilation 9 days later, and then died 2 days after that. It was later discovered that the anesthesia alarms and monitoring equipment in the operating room had been deliberately programmed to a “suspend indefinite” mode such that the anesthesiologist was not alerted to the ventilator problem. Tragically, the very instrumentality that was in place to prevent such a horror was disabled, possibly because the operating room staff found the constant beeping irritating and annoying.

Turning off or ignoring notifications because there are too many of them and they're too annoying? An erroneous manual operation? This could be straight out of the post-mortem of more than a few companies I can think of, except that the result was a tragic death instead of the loss of millions of dollars. If you read a lot of tech post-mortems, every example in Banja's paper will feel familiar even though the details are different.

The section concludes,

What these disasters typically reveal is that the factors accounting for them usually had “long incubation periods, typified by rule violations, discrepant events that accumulated unnoticed, and cultural beliefs about hazards that together prevented interventions that might have staved off harmful outcomes”. Furthermore, it is especially striking how multiple rule violations and lapses can coalesce so as to enable a disaster's occurrence.

Once again, this could be from an article about technical failures. That makes the next section, on why these failures happen, seem worth checking out. The reasons given are:

The rules are stupid and inefficient

The example in the paper is about delivering medication to newborns. To prevent “drug diversion,” nurses were required to enter their password onto the computer to access the medication drawer, get the medication, and administer the correct amount. In order to ensure that the first nurse wasn't stealing drugs, if any drug remained, another nurse was supposed to observe the process, and then enter their password onto the computer to indicate they witnessed the drug being properly disposed of.

That sounds familiar. How many technical postmortems start off with “someone skipped some steps because they're inefficient”, e.g., “the programmer force pushed a bad config or bad code because they were sure nothing could go wrong and skipped staging/testing”? The infamous November 2014 Azure outage happened for just that reason. At around the same time, a dev at one of Azure's competitors overrode the rule that you shouldn't push a config that fails tests because they knew that the config couldn't possibly be bad. When that caused the canary deploy to start failing, they overrode the rule that you can't deploy from canary into staging with a failure because they knew their config couldn't possibly be bad and so the failure must be from something else. That postmortem revealed that the config was technically correct, but exposed a bug in the underlying software; it was pure luck that the latent bug the config revealed wasn't as severe as the Azure bug.

Humans are bad at reasoning about how failures cascade, so we implement bright line rules about when it's safe to deploy. But the same thing that makes it hard for us to reason about when it's safe to deploy makes the rules seem stupid and inefficient.

Knowledge is imperfect and uneven

People don't automatically know what should be normal, and when new people are onboarded, they can just as easily learn deviant processes that have become normalized as reasonable processes.

Julia Evans described to me how this happens:

new person joins
new person: WTF WTF WTF WTF WTF
old hands: yeah we know we're concerned about it
new person: WTF WTF wTF wtf wtf w...
new person gets used to it
new person #2 joins
new person #2: WTF WTF WTF WTF
new person: yeah we know. we're concerned about it.

The thing that's really insidious here is that people will really buy into the WTF idea, and they can spread it elsewhere for the duration of their career. Once, after doing some work on an open source project that's regularly broken and being told that it's normal to have a broken build, and that they were doing better than average, I ran the numbers, found that the project was basically worst in class, and wrote something about the idea that it's possible to have a build that nearly always passes with relatively low effort. The most common comment I got in response was, "Wow that guy must work with superstar programmers. But let's get real. We all break the build at least a few times a week", as if running tests (or for that matter, even attempting to compile) before checking code in requires superhuman abilities. But once people get convinced that some deviation is normal, they often get really invested in the idea.

I'm breaking the rule for the good of my patient

The example in the paper is of someone who breaks the rule that you should wear gloves when finding a vein. Their reasoning is that wearing gloves makes it harder to find a vein, which may result in their having to stick a baby with a needle multiple times. It's hard to argue against that. No one wants to cause a baby extra pain!

The second worst outage I can think of occurred when someone noticed that a database service was experiencing slowness. They pushed a fix to the service, and in order to prevent the service degradation from spreading, they ignored the rule that you should do a proper, slow, staged deploy. Instead, they pushed the fix to all machines. It's hard to argue against that. No one wants their customers to have degraded service! Unfortunately, the fix exposed a bug that caused a global outage.

The rules don't apply to me/You can trust me

most human beings perceive themselves as good and decent people, such that they can understand many of their rule violations as entirely rational and ethically acceptable responses to problematic situations. They understand themselves to be doing nothing wrong, and will be outraged and often fiercely defend themselves when confronted with evidence to the contrary.

As companies grow up, they eventually have to impose security that prevents every employee from being able to access basically everything. And at most companies, when that happens, some people get really upset. “Don't you trust me? If you trust me, how come you're revoking my access to X, Y, and Z?”

Facebook famously let all employees access everyone's profile for a long time, and you can even find HN comments indicating that some recruiters would explicitly mention that as a perk of working for Facebook. And I can think of more than one well-regarded unicorn where everyone still has access to basically everything, even after their first or second bad security breach. It's hard to get the political capital to restrict people's access to what they believe they need, or are entitled, to know. A lot of trendy startups have core values like “trust” and “transparency” which make it difficult to argue against universal access.

Workers are afraid to speak up

There are people I simply don't give feedback to because I can't tell if they'd take it well or not, and once you say something, it's impossible to un-say it. In the paper, the author gives an example of a doctor with poor handwriting who gets mean when people ask him to clarify what he's written. As a result, people guess instead of asking.

In most company cultures, people feel weird about giving feedback. Everyone has stories about a project that lingered on for months or years after it should have been terminated because no one was willing to offer explicit feedback. This is a problem even when cultures discourage meanness and encourage feedback: cultures of niceness seem to have as many issues around speaking up as cultures of meanness, if not more. In some places, people are afraid to speak up because they'll get attacked by someone mean. In others, they're afraid because they'll be branded as mean. It's a hard problem.

Leadership withholding or diluting findings on problems

In the paper, this is characterized by flaws and weaknesses being diluted as information flows up the chain of command. One example is how a supervisor might take sub-optimal actions to avoid looking bad to superiors.

I was shocked the first time I saw this happen. I must have been half a year or a year out of school. I saw that we were doing something obviously non-optimal, and brought it up with the senior person in the group. He told me that he didn't disagree, but that if we did it my way and there was a failure, it would be really embarrassing. He acknowledged that my way reduced the chance of failure without making the technical consequences of failure worse, but it was more important that we not be embarrassed. Now that I've been working for a decade, I have a better understanding of how and why people play this game, but I still find it absurd.

Solutions

Let's say you notice that your company has a problem that I've heard people at most companies complain about: people get promoted for heroism and putting out fires, not for preventing fires; and people get promoted for shipping features, not for doing critical maintenance work and bug fixing. How do you change that?

The simplest option is to just do the right thing yourself and ignore what's going on around you. That has some positive impact, but the scope of your impact is necessarily limited. Next, you can convince your team to do the right thing: I've done that a few times for practices I feel are really important and are sticky, so that I won't have to continue to expend effort on convincing people once things get moving.

But if the incentives are aligned against you, it will require an ongoing and probably unsustainable effort to keep people doing the right thing. In that case, the problem becomes convincing someone to change the incentives, and then making sure the change works as designed. How to convince people is worth discussing, but long and messy enough that it's beyond the scope of this post. As for making the change work, I've seen many “obvious” mistakes repeated, both in places I've worked and those whose internal politics I know a lot about.

Small companies have it easy. When I worked at a 100 person company, the hierarchy was individual contributor (IC) -> team lead (TL) -> CEO. That was it. The CEO had a very light touch, but if he wanted something to happen, it happened. Critically, he had a good idea of what everyone was up to and could basically adjust rewards in real-time. If you did something great for the company, there's a good chance you'd get a raise. Not in nine months when the next performance review cycle came up, but basically immediately. Not all small companies do that effectively, but with the right leadership, they can. That's impossible for large companies.

At large company A (LCA), they had the problem we're discussing and a mandate came down to reward people better for doing critical but low-visibility grunt work. There were too many employees for the mandator to directly make all decisions about compensation and promotion, but the mandator could review survey data, spot check decisions, and provide feedback until things were normalized. My subjective perception is that the company never managed to achieve parity between boring maintenance work and shiny new projects, but got close enough that people who wanted to make sure things worked correctly didn't have to significantly damage their careers to do it.

At large company B (LCB), ICs agreed that it's problematic to reward creating new features more richly than doing critical grunt work. When I talked to managers, they often agreed, too. But nevertheless, the people who get promoted are disproportionately those who ship shiny new things. I saw management attempt a number of cultural and process changes at LCB. Mostly, those took the form of pronouncements from people with fancy titles. For really important things, they might produce a video, and enforce compliance by making people take a multiple choice quiz after watching the video. The net effect I observed among other ICs was that people talked about how disconnected management was from the day-to-day life of ICs. But, for the same reasons that normalization of deviance occurs, that information seems to have no way to reach upper management.

It's sort of funny that this ends up being a problem about incentives. As an industry, we spend a lot of time thinking about how to incentivize consumers into doing what we want. But then we set up incentive systems that are generally agreed upon as incentivizing us to do the wrong things, and we do so via a combination of a game of telephone and cargo cult diffusion. Back when Microsoft was ascendant, we copied their interview process and asked brain-teaser interview questions. Now that Google is ascendant, we copy their interview process and ask algorithms questions. If you look around at trendy companies that are younger than Google, most of them basically copy their ranking/leveling system, with some minor tweaks. The good news is that, unlike many companies people previously copied, Google has put a lot of thought into most of their processes and made data driven decisions. The bad news is that Google is unique in a number of ways, which means that their reasoning often doesn't generalize, and that people often cargo cult practices long after they've become deprecated at Google.

This kind of diffusion happens for technical decisions, too. Stripe built a reliable message queue on top of Mongo, so we build reliable message queues on top of Mongo1. It's cargo cults all the way down2.

The paper has specific sub-sections on how to prevent normalization of deviance, which I recommend reading in full.

  • Pay attention to weak signals
  • Resist the urge to be unreasonably optimistic
  • Teach employees how to conduct emotionally uncomfortable conversations
  • System operators need to feel safe in speaking up
  • Realize that oversight and monitoring are never-ending

Let's look at how the first one of these, “pay attention to weak signals”, interacts with a single example: the “WTF WTF WTF” a new person gives off when they join the company.

If a VP decides something is screwed up, people usually listen. It's a strong signal. And when people don't listen, the VP knows what levers to pull to make things happen. But when someone new comes in, they don't know what levers they can pull to make things happen or who they should talk to almost by definition. They give out weak signals that are easily ignored. By the time they learn enough about the system to give out strong signals, they've acclimated.

“Pay attention to weak signals” sure sounds like good advice, but how do we do it? Strong signals are few and far between, making them easy to pay attention to. Weak signals are abundant. How do we filter out the ones that aren't important? And how do we get an entire team or org to actually do it? These kinds of questions can't be answered in a generic way; this takes real thought. We mostly put this thought elsewhere. Startups spend a lot of time thinking about growth, and while they'll all tell you that they care a lot about engineering culture, revealed preference shows that they don't. With a few exceptions, big companies aren't much different. At LCB, I looked through the competitive analysis slide decks and they're amazing. They look at every last detail on hundreds of products to make sure that everything is as nice for users as possible, from onboarding to interop with competing products. If there's any single screen where things are more complex or confusing than any competitor's, people get upset and try to fix it. It's quite impressive. And then when LCB onboards employees in my org, a third of them are missing at least one of an alias/account, an office, or a computer, a condition which can persist for weeks or months. The competitive analysis slide decks talk about how important onboarding is because you only get one chance to make a first impression, and then employees are onboarded with the impression that the company couldn't care less about them and that it's normal for quotidian processes to be pervasively broken. LCB can't even get the basics of employee onboarding right, let alone really complex things like acculturation. This is understandable — external metrics like user growth or attrition are measurable, while targets like whether you're acculturating people so that they don't ignore weak signals are softer and harder to determine, but that doesn't mean they're any less important. People write a lot about how things like using fancier languages or techniques like TDD or agile will make your teams more productive, but having a strong engineering culture is a much larger force multiplier.

Thanks to Sophie Smithburg and Marc Brooker for introducing me to the term Normalization of Deviance, and Kelly Eskridge, Leah Hanson, Sophie Rapoport, Sophie Smithburg, Julia Evans, Dmitri Kalintsev, Ralph Corderoy, Jamie Brandon, Egor Neliuba, and Victor Felder for comments/corrections/discussion.


  1. People seem to think I'm joking here. I can understand why, but try Googling mongodb message queue. You'll find statements like “replica sets in MongoDB work extremely well to allow automatic failover and redundancy”. Basically every company I know of that's done this and has anything resembling scale finds this to be non-optimal, to say the least, but you can't actually find blog posts or talks that discuss that. All you see are the posts and talks from when they first tried it and are in the honeymoon period. This is common with many technologies. You'll mostly find glowing recommendations in public even when, in private, people will tell you about all the problems. Today, if you do the search mentioned above, you'll get a ton of posts talking about how amazing it is to build a message queue on top of Mongo, this footnote, and maybe a couple of blog posts by Kyle Kingsbury depending on your exact search terms.

    If there were an acute failure, you might see a postmortem, but while we'll do postmortems for "the site was down for 30 seconds", we rarely do postmortems for "this takes 10x as much ops effort as the alternative and it's a death by a thousand papercuts", "we architected this thing poorly and now it's very difficult to make changes that ought to be trivial", or "a competitor of ours was able to accomplish the same thing with an order of magnitude less effort". I'll sometimes do informal postmortems by asking everyone involved oblique questions about what happened, but more for my own benefit than anything else, because I'm not sure people really want to hear the whole truth. This is especially sensitive if the effort has generated a round of promotions, which seems to be more common the more screwed up the project. The larger the project, the more visibility and promotions, even if the project could have been done with much less effort.

    [return]
  2. I've spent a lot of time asking about why things are the way they are, both in areas where things are working well, and in areas where things are going badly. Where things are going badly, everyone has ideas. But where things are going well, as in the small company with the light-touch CEO mentioned above, almost no one has any idea why things work. It's magic. If you ask, people will literally tell you that it seems really similar to some other place they've worked, except that things are magically good instead of being terrible for reasons they don't understand. But it's not magic. It's hard work that very few people understand. Something I've seen multiple times is that, when a VP leaves, a company will become a substantially worse place to work, and it will slowly dawn on people that the VP was doing an amazing job at supporting not only their direct reports, but making sure that everyone under them was having a good time. It's hard to see until it changes, but if you don't see anything obviously wrong, either you're not paying attention or someone or many someones have put a lot of work into making sure things run smoothly. [return]

Big companies v. startups

2015-12-17 08:00:00

There's a meme that's been going around for a while now: you should join a startup because the money is better and the work is more technically interesting. Paul Graham says that the best way to make money is to "start or join a startup", which has been "a reliable way to get rich for hundreds of years", and that you can "compress a career's worth of earnings into a few years". Michael Arrington says that you'll become a part of history. Joel Spolsky says that by joining a big company, you'll end up playing foosball and begging people to look at your code. Sam Altman says that if you join Microsoft, you won't build interesting things and may not work with smart people. They all claim that you'll learn more and have better options if you go work at a startup. Some of these links are a decade old now, but the same ideas are still circulating and those specific essays are still cited today.

Let's look at these points one-by-one.

  1. You'll earn much more money at a startup
  2. You won't do interesting work at a big company
  3. You'll learn more at a startup and have better options afterwards

1. Earnings

The numbers will vary depending on circumstances, but we can do a back of the envelope calculation and adjust for circumstances afterwards. Median income in the U.S. is about $30k/yr. The somewhat bogus zeroth order lifetime earnings approximation I'll use is $30k * 40 = $1.2M. A new grad at Google/FB/Amazon with a lowball offer will have a total comp (salary + bonus + equity) of $130k/yr. According to glassdoor's current numbers, someone who makes it to T5/senior at Google should have a total comp of around $250k/yr. These are fairly conservative numbers1.

Someone who's not particularly successful, but not particularly unsuccessful will probably make senior in five years2. For our conservative baseline, let's assume that we'll never make it past senior, into the pay grades where compensation really skyrockets. We'd expect earnings (total comp including stock, but not benefits) to look something like:

Year Total Comp Cumulative
0 130k 130k
1 160k 290k
2 190k 480k
3 220k 700k
4 250k 950k
5 250k 1.2M
... ...
9 250k 2.2M
39 250k 9.7M

Looks like it takes six years to gross a U.S. career's worth of income. If you want to adjust for the increased tax burden from earning a lot in a few years, add an extra year. Maybe add one to two more years if you decide to live in the bay or in NYC. If you decide not to retire, lifetime earnings for a 40 year career comes in at almost $10M.
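
If you want to play with the assumptions (starting comp, raise size, where comp plateaus), here's a minimal sketch in C of the back-of-the-envelope arithmetic behind the table; the $130k starting comp, $30k/yr raises up to a $250k plateau, and the $1.2M lifetime-earnings baseline are the conservative numbers from above, not universal constants.

#include <stdio.h>

int main(void) {
    double comp = 130e3;      /* conservative new grad total comp */
    double raise = 30e3;      /* yearly increase until hitting senior */
    double cap = 250e3;       /* assume comp plateaus at senior */
    double baseline = 1.2e6;  /* ~$30k/yr median income * 40 years */
    double cumulative = 0;

    for (int year = 0; year <= 39; year++) {
        cumulative += comp;
        if (cumulative >= baseline && cumulative - comp < baseline)
            printf("a career's worth of median income grossed in year %d\n", year);
        if (year == 5 || year == 9 || year == 39)
            printf("year %2d: total comp $%.0fk, cumulative $%.1fM\n",
                   year, comp / 1e3, cumulative / 1e6);
        if (comp < cap)
            comp += raise;
    }
    return 0;
}

Running it reproduces the table: the baseline is crossed in year 5 (i.e., after six years of work), and a 40 year career comes in at about $9.7M.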

One common, but false, objection to this is that your earnings will get eaten up by the cost of living in the bay area. Not only is this wrong, it's actually the opposite of correct. You can work at these companies from outside the bay area: most of these trendy companies headquartered in SV or Seattle will pay you maybe 10% less if you work out of a satellite office in a location where the cost of living is around the U.S. median (at least if you work in the US -- pay outside of the US is often much lower for reasons that don't really make sense to me). Market rate at smaller companies in these areas tends to be very low. When I interviewed in places like Portland and Madison, there was a 3x-5x difference between what most small companies were offering and what I could get at a big company in the same city. In places like Austin, where the market is a bit thicker, it was a 2x-3x difference. The difference in pay at 90%-ile companies is greater, not smaller, outside of the SF bay area.

Another objection is that most programmers at most companies don't make this kind of money. If, three or four years ago, you'd told me that there's a career track where it's totally normal to make $250k/yr after a few years, doing work that was fundamentally pretty similar to the work I was doing then, I'm not sure I would have believed it. No one I knew made that kind of money, except maybe the CEO of the company I was working at. Well him, and folks who went into medicine or finance.

The only difference between then and now is that I took a job at a big company. When I took that job, the common story I heard at orientation was basically “I never thought I'd be able to get a job at Google, but a recruiter emailed me and I figured I might as well respond”. For some reason, women were especially likely to have that belief. Anyway, I've told that anecdote to multiple people who didn't think they could get a job at some trendy large company, who then ended up applying and getting in. And what you'll realize if you end up at a place like Google is that most of the people there are just normal programmers like you and me. If anything, I'd say that Google is, on average, less selective than the startup I worked at. When you only have to hire 100 people total, and half of them are folks you worked with as a technical fellow at one big company and then as an SVP at another one, you can afford to hire very slowly and be extremely selective. Big companies will hire more than 100 people per week, which means they can only be so selective.

Despite the hype about how hard it is to get a job at Google/FB/wherever, your odds aren't that bad, and they're certainly better than your odds striking it rich at a startup, for which Patrick McKenzie has a handy cheatsheet:

Roll d100. (Not the right kind of geek? Sorry. rand(100) then.)
0~70: Your equity grant is worth nothing.
71~94: Your equity grant is worth a lump sum of money which makes you about as much money as you gave up working for the startup, instead of working for a megacorp at a higher salary with better benefits.
95~99: Your equity grant is a life changing amount of money. You won't feel rich — you're not the richest person you know, because many of the people you spent the last several years with are now richer than you by definition — but your family will never again give you grief for not having gone into $FAVORED_FIELD like a proper $YOUR_INGROUP.
100: You worked at the next Google, and are rich beyond the dreams of avarice. Congratulations.
Perceptive readers will note that 100 does not actually show up on a d100 or rand(100).

For a more serious take that gives approximately the same results, 80000 hours finds that the average value of a YC founder after 5-9 years is $18M. That sounds great! But there are a few things to keep in mind here. First, YC companies are unusually successful compared to the average startup. Second, in their analysis, 80000 hours notes that 80% of the money belongs to 0.5% of companies. Another 22% are worth enough that founder equity beats working for a big company, but that leaves 77.5% where that's not true.

If you're an employee and not a founder, the numbers look a lot worse. If you're a very early employee you'd be quite lucky to get 1/10th as much equity as a founder. If we guess that 30% of YC startups fail before hiring their first employee, that puts the mean equity offering at $1.8M / .7 = $2.6M. That's low enough that for 5-9 years of work, you really need to be in the 0.5% for the payoff to be substantially better than working at a big company unless the startup is paying a very generous salary.
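
The conditional-probability step above is easy to misread, so here's the same arithmetic as a short sketch; the 1/10th-of-a-founder equity ratio and the 30% fail-before-first-hire rate are the rough guesses from the text, not measured data.

#include <stdio.h>

int main(void) {
    double founder_value = 18e6;     /* 80000 hours: average YC founder value after 5-9 years */
    double employee_ratio = 0.10;    /* rough guess: very early employee gets ~1/10th of a founder's equity */
    double fail_before_hire = 0.30;  /* rough guess: 30% of startups fail before hiring employee #1 */

    double unconditional = founder_value * employee_ratio;
    /* condition on the startup surviving long enough to hire you at all */
    double conditional = unconditional / (1.0 - fail_before_hire);

    printf("mean early-employee equity value: $%.1fM, or $%.1fM given that you were hired\n",
           unconditional / 1e6, conditional / 1e6);
    return 0;
}

That prints $1.8M and roughly $2.6M, matching the numbers above.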

There's a sense in which these numbers are too optimistic. Even if the company is successful and has a solid exit, there are plenty of things that can make your equity grant worthless. It's hard to get statistics on this, but anecdotally, this seems to be the common case in acquisitions.

Moreover, the pitch that you'll only need to work for four years is usually untrue. To keep your lottery ticket until it pays out (or fizzles out), you'll probably have to stay longer. The most common form of equity at early stage startups is ISOs that, by definition, expire at most 90 days after you leave. If you get in early, and leave after four years, you'll have to exercise your options if you want a chance at the lottery ticket paying off. If the company hasn't yet landed a large valuation, you might be able to get away with paying O(median US annual income) to exercise your options. If the company looks like a rocket ship and VCs are piling in, you'll have a massive tax bill, too, all for a lottery ticket.

For example, say you joined company X early on and got options for 1% of the company when it was valued at $1M, so the cost of exercising all of your options is only $10k. Maybe you got lucky and four years later, the company is valued at $1B and your options have only been diluted to .5%. Great! For only $10k you can exercise your options and then sell the equity you get for $5M. Except that the company hasn't IPO'd yet, so if you exercise your options, you're stuck with a tax bill from making $5M, and by the time the company actually has an IPO, your stock could be worth anywhere from $0 to $LOTS. In some cases, you can sell your non-liquid equity for some fraction of its “value”, but my understanding is that it's getting more common for companies to add clauses that limit your ability to sell your equity before the company has an IPO. And even when your contract doesn't have a clause that prohibits you from selling your options on a secondary market, companies sometimes use backchannel communications to keep you from being able to sell your options.
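
To make the company X example concrete, here's the arithmetic as a sketch. The exact tax treatment of the exercise depends on the type of option and the jurisdiction, so treat the "taxable spread" line as an assumption about the typical case rather than tax advice.

#include <stdio.h>

int main(void) {
    double early_valuation  = 1e6;    /* company value when you joined */
    double grant_fraction   = 0.01;   /* options for 1% of the company */
    double exercise_cost    = grant_fraction * early_valuation;  /* ~$10k to exercise everything */

    double later_valuation  = 1e9;    /* value four years later */
    double diluted_fraction = 0.005;  /* grant diluted to 0.5% */
    double paper_value      = diluted_fraction * later_valuation;  /* $5M on paper */

    /* exercising before an IPO generally means recognizing the spread between the
       paper value and the exercise cost for tax purposes, even though the shares
       can't actually be sold yet (assumption: details vary) */
    double spread = paper_value - exercise_cost;

    printf("exercise cost: $%.0fk, paper value: $%.1fM, taxable spread: ~$%.2fM\n",
           exercise_cost / 1e3, paper_value / 1e6, spread / 1e6);
    return 0;
}

The point is that a $10k exercise can come attached to a tax bill computed on a $5M paper gain that you can't actually spend yet.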

Of course not every company is like this -- I hear that Dropbox has generously offered to buy out people's options at their current valuation for multiple years running and they now hand out RSUs instead of options, and Pinterest now gives people seven years to exercise their options after they leave -- but stories like that are uncommon enough that they're notable. The result is that people are incentivized to stay at most startups, even if they don't like the work anymore. From chatting with my friends at well regarded highly-valued startups, it sounds like many of them have a substantial fraction of zombie employees who are just mailing it in and waiting for a liquidity event. A common criticism of large companies is that they've got a lot of lifers who are mailing it in, but most large companies will let you leave any time after the first year and walk away with a pro-rated fraction of your equity package3. It's startups where people are incentivized to stick around even if they don't care about the job.

At a big company, we have a career's worth of income in six years with high probability once you get your foot in the door. This isn't quite as good as the claim that you'll be able to do that in three or four years at a startup, but the risk at a big company is very low once you land the job. In startup land, we have a lottery ticket that appears to have something like a 0.5% chance of paying off for very early employees. Startups might have had a substantially better expected value when Paul wrote about this in 2004, but big company compensation has increased much faster than compensation at the median startup. We're currently in the best job market the world has ever seen for programmers. That's likely to change at some point. The relative returns on going the startup route will probably look a lot better once things change, but for now, saving up some cash while big companies hand it out like candy doesn't seem like a bad idea.

One additional thing to note is that it's possible to get the upside of working at a startup by working at a big company and investing in startups. As of this update (mid-2020), it's common for companies to raise seed rounds at valuations of ~$10M and take checks as small as $5k. This means that, for $100k, you can get as much of the company as you'd get if you joined as a very early employee, perhaps even employee #1 if you're not already very senior or recognized in the industry. But the stock you get by investing has better terms than employee equity even before considering vesting. And since your investment doesn't need to vest -- you get it immediately, whereas you typically have to stay for four years for your employee equity to fully vest -- you only need to invest $25k/yr to get the equity benefit of being a very early employee. Not only can you get better risk adjusted returns (by diversifying), you'll also have much more income if you work at a big company and invest $25k/yr than if you work at a startup.
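
Here's a rough sketch of that comparison under the assumptions in the paragraph above (~$10M seed valuations, $5k minimum checks, four-year employee vesting, and a guess that a very early employee's grant is on the order of 1%); none of these numbers are guarantees about any particular company.

#include <stdio.h>

int main(void) {
    double seed_valuation = 10e6;  /* common seed-round valuation as of mid-2020 */
    double min_check      = 5e3;   /* smallest check many seed rounds will accept */
    double employee_grant = 0.01;  /* rough guess at a very early employee's equity */
    double vest_years     = 4.0;   /* typical employee vesting period */

    double cost_to_match = employee_grant * seed_valuation;  /* $100k buys ~1% */
    double per_year      = cost_to_match / vest_years;       /* ~$25k/yr matches the vesting pace */

    printf("the minimum check buys %.2f%% of the company\n", 100 * min_check / seed_valuation);
    printf("matching a %.0f%% employee grant costs $%.0fk up front, or about $%.0fk/yr over %.0f years\n",
           100 * employee_grant, cost_to_match / 1e3, per_year / 1e3, vest_years);
    return 0;
}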

2. Interesting work

We've established that big companies will pay you decently. But there's more to life than making money. After all, you spend 40 hours a week working (or more). How interesting is the work at big companies? Joel claimed that large companies don't solve interesting problems and that Google is paying untenable salaries to kids with more ultimate frisbee experience than Python, whose main job will be to play foosball in the googleplex. Sam Altman said something similar (but much more measured) about Microsoft, and every third Michael O. Church comment is about how Google tricks a huge number of overqualified programmers into taking jobs that no one wants. Basically every advice thread on HN or reddit aimed at new grads will have multiple people chime in on how the experience you get at startups is better than the experience you'll get slaving away at a big company.

The claim that big companies have boring work is too broad and absolute to even possibly be true. It depends on what kind of work you want to do. When I look at conferences where I find a high percentage of the papers compelling, the stuff I find to be the most interesting is pretty evenly split between big companies and academia, with the (very) occasional paper by a startup. For example, looking at ISCA this year, there's a 2:1 ratio of papers from academia to industry (and all of the industry papers are from big companies). But looking at the actual papers, a significant fraction of the academic papers are reproducing unpublished work that was done at big companies, sometimes multiple years ago. If I only look at the new work that I'm personally interested in, it's about a 1:1 ratio. There are some cases where a startup is working in the same area and not publishing, but that's quite rare and large companies do much more research that they don't publish. I'm just using papers as a proxy for having the kind of work I like. There are also plenty of areas where publishing isn't the norm, but large companies do the bulk of the cutting edge work.

Of course YMMV here depending on what you want to do. I'm not really familiar with the landscape of front-end work, but it seems to me that big companies don't do the vast majority of the cutting edge non-academic work, the way they do with large scale systems. IIRC, there's an HN comment where Jonathan Tang describes how he created his own front-end work: he had the idea, told his manager about it, and got approval to make it happen. It's possible to do that kind of thing at a large company, but people often seem to have an easier time pursuing that kind of idea at a small company. And if your interest is in product, small companies seem like the better bet (though, once again, I'm pretty far removed from that area, so my knowledge is secondhand).

But if you're interested in large systems, at both of my last two jobs, I've seen speculative research projects with 9 figure pilot budgets approved. The pitch for one of them wasn't even that the project would make the company money. It was that a specific research area was important to the company, and that this infrastructure project would enable the company to move faster in that research area. Since the company is a $X billion a year company, the project only needed to move the needle by a small percentage to be worth it. And so a research project whose goal was to speed up the progress of another research project was approved. Internally, this kind of thing is usually determined by politics, which some people will say makes it not worth it. But if you have a stomach for big company politics, startups simply don't have the resources to fund research problems that aren't core to their business. And many problems that would be hard problems at startups are curiosities at large companies.

The flip side of this is that there are experiments that startups have a very easy time doing that established companies can't do. When I was at EC a number of years ago, back when Facebook was still relatively young, the Google ad auction folks remarked to the FB folks that FB was doing the sort of experiments they'd do if they were small enough to do them, but they couldn't just change the structure of their ad auctions now that there was so much money flowing through their auctions. As with everything else we're discussing, there's a trade-off here and the real question is how to weight the various parts of the trade-off, not which side is better in all ways.

The Michael O. Church claim is somewhat weaker: big companies have cool stuff to work on, but you won't be allowed to work on them until you've paid your dues working on boring problems. A milder phrasing of this is that getting to do interesting work is a matter of getting lucky and landing on an initial project you're interested in, but the key thing here is that most companies can give you a pretty good estimate about how lucky you're going to be. Google is notorious for its blind allocation process, and I know multiple people who ended up at MS because they had the choice between a great project at MS and blind allocation at Google, but even Google has changed this to some extent and it's not uncommon to be given multiple team options with an offer. In that sense, big companies aren't much different from startups. It's true that there are some startups that will basically only have jobs that are interesting to you (e.g., an early-stage distributed database startup if you're interested in building a distributed database). But at any startup that's bigger and less specialized, there's going to be work you're interested in and work you're not interested in, and it's going to be up to you to figure out if your offer lets you work on stuff you're interested in.

Something to note is that if, per (1), you have the leverage to negotiate a good compensation package, you also have the leverage to negotiate for work that you want to do. We're in what is probably the best job market for programmers ever. That might change tomorrow, but until it changes, you have a lot of power to get work that you want.

3. Learning / Experience

What about the claim that experience at startups is more valuable? We don't have the data to do a rigorous quantitative comparison, but qualitatively, everything's on fire at startups, and you get a lot of breadth putting out fires, but you don't have the time to explore problems as deeply.

I spent the first seven years of my career at a startup and I loved it. It was total chaos, which gave me the ability to work on a wide variety of different things and take on more responsibility than I would have gotten at a bigger company. I did everything from add fault tolerance to an in-house distributed system to owning a quarter of a project that added ARM instructions to an x86 chip, creating both the fastest ARM chip at the time, as well as the only chip capable of switching between ARM and x86 on the fly4. That was a great learning experience.

But I've had great learning experiences at big companies, too. At Google, my “starter” project was to join a previously one-person project, read the half finished design doc, provide feedback, and then start implementing. The impetus for the project was that people were worried that image recognition problems would require Google to double the number of machines it owns if a somewhat unlikely but not impossible scenario happened. That wasn't too much different from my startup experience, except for that bit about actually having a design doc, and that cutting infra costs could save billions a year instead of millions a year.

Was that project a better or worse learning experience than the equivalent project at a startup? At a startup, the project probably would have continued to be a two-person show, and I would have learned all the things you learn when you bang out a project with not enough time and resources and do half the thing yourself. Instead, I ended up owning a fraction of the project and merely provided feedback on the rest, and it was merely a matter of luck (timing) that I had significant say on fleshing out the architecture. I definitely didn't get the same level of understanding I would have if I implemented half of it myself. On the other hand, the larger team meant that we actually had time to do things like design reviews and code reviews.

If you care about impact, it's also easier to have a large absolute impact at a large company, due to the scale that big companies operate at. If I implemented what I'm doing now for a company the size of the startup I used to work for, it would have had an impact of maybe $10k/month. That's nothing to sneeze at, but it wouldn't have covered my salary. But the same thing at a big company is worth well over 1000x that. There are simply more opportunities to have high impact at large companies because they operate at a larger scale. The corollary to this is that startups are small enough that it's easier to have an impact on the company itself, even when the impact on the world is smaller in absolute terms. Nothing I do is make or break for a large company, but when I worked at a startup, it felt like what we did could change the odds of the company surviving.

As far as having better options after having worked for a big company or having worked for a startup, if you want to work at startups, you'll probably have better options with experience at startups. If you want to work on the sorts of problems that are dominated by large companies, you're better off with more experience in those areas, at large companies. There's no right answer here.

Conclusion

The compensation trade-off has changed a lot over time. When Paul Graham was writing in 2004, he used $80k/yr as a reasonable baseline for what “a good hacker” might make. Adjusting for inflation, that's about $100k/yr now. But the total comp for “a good hacker” is $250k+/yr, not even counting perks like free food and having really solid insurance. The trade-off has heavily tilted in favor of large companies.

The interesting work trade-off has also changed a lot over time, but the change has been… bimodal. The existence of AWS and Azure means that ideas that would have taken millions of dollars in servers and operational expertise can be done with almost no fixed cost and low marginal costs. The scope of things you can do at an early-stage startup that were previously the domain of well funded companies is large and still growing. But at the same time, if you look at the work Google and MS are publishing at top systems conferences, startups are farther from being able to reproduce the scale-dependent work than ever before (and a lot of the most interesting work doesn't get published). Depending on what sort of work you're interested in, things might look relatively better or relatively worse at big companies.

In any case, the reality is that the difference between types of companies is smaller than the differences between companies of the same type. That's true whether we're talking about startups vs. big companies or mobile gaming vs. biotech. This is recursive. The differences between different managers and teams at a company can easily be larger than the differences between companies. If someone tells you that you should work for a certain type of company, that advice is guaranteed to be wrong much of the time, whether that's a VC advocating that you should work for a startup or a Turing award winner telling you that you should work in a research lab.

As for me, well, I don't know you and it doesn't matter to me whether you end up at a big company, a startup, or something in between. Whatever you decide, I hope you get to know your manager well enough to know that they have your back, your team well enough to know that you like working with them, and your project well enough to know that you find it interesting. Big companies have a class of dysfunction that's unusual at startups5 and startups have their own kinds of dysfunction. You should figure out what the relevant tradeoffs are for you and what kind of dysfunction you want to sign up for.

Myself on options vs. cash.

Jocelyn Goldfein on big companies vs. small companies.

Patrick McKenzie on providing business value vs. technical value, with a response from Yossi Kreinin.

Yossi Kreinin on passion vs. money, and with a rebuttal to this post on regret minimization.

Update: The responses on this post have been quite divided. Folks at big companies usually agree, except that the numbers seem low to them, especially for new grads. This is true even for people who live in places which have a cost of living similar to the U.S. median. On the other hand, a lot of people vehemently maintain that the numbers in this post are basically impossible. A lot of people are really invested in the idea that they're making about as much as possible. If you've decided that making less money is the right trade-off for you, that's fine and I don't have any problem with that. But if you really think that you can't make that much money and you don't believe me, I recommend talking to one of the hundreds of thousands of engineers at one of the many large companies that pays well.

Update 2: This post was originally written in 2015, when the $250k number would be conservative but not unreasonable for someone who's "senior" at Google or FB. If we look at the situation today in 2017, people one entire band below that are regularly bringing in $250k and a better estimate might be $300k or $350k. I'm probably more bear-ish on future dev compensation than most people, but things are looking pretty good for now and an event that wipes out big company dev compensation seems likely to do even worse things to the options packages for almost all existing startups.

Update 3: Added a note on being able to invest in startups in 2020. I didn't realize that this was possible without having a lot of wealth until around 2019.

Thanks to Kelly Eskridge, Leah Hanson, Julia Evans, Alex Clemmer, Ben Kuhn, Malcolm Matalka, Nick Bergson-Shilcock, Joe Wilder, Nat Welch, Darius Bacon, Lindsey Kuper, Prabhakar Ragde, Pierre-Yves Baccou, David Turner, Oskar Thoren, Katerina Barone-Adesi, Scott Feeney, Ralph Corderoy, Ezekiel Benjamin Smithburg, @agentwaj, and Kyle Littler for comments/corrections/discussion.


  1. In particular, the Glassdoor numbers seem low for an average. I suspect that's because their average is weighed down by older numbers, while compensation has skyrocketed the past seven years. The average numbers on Glassdoor don't even match the average numbers I heard from other people in my Midwestern satellite office in a large town two years ago, and the market has gone up sharply since then. More recently, on the upper end, I know someone fresh out of school who has a total comp of almost $250k/yr ($350k equity over four years, a $50k signing bonus, plus a generous salary). As is normal, they got a number of offers with varying compensation levels, and then Facebook came in and bid him up. The companies that are serious about competing for people matched the offers, and that was that. This included bids in Seattle and Austin that matched the bids in SV. If you're negotiating an offer, the thing that's critical isn't to be some kind of super genius. It's enough to be pretty good, know what the market is paying, and have multiple offers. This person was worth every penny, which is why he got his offers, but I know several people who are just as good who make half as much just because they only got a single offer and had no leverage.

    Anyway, the point of this footnote is just that the total comp for experienced engineers can go way above the numbers mentioned in the post. In the analysis that follows, keep in mind that I'm using conservative numbers and that an aggressive estimate for experienced engineers would be much higher. Just for example, at Google, senior is level 5 out of 11 on a scale that effectively starts at 3. At Microsoft, it's 63 out of a weirdo scale that starts at 59 and goes to 70-something and then jumps up to 80 (or something like that, I always forget the details because the scale is so silly). Senior isn't a particularly high band, and people at senior often have total comp substantially greater than $250k/yr. Note that these numbers also don't include the above market rate of stock growth at trendy large companies in the past few years. If you've actually taken this deal, your RSUs have likely appreciated substantially.

    [return]
  2. This depends on the company. It's true at places like Facebook and Google, which make a serious effort to retain people. It's nearly completely untrue at places like IBM, National Instruments (NI), and Epic Systems, which don't even try. And it's mostly untrue at places like Microsoft, which tries, but in the most backwards way possible.

    Microsoft (and other mid-tier companies) will give you an ok offer and match good offers from other companies. That by itself is already problematic since it incentivizes people who are interviewing at Microsoft to also interview elsewhere. But the worse issue is that they do the same when retaining employees. If you stay at Microsoft for a long time and aren't one of the few people on the fast track to "partner", your pay is going to end up severely below market, sometimes by as much as a factor of two. When you realize that, and you interview elsewhere, Microsoft will match external offers, but after getting underpaid for years, by hundreds of thousands or millions of dollars (depending on how long you've been there), the promise of making market rate for a single year and then being underpaid for the foreseeable future doesn't seem very compelling. The incentive structure appears as if it were designed to cause people who are between average and outstanding to leave. I've seen this happen with multiple people and I know multiple others who are planning to leave for this exact reason. Their managers are always surprised when this happens, but they shouldn't be; it's eminently predictable.

    The IBM strategy actually makes a lot more sense to me than the Microsoft strategy. You can save a lot of money by paying people poorly. That makes sense. But why bother paying a lot to get people in the door and then incentivizing them to leave? While it's true that the very top people I work with are well compensated and seem happy about it, there aren't enough of those people that you can rely on them for everything.

    [return]
  3. Some are better about this than others. Older companies, like MS, sometimes have yearly vesting, but a lot of younger companies, like Google, have much smoother vesting schedules once you get past the first year. And then there's Amazon, which backloads its offers, knowing that they have a high attrition rate and won't have to pay out much. [return]
  4. Sadly, we ended up not releasing this for business reasons that came up later. [return]
  5. My very first interaction with an employee at big company X orientation was having that employee tell me that I couldn't get into orientation because I wasn't on the list. I had to ask how I could get on the list, and I was told that I'd need an email from my manager to get on the list. This was at around 7:30am because orientation starts at 7:30 and then runs for half a day for reasons no one seems to know (I've asked a lot of people, all the way up to VPs in HR). When I asked if I could just come back later in the day, I was told that if I couldn't get in within an hour I'd have to come back next week. I also asked if the fact that I was listed in some system as having a specific manager was evidence that I was supposed to be at orientation and was told that I had to be on the list. So I emailed my manager, but of course he didn't respond because who checks their email at 7:30am? Luckily, my manager had previously given me his number and told me to call if I ever needed anything, and being able to get into orientation and not have to show up at 7:30am again next week seemed like anything, so I gave him a call. Naturally, he asked to talk to the orientation gatekeeper; when I relayed that to the orientation guy, he told me that he couldn't talk on the phone -- you see, he can only accept emails and can't talk on the phone, not even just to clarify something. Five minutes into orientation, I was already flabbergasted. But, really, I should have considered myself lucky -- the other person who “wasn't on the list” didn't have his manager's phone number, and as far as I know, he had to come back the next week at 7:30am to get into orientation. I asked the orientation person how often this happens, and he told me “very rarely, only once or twice per week”.

    That experience was repeated approximately every half hour for the duration of orientation. I didn't get dropped from any other orientation stations, but when I asked, I found that every station had errors that dropped people regularly. My favorite was the station where someone was standing at the input queue, handing out a piece of paper. The piece of paper informed you that the machine at the station was going to give you an error with some instructions about what to do. Instead of following those instructions, you had to follow the instructions on the piece of paper when the error occurred.

    These kinds of experiences occupied basically my entire first week. Now that I'm past onboarding and onto the regular day-to-day, I have a surreal Kafka-esque experience a few times a week. And I've mostly figured out how to navigate the system (usually, knowing the right person and asking them to intervene solves the problem). What I find to be really funny isn't the actual experience, but that most people I talk to who've been here a while think that it literally cannot be any other way and that things could not possibly be improved. Curiously, people who have been here as long who are very senior tend to agree that the company has its share of big company dysfunction. I wish I had enough data on that to tell which way the causation runs (are people who are aware of the dysfunction more likely to last long enough to become very senior, or does being very senior give you a perspective that lets you see more dysfunction). Something that's even curiouser is that the company invests a fair amount of effort to give people the impression that things are as good as they could possibly be. At orientation, we got a version of history that made it sound as if the company had pioneered everything from the GUI to the web, with multiple claims that we have the best X in the world, even when X is pretty clearly mediocre. It's not clear to me what the company gets out of making sure that most employees don't understand what the downsides are in our own products and processes.

    Whatever the reason, the attitude that things couldn't possibly be improved isn't just limited to administrative issues. A friend of mine needed to find a function to do something that's a trivial one liner on Linux, but that's considerably more involved on our OS. His first attempt was to use boost, but it turns out that the documentation for doing this on the OS we use is complicated enough that boost got this wrong and has had a bug in it for years. A couple of days and 72 lines of code later, he managed to figure out how to create a function to accomplish his goal. Since he wasn't sure if he was missing something, he forwarded the code review to two very senior engineers (one level below Distinguished Engineer). They weren't sure and forwarded it on to the CTO, who said that he didn't see a simpler way to accomplish the same thing in our OS with the APIs as they currently are.

    Later, my friend had a heated discussion with someone on the OS team, who maintained that the documentation on how to do this was very clear, and that it couldn't be clearer, nor could the API be any easier. This is despite this being so hard to do that boost has been wrong for seven years, and that two very senior engineers didn't feel confident enough to review the code and passed it up to a CTO.

    I'm going to stop here not because I'm out of incidents like this, but because a retelling of a half year of big company stories is longer than my blog. Not just longer than this post or any individual post, but longer than everything else on my blog combined, which is a bit over 100k words. Typical estimates for words per page vary between 250 and 1000, putting my rate of surreal experiences at somewhere between 100 and 400 pages every six months. I'm not sure this rate is inherently different from the rate you'd get at startups, but there's a different flavor to the stories and you should have an idea of the flavor by this point.

    [return]

Files are hard

2015-12-12 08:00:00

I haven't used a desktop email client in years. None of them could handle the volume of email I get without at least occasionally corrupting my mailbox. Pine, Eudora, and Outlook have all corrupted my inbox, forcing me to restore from backup. How is it that desktop mail clients are less reliable than gmail, even though my gmail account not only handles more email than I ever had on desktop clients, but also allows simultaneous access from multiple locations across the globe? Distributed systems have an unfair advantage, in that they can be robust against total disk failure in a way that desktop clients can't, but none of the file corruption issues I've had have been from total disk failure. Why has my experience with desktop applications been so bad?

Well, what sort of failures can occur? Crash consistency (maintaining consistent state even if there's a crash) is probably the easiest property to consider, since we can assume that everything, from the filesystem to the disk, works correctly; let's consider that first.

Crash Consistency

Pillai et al. had a paper and presentation at OSDI '14 on exactly how hard it is to save data without corruption or data loss.

Let's look at a simple example of what it takes to save data in a way that's robust against a crash. Say we have a file that contains the text a foo and we want to update the file to contain a bar. The pwrite function looks like it's designed for this exact thing. It takes a file descriptor, what we want to write, a length, and an offset. So we might try

pwrite([file], “bar”, 3, 2)  // write 3 bytes at offset 2

What happens? If nothing goes wrong, the file will contain a bar, but if there's a crash during the write, we could get a boo, a far, or any other combination. Note that you may want to consider this an example over sectors or blocks and not chars/bytes.

If we want atomicity (so we either end up with a foo or a bar but nothing in between) one standard technique is to make a copy of the data we're about to change in an undo log file, modify the “real” file, and then delete the log file. If a crash happens, we can recover from the log. We might write something like

creat(/dir/log);
write(/dir/log, “2,3,foo”, 7);
pwrite(/dir/orig, “bar”, 3, 2);
unlink(/dir/log);

This should allow recovery from a crash without data corruption via the undo log, at least if we're using ext3 and we made sure to mount our drive with data=journal. But we're out of luck if, like most people, we're using the default1 -- with the default data=ordered, the write and pwrite syscalls can be reordered, causing the write to orig to happen before the write to the log, which defeats the purpose of having a log. We can fix that.

creat(/dir/log);
write(/dir/log, “2, 3, foo”);
fsync(/dir/log);  // don't allow write to be reordered past pwrite
pwrite(/dir/orig, 2, “bar”);
fsync(/dir/orig);
unlink(/dir/log);

That should force things to occur in the correct order, at least if we're using ext3 with data=journal or data=ordered. If we're using data=writeback, a crash during the write or fsync to log can leave log in a state where the filesize has been adjusted for the write of “bar”, but the data hasn't been written, which means that the log will contain random garbage. This is because with data=writeback, metadata is journaled, but data operations aren't, which means that data operations (like writing data to a file) aren't ordered with respect to metadata operations (like adjusting the size of a file for a write).

We can fix that by adding a checksum to the log file when creating it. If the contents of log don't contain a valid checksum, then we'll know that we ran into the situation described above.

creat(/dir/log);
write(/dir/log, “2, 3, [checksum], foo”);  // add checksum to log file
fsync(/dir/log);
pwrite(/dir/orig, 2, “bar”);
fsync(/dir/orig);
unlink(/dir/log);

That's safe, at least on current configurations of ext3. But it's legal for a filesystem to end up in a state where the log is never created unless we issue an fsync to the parent directory.

creat(/dir/log);
write(/dir/log, “2, 3, [checksum], foo”);
fsync(/dir/log);
fsync(/dir);  // fsync parent directory of log file
pwrite(/dir/orig, 2, “bar”);
fsync(/dir/orig);
unlink(/dir/log);

That should prevent corruption on any Linux filesystem, but if we want to make sure that the file actually contains “bar”, we need another fsync at the end.

creat(/dir/log);
write(/dir/log, “2, 3, [checksum], foo”);
fsync(/dir/log);
fsync(/dir);
pwrite(/dir/orig, 2, “bar”);
fsync(/dir/orig);
unlink(/dir/log);
fsync(/dir);

That results in consistent behavior and guarantees that our operation actually modifies the file after it's completed, as long as we assume that fsync actually flushes to disk. OS X and some versions of ext3 have an fsync that doesn't really flush to disk. OS X requires fcntl(F_FULLFSYNC) to flush to disk, and some versions of ext3 only flush to disk if the inode changed (which would only happen at most once a second on writes to the same file, since the inode mtime has one second granularity), as an optimization.

Even if we assume fsync issues a flush command to the disk, some disks ignore flush directives for the same reason fsync is gimped on OS X and some versions of ext3 -- to look better in benchmarks. Handling that is beyond the scope of this post, but the Rajimwale et al. DSN '11 paper and related work cover that issue.
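
For concreteness, here's roughly what the final sequence might look like as real C on Linux. This is a sketch of the pattern from the pseudocode above rather than a drop-in implementation: error handling is collapsed into a die() helper, short writes aren't handled, and the checksum is just a placeholder string.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void die(const char *msg) { perror(msg); exit(1); }

/* fsync a directory so that creates/unlinks of entries in it are durable */
static void fsync_path(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) die("open for fsync");
    if (fsync(fd) < 0) die("fsync");
    close(fd);
}

int main(void) {
    /* undo log record: offset 2, length 3, a checksum, and the old bytes "foo" */
    const char *log_record = "2,3,[checksum],foo";

    int log_fd = open("/dir/log", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (log_fd < 0) die("creat log");
    if (write(log_fd, log_record, strlen(log_record)) < 0) die("write log");
    if (fsync(log_fd) < 0) die("fsync log");        /* log contents on disk... */
    close(log_fd);
    fsync_path("/dir");                             /* ...and the log's directory entry */

    int orig_fd = open("/dir/orig", O_WRONLY);
    if (orig_fd < 0) die("open orig");
    if (pwrite(orig_fd, "bar", 3, 2) < 0) die("pwrite orig");  /* the actual update */
    if (fsync(orig_fd) < 0) die("fsync orig");
    close(orig_fd);

    if (unlink("/dir/log") < 0) die("unlink log");  /* update durable; drop the undo log */
    fsync_path("/dir");                             /* make the unlink itself durable */
    return 0;
}

Even this assumes that fsync really reaches stable storage, which, as noted above, isn't always true.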

Filesystem semantics

When the authors examined ext2, ext3, ext4, btrfs, and xfs, they found that there are substantial differences in how code has to be written to preserve consistency. They wrote a tool that collects block-level filesystem traces, and used that to determine which properties don't hold for specific filesystems. The authors are careful to note that they can only determine when properties don't hold -- if they don't find a violation of a property, that's not a guarantee that the property holds.

Different filesystems have very different properties

Xs indicate that a property is violated. The atomicity properties are basically what you'd expect, e.g., no X for single sector overwrite means that writing a single sector is atomic. The authors note that the atomicity of single sector overwrite sometimes comes from a property of the disks they're using, and that running these filesystems on some disks won't give you single sector atomicity. The ordering properties are also pretty much what you'd expect from their names, e.g., an X in the “Overwrite -> Any op” row means that an overwrite can be reordered with some operation.

After creating a tool to test filesystem properties, they created a second tool to check whether applications rely on any potentially incorrect filesystem properties. Because invariants are application specific, the authors wrote checkers for each application tested.

Everything is broken

The authors find issues with most of the applications tested, including things you'd really hope would work, like LevelDB, HDFS, Zookeeper, and git. In a talk, one of the authors noted that the developers of sqlite have a very deep understanding of these issues, but even that wasn't enough to prevent all bugs. That speaker also noted that version control systems were particularly bad about this, and that the developers had a pretty lax attitude that made it very easy for the authors to find a lot of issues in their tools. The most common class of error was incorrectly assuming ordering between syscalls. The next most common class of error was assuming that syscalls were atomic2. These are fundamentally the same issues people run into when doing multithreaded programming. Correctly reasoning about re-ordering behavior and inserting barriers correctly is hard. But even though shared memory concurrency is considered a hard problem that requires great care, writing to files isn't treated the same way, even though it's actually harder in a number of ways.

Something to note here is that while btrfs's semantics aren't inherently less reliable than ext3/ext4, many more applications corrupt data on top of btrfs because developers aren't used to coding against filesystems that allow directory operations to be reordered (ext2 is perhaps the most recent widely used filesystem that allowed that reordering). We'll probably see a similar level of bug exposure when people start using NVRAM drives that have byte-level atomicity. People almost always just run some tests to see if things work, rather than making sure they're coding against what's legal in a POSIX filesystem.

Hardware memory ordering semantics are usually well documented in a way that makes it simple to determine precisely which operations can be reordered with which other operations, and which operations are atomic. By contrast, here's the ext manpage on its three data modes:

journal: All data is committed into the journal prior to being written into the main filesystem.

ordered: This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal.

writeback: Data ordering is not preserved – data may be written into the main filesystem after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal filesystem integrity, however it can allow old data to appear in files after a crash and journal recovery.

The manpage literally refers to rumor. This is the level of documentation we have. If we look back at our example where we had to add an fsync between the write(/dir/log, “2, 3, foo”) and pwrite(/dir/orig, 2, “bar”) to prevent reordering, I don't think the necessity of the fsync is obvious from the description in the manpage. If you look at the hardware memory ordering “manpage” above, it specifically defines the ordering semantics, and it certainly doesn't rely on rumor.

This isn't to say that filesystem semantics aren't documented anywhere. Between lwn and LKML, it's possible to get a good picture of how things work. But digging through all of that is hard enough that it's still quite common for there to be long, uncertain discussions on how things work. A lot of the information out there is wrong, and even when information was right at the time it was posted, it often goes out of date.

When digging through archives, I've often seen a post from 2005 cited to back up the claim that OS X fsync is the same as Linux fsync, and that OS X fcntl(F_FULLFSYNC) is even safer than anything available on Linux. Even at the time, I don't think that was true for the 2.4 kernel, although it was true for the 2.6 kernel. But since 2008 or so, Linux 2.6 with ext3 has done a full flush to disk for each fsync (if the disk supports it, and the filesystem hasn't been specially configured with barriers off).

Another issue is that you often also see exchanges like this one:

Dev 1: Personally, I care about metadata consistency, and ext3 documentation suggests that journal protects its integrity. Except that it does not on broken storage devices, and you still need to run fsck there.
Dev 2: as the ext3 authors have stated many times over the years, you still need to run fsck periodically anyway.
Dev 1: Where is that documented?
Dev 2: linux-kernel mailing list archives.
Dev 3: Probably from some 6-8 years ago, in e-mail postings that I made.

Where's this documented? Oh, in some mailing list post 6-8 years ago (which makes it 12-14 years from today). I don't mean to pick on filesystem devs. The fs devs whose posts I've read are quite polite compared to LKML's reputation; they generously spend a lot of their time responding to basic questions, and I'm impressed by how patient the expert fs devs are with askers. But it's hard for outsiders to trawl through a decade and a half of mailing list postings to figure out which ones are still valid and which ones have been obsoleted!

In their OSDI 2014 talk, the authors of the paper we're discussing noted that when they reported bugs they'd found, developers would often respond “POSIX doesn't let filesystems do that”, without being able to point to any specific POSIX documentation to support their statement. If you've followed Kyle Kingsbury's Jepsen work, this may sound familiar, except devs respond with “filesystems don't do that” instead of “networks don't do that”. I think this is understandable, given how much misinformation is out there. Not being a filesystem dev myself, I'd be a bit surprised if I don't have at least one bug in this post.

Filesystem correctness

We've already encountered a lot of complexity in saving data correctly, and this only scratches the surface of what's involved. So far, we've assumed that the disk works properly, or at least that the filesystem is able to detect when the disk has an error via SMART or some other kind of monitoring. I'd always figured that was the case until I started looking into it, but that assumption turns out to be completely wrong.

The Prabhakaran et al. SOSP '05 paper examined how filesystems respond to disk errors in some detail. They created a fault injection layer that allowed them to inject disk faults and then ran things like chdir, chroot, stat, open, write, etc. to see what would happen.

Of ext3, reiserfs, and NTFS, reiserfs is the best at handling errors, and it seems to be the only one of the three where errors were treated as first class citizens during design. It's mostly consistent about propagating errors to the user on reads, and calling panic on write failures, which triggers a restart and recovery. This general policy allows the filesystem to gracefully handle read failure and avoid data corruption on write failures. However, the authors found a number of inconsistencies and bugs. For example, reiserfs doesn't correctly handle read errors on indirect blocks and leaks space, and a specific type of write failure doesn't prevent reiserfs from updating the journal and committing the transaction, which can result in data corruption.

Reiserfs is the good case. The authors found that ext3 ignored write failures in most cases, and rendered the filesystem read-only in most cases for read failures. This seems like pretty much the opposite of the policy you'd want. Ignoring write failures can easily result in data corruption, and remounting the filesystem as read-only is a drastic overreaction if the read error was a transient error (transient errors are common). Additionally, ext3 did the least consistency checking of the three filesystems and was the most likely to not detect an error. In one presentation, one of the authors remarked that the ext3 code had lots of comments like “I really hope a write error doesn't happen here" in places where errors weren't handled.

NTFS is somewhere in between. The authors found that it has many consistency checks built in, and is pretty good about propagating errors to the user. However, like ext3, it ignores write failures.

The paper has much more detail on the exact failure modes, but the details are mostly of historical interest as many of the bugs have been fixed.

It would be really great to see an updated version of the paper, and in one presentation someone in the audience asked if there was more up to date information. The presenter replied that they'd be interested in knowing what things look like now, but that it's hard to do that kind of work in academia because grad students don't want to repeat work that's been done before, which is pretty reasonable given the incentives they face. Doing replications is a lot of work, often nearly as much work as the original paper, and replications usually give little to no academic credit. This is one of the many cases where the incentives align very poorly with producing real world impact.

The Gunawi et al. FAST '08 paper is another one that it would be great to see replicated today. That paper follows up the paper we just looked at, and examines the error handling code in different file systems, using a simple static analysis tool to find cases where errors are being thrown away. Being thrown away is defined very loosely in the paper -- code like the following

if (error) {
    printk("I have no idea how to handle this error\n");
}

is considered not throwing away the error. Errors are considered to be ignored if the execution flow of the program doesn't depend on the error code returned from a function that returns an error code.
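To make the definition concrete, here's a small self-contained sketch (mine, not from the paper; write_block is a made-up stand-in for a kernel helper) of what counts as dropped vs. not dropped:

#include <stdio.h>

/* hypothetical operation that can fail */
static int write_block(int blocknr) {
    return blocknr % 2 ? -5 /* pretend this is -EIO */ : 0;
}

static void dropped(void) {
    int err = write_block(3);  /* return value is stored... */
    (void)err;                 /* ...but control flow never depends on it: dropped */
}

static void not_dropped(void) {
    int err = write_block(3);
    if (err)  /* control flow depends on err, so this counts as handled, */
        printf("write_block failed: %d\n", err);  /* even though we only log */
}

int main(void) {
    dropped();
    not_dropped();
    return 0;
}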

With that tool, they find that most filesystems drop a lot of error codes:


Rank   By % Broken              By Viol/Kloc
       FS            Frac. (%)  FS            Viol/Kloc
1      IBM JFS       24.4       ext3          7.2
2      ext3          22.1       IBM JFS       5.6
3      JFFS v2       15.7       NFS Client    3.6
4      NFS Client    12.9       VFS           2.9
5      CIFS          12.7       JFFS v2       2.2
6      MemMgmt       11.4       CIFS          2.1
7      ReiserFS      10.5       MemMgmt       2.0
8      VFS            8.4       ReiserFS      1.8
9      NTFS           8.1       XFS           1.4
10     XFS            6.9       NFS Server    1.2

Comments they found next to ignored errors include: "Should we pass any errors back?", "Error, skip block and hope for the best.", "There's no way of reporting error returned from ext3_mark_inode_dirty() to user space. So ignore it.", "Note: todo: log error handler.", "We can't do anything about an error here.", "Just ignore errors at this point. There is nothing we can do except to try to keep going.", "Retval ignored?", and "Todo: handle failure."

One thing to note is that in a lot of cases, ignoring an error is more of a symptom of an architectural issue than a bug per se (e.g., ext3 ignored write errors during checkpointing because it didn't have any kind of recovery mechanism). But even so, the authors of the papers found many real bugs.

Error recovery

Every widely used filesystem has bugs that will cause problems on error conditions, which brings up two questions: can recovery tools robustly fix errors, and how often do errors occur? The Gunawi et al. OSDI '08 paper looks at the first question and finds that fsck, a standard utility for checking and repairing file systems, “checks and repairs certain pointers in an incorrect order . . . the file system can even be unmountable after”.

At this point, we know that it's quite hard to write files in a way that ensures their robustness even when the underlying filesystem is correct, the underlying filesystem will have bugs, and that attempting to repair corruption to the filesystem may damage it further or destroy it. How often do errors happen?

Error frequency

The Bairavasundaram et al. SIGMETRICS '07 paper found that, depending on the exact model, between 5% and 20% of disks would have at least one error over a two year period. Interestingly, many of these were isolated errors -- 38% of disks with errors had only a single error, and 80% had fewer than 50 errors. A follow-up study looked at corruption and found that silent data corruption that was only detected by checksumming happened on 0.5% of disks per year, with one extremely bad model showing corruption on 4% of disks in a year.

It's also worth noting that they found very high locality in error rates between disks on some models of disk. For example, there was one model of disk that had a very high error rate in one specific sector, making many forms of RAID nearly useless for redundancy.

That's another study it would be nice to see replicated. Most studies of disks focus on the failure rate of the entire disk, but if what you're worried about is data corruption, errors on non-failed disks are more worrying than whole-disk failure, which is relatively easy to detect and mitigate.

Conclusion

Files are hard. Butler Lampson has remarked that when they came up with threads, locks, and condition variables at PARC, they thought that they were creating a programming model that anyone could use, but that there's now decades of evidence that they were wrong. We've accumulated a lot of evidence that humans are very bad at reasoning about these kinds of problems, which are very similar to the problems you have when writing correct code to interact with current filesystems. Lampson suggests that the best known general purpose solution is to package up all of your parallelism into as small a box as possible and then have a wizard write the code in the box. Translated to filesystems, that's equivalent to saying that as an application developer, writing to files safely is hard enough that it should be done via some kind of library and/or database, not by directly making syscalls.

Sqlite is quite good in terms of reliability if you want a good default. However, some people find it to be too heavyweight if all they want is a file-based abstraction. What they really want is a sort of polyfill for the file abstraction that works on top of all filesystems without having to understand the differences between different configurations (and even different versions) of each filesystem. Since that doesn't exist yet, when no existing library is sufficient, you need to checksum your data since you will get silent errors and corruption. The only questions are whether or not you detect the errors and whether or not your record format only destroys a single record when corruption happens, or if it destroys the entire database. As far as I can tell, most desktop email client developers have chosen to go the route of destroying all of your email if corruption happens.
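To make that concrete, here's one minimal sketch (my own; the field names are made up and the hash is a placeholder standing in for a real CRC) of a record layout where corruption is detected and costs you at most one record:

#include <stdint.h>
#include <string.h>

/* Each record on disk is [len][crc][payload]; a corrupted or truncated
   record fails its checksum and can be skipped without giving up on the
   rest of the file. */
struct record_header {
    uint32_t len;  /* payload length in bytes */
    uint32_t crc;  /* checksum of the payload */
};

/* placeholder checksum (FNV-1a); a real implementation would use CRC32 or better */
static uint32_t checksum(const uint8_t *p, uint32_t n) {
    uint32_t h = 2166136261u;
    for (uint32_t i = 0; i < n; i++)
        h = (h ^ p[i]) * 16777619u;
    return h;
}

/* Returns 1 if the record starting at buf (header + payload) is intact. */
static int record_ok(const uint8_t *buf, size_t avail) {
    struct record_header hdr;
    if (avail < sizeof hdr)
        return 0;
    memcpy(&hdr, buf, sizeof hdr);
    if (avail - sizeof hdr < hdr.len)
        return 0;  /* truncated record */
    return checksum(buf + sizeof hdr, hdr.len) == hdr.crc;
}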

These studies also hammer home the point that conventional testing isn't sufficient. There were multiple cases where the authors of a paper wrote a relatively simple tool and found a huge number of bugs. You don't need any deep computer science magic to write the tools. The error propagation checker from the paper that found a ton of bugs in filesystem error handling was 4k LOC. If you read the paper, you'll see that the authors observed that the tool had a very large number of shortcomings because of its simplicity, but despite those shortcomings, it was able to find a lot of real bugs. I wrote a vaguely similar tool at my last job to enforce some invariants, and it was literally two pages of code. It didn't even have a real parser (it just went line-by-line through files and did some regexp matching to detect the simple errors that it's possible to detect with just a state machine and regexes), but it found enough bugs that it paid for itself in development time the first time I ran it.

Almost every software project I've seen has a lot of low hanging testing fruit. Really basic random testing, static analysis, and fault injection can pay for themselves in terms of dev time pretty much the first time you use them.

Appendix

I've probably covered less than 20% of the material in the papers I've referred to here. Here's a bit about some other neat things you can find in those papers, and others.

Pillai et al., OSDI '14: this paper goes into much more detail about what's required for crash consistency than this post does. It also gives a fair amount of detail about how exactly applications fail, including diagrams of traces that indicate what false assumptions are embedded in each trace.

Chidambaram et al., FAST '12: the same filesystem primitives are responsible for both consistency and ordering. The authors propose alternative primitives that separate these concerns, allowing better performance while maintaining safety.

Rajimwale et al. DSN '11: you probably shouldn't use disks that ignore flush directives, but in case you do, here's a protocol that forces those disks to flush using normal filesystem operations. As you might expect, the performance for this is quite bad.

Prabhakaran et al. SOSP '05: This has a lot more detail on filesystem responses to errors than was covered in this post. The authors also discuss JFS, an IBM filesystem for AIX. Although it was designed for high reliability systems, it isn't particularly more reliable than the alternatives. Related material is covered further in DSN '08, StorageSS '06, DSN '06, FAST '08, and USENIX '09, among others.

Gunawi et al. FAST '08: Again, much more detail than is covered in this post on when errors get dropped, and how they wrote their tools. They also have some call graphs that give you one rough measure of the complexity involved in a filesystem. The XFS call graph is particularly messy, and one of the authors noted in a presentation that an XFS developer said that XFS was fun to work on since they took advantage of every possible optimization opportunity regardless of how messy it made things.

Bairavasundaram et al. SIGMETRICS '07: There's a lot of information on disk error locality and disk error probability over time that isn't covered in this post. A followup paper in FAST '08 has more details.

Gunawi et al. OSDI '08: This paper has a lot more detail about when fsck doesn't work. In a presentation, one of the authors mentioned that fsck is the only program that's ever insulted him. Apparently, if you have a corrupt pointer that points to a superblock, fsck destroys the superblock (possibly rendering the disk unmountable), tells you something like "you dummy, you must have run fsck on a mounted disk", and then gives up. In the paper, the authors reimplement basically all of fsck using a declarative model, and find that the declarative version is shorter, easier to understand, and much easier to extend, at the cost of being somewhat slower.

Memory errors are beyond the scope of this post, but memory corruption can cause disk corruption. This is especially annoying because memory corruption can cause you to take a checksum of bad data and write a bad checksum. It's also possible to corrupt in-memory pointers, which often results in something very bad happening. See the Zhang et al. FAST '10 paper for more on how ZFS is affected by that. There's a meme going around that ZFS is safe against memory corruption because it checksums, but that paper found that critical things held in memory aren't checksummed, and that memory errors can cause data corruption in real scenarios.

The sqlite devs are serious about both documentation and testing. If I wanted to write a reliable desktop application, I'd start by reading the sqlite docs and then talking to some of the core devs. If I wanted to write a reliable distributed application I'd start by getting a job at Google and then reading the design docs and postmortems for GFS, Colossus, Spanner, etc. J/k, but not really.

We haven't looked at formal methods at all, but there have been a variety of attempts to formally verify properties of filesystems, such as SibylFS.

This list isn't intended to be exhaustive. It's just a list of things I've read that I think are interesting.

Update: many people have read this post and suggested that, in the first file example, you should use the much simpler protocol of copying the file to be modified to a temp file, modifying the temp file, and then renaming the temp file to overwrite the original file. In fact, that's probably the most common comment I've gotten on this post. If you think this solves the problem, I'm going to ask you to pause for five seconds and consider the problems this might have.
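For concreteness, here's roughly what that suggestion looks like in the notation used earlier (my sketch of a reasonably careful version of the suggestion, not a recommendation):

creat(/dir/tmp);
write(/dir/tmp, <entire contents of orig, with the modification applied>);
fsync(/dir/tmp);
rename(/dir/tmp, /dir/orig);
fsync(/dir);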

The main problems this has are:

  • rename isn't atomic on crash. POSIX says that rename is atomic, but this only applies to normal operation, not to crashes.
  • even if the technique worked, the performance is very poor
  • how do you handle hardlinks?
  • metadata can be lost; this can sometimes be preserved, under some filesystems, with ioctls, but now you have filesystem specific code just for the non-crash case
  • etc.

The fact that so many people thought that this was a simple solution to the problem demonstrates that this problem is one that people are prone to underestimating, even when they're explicitly warned that people tend to underestimate this problem!

This post reproduces some of the results from these papers on modern filesystems as of 2017.

This talk (transcript) contains a number of newer results and discusses hardware issues in more detail.

Thanks to Leah Hanson, Katerina Barone-Adesi, Jamie Brandon, Kamal Marhubi, Joe Wilder, David Turner, Benjamin Gilbert, Tom Murphy, Chris Ball, Joe Doliner, Alexy Romanov, Mindy Preston, Paul McJones, Evan Jones, and Jason Petersen for comments/corrections/discussion.


  1. Turns out some commercially supported distros only support data=ordered. Oh, and when I said data=ordered was the default, that's only the case for kernels before 2.6.30. After 2.6.30, there's a config option, CONFIG_EXT3_DEFAULTS_TO_ORDERED. If that's not set, the default becomes data=writeback. [return]
  2. Cases where overwrite atomicity is required were documented as known issues, and all such cases assumed single-block atomicity and not multi-block atomicity. By contrast, multiple applications (LevelDB, Mercurial, and HSQLDB) had bad data corruption bugs that came from assuming appends are atomic.

    That seems to be an indirect result of a commonly used update protocol, where modifications are logged via appends, and then logged data is written via overwrites. Application developers are careful to check for and handle errors in the actual data, but the errors in the log file are often overlooked.

    There are a number of other classes of errors discussed, and I recommend reading the paper for the details if you work on an application that writes files.

    [return]

Why use ECC?

2015-11-27 08:00:00

Jeff Atwood, perhaps the most widely read programming blogger, has a post that makes a case against using ECC memory. My read is that his major points are:

  1. Google didn't use ECC when they built their servers in 1999
  2. Most RAM errors are hard errors and not soft errors
  3. RAM errors are rare because hardware has improved
  4. If ECC were actually important, it would be used everywhere and not just servers. Paying for optional stuff like this is "awfully enterprisey"

Let's take a look at these arguments one by one:

1. Google didn't use ECC in 1999

Not too long after Google put these non-ECC machines into production, they realized this was a serious error and not worth the cost savings. If you think cargo culting what Google does is a good idea because it's Google, here are some things you might do:

A. Put your servers into shipping containers.

Articles are still written today about what a great idea this is, even though this was an experiment at Google that was deemed unsuccessful. Turns out, even Google's experiments don't always succeed. In fact, their propensity for “moonshots” in the early days meant that they had more failed experiments than most companies. Copying their failed experiments isn't a particularly good strategy.

B. Cause fires in your own datacenters

Part of the post talks about how awesome these servers are:

Some people might look at these early Google servers and see an amateurish fire hazard. Not me. I see a prescient understanding of how inexpensive commodity hardware would shape today's internet. I felt right at home when I saw this server; it's exactly what I would have done in the same circumstances

The last part of that is true. But the first part has a grain of truth, too. When Google started designing their own boards, one generation had a regrowth1 issue that caused a non-zero number of fires.

BTW, if you click through to Jeff's post and look at the photo that the quote refers to, you'll see that the boards have a lot of flex in them. That caused problems and was fixed in the next generation. You can also observe that the cabling is quite messy, which also caused problems, and was also fixed in the next generation. There were other problems as well. Jeff's argument here appears to be that, if he were there at the time, he would've seen the exact same opportunities that early Google engineers did, and since Google did this, it must've been the right thing even if it doesn't look like it. But a number of the things that make it look like not the right thing actually made it not the right thing.

C. Make servers that injure your employees

One generation of Google servers had infamously sharp edges, giving them the reputation of being made of “razor blades and hate”.

D. Create weather in your datacenters

From talking to folks at a lot of large tech companies, it seems that most of them have had a climate control issue resulting in clouds or fog in their datacenters. You might call this a clever plan by Google to reproduce Seattle weather so they can poach MS employees. Alternately, it might be a plan to create literal cloud computing. Or maybe not.

Note that these are all things Google tried and then changed. Making mistakes and then fixing them is common in every successful engineering organization. If you're going to cargo cult an engineering practice, you should at least cargo cult current engineering practices, not something that was done in 1999.

When Google used servers without ECC back in 1999, they found a number of symptoms that were ultimately due to memory corruption, including a search index that returned effectively random results to queries. The actual failure mode here is instructive. I often hear that it's ok to ignore ECC on these machines because it's ok to have errors in individual results. But even when you can tolerate occasional errors, ignoring errors means that you're exposing yourself to total corruption, unless you've done a very careful analysis to make sure that a single error can only contaminate a single result. In research that's been done on filesystems, it's been repeatedly shown that despite making valiant attempts at creating systems that are robust against a single error, it's extremely hard to do so and basically every heavily tested filesystem can have a massive failure from a single error (see the output of Andrea and Remzi's research group at Wisconsin if you're curious about this). I'm not knocking filesystem developers here. They're better at that kind of analysis than 99.9% of programmers. It's just that this problem has been repeatedly shown to be hard enough that humans cannot effectively reason about it, and automated tooling for this kind of analysis is still far from a push-button process. In their book on warehouse scale computing, Google discusses error correction and detection and ECC is cited as their slam dunk case for when it's obvious that you should use hardware error correction2.

Google has great infrastructure. From what I've heard of the infra at other large tech companies, Google's sounds like the best in the world. But that doesn't mean that you should copy everything they do. Even if you look at their good ideas, it doesn't make sense for most companies to copy them. They created a replacement for Linux's work stealing scheduler that uses both hardware run-time information and static traces to allow them to take advantage of new hardware in Intel's server processors that lets you dynamically partition caches between cores. If used across their entire fleet, that could easily save Google more money in a week than stackexchange has spent on machines in their entire history. Does that mean you should copy Google? No, not unless you've already captured all the lower hanging fruit, which includes things like making sure that your core infrastructure is written in highly optimized C++, not Java or (god forbid) Ruby. And the thing is, for the vast majority of companies, writing in a language that imposes a 20x performance penalty is a totally reasonable decision.

2. Most RAM errors are hard errors

The case against ECC quotes this section of a study on DRAM errors (the bolding is Jeff's):

Our study has several main findings. First, we find that approximately 70% of DRAM faults are recurring (e.g., permanent) faults, while only 30% are transient faults. Second, we find that large multi-bit faults, such as faults that affects an entire row, column, or bank, constitute over 40% of all DRAM faults. Third, we find that almost 5% of DRAM failures affect board-level circuitry such as data (DQ) or strobe (DQS) wires. Finally, we find that chipkill functionality reduced the system failure rate from DRAM faults by 36x.

This seems to betray a lack of understanding of the implications of this study, as this quote doesn't sound like an argument against ECC; it sounds like an argument for "chipkill", a particular class of ECC. Putting that aside, Jeff's post points out that hard errors are twice as common as soft errors, and then mentions that they run memtest on their machines when they get them. First, a 2:1 ratio isn't so large that you can just ignore soft errors. Second, the post implies that Jeff believes that hard errors are basically immutable and can't surface after some time, which is incorrect. You can think of electronics as wearing out just the same way mechanical devices wear out. The mechanisms are different, but the effects are similar. In fact, if you compare reliability analysis of chips vs. other kinds of reliability analysis, you'll find they often use the same families of distributions to model failures. And, if hard errors were immutable, they would generally get caught in testing by the manufacturer, who can catch errors much more easily than consumers can because they have hooks into circuits that let them test memory much more efficiently than you can do in your server or home computer. Third, Jeff's line of reasoning implies that ECC can't help with detection or correction of hard errors, which is not only incorrect but directly contradicted by the quote.

So, how often are you going to run memtest on your machines to try to catch these hard errors, and how much data corruption are you willing to live with? One of the key uses of ECC is not to correct errors, but to signal errors so that hardware can be replaced before silent corruption occurs. No one's going to consent to shutting down everything on a machine every day to run memtest (that would be more expensive than just buying ECC memory), and even if you could convince people to do that, it won't catch as many errors as ECC will.

When I worked at a company that owned about 1000 machines, we noticed that we were getting strange consistency check failures, and after maybe half a year we realized that the failures were more likely to happen on some machines than others. The failures were quite rare, maybe a couple times a week on average, so it took a substantial amount of time to accumulate the data, and more time for someone to realize what was going on. Without knowing the cause, analyzing the logs to figure out that the errors were caused by single bit flips (with high probability) was also non-trivial. We were lucky that, as a side effect of the process we used, the checksums were calculated in a separate process, on a different machine, at a different time, so that an error couldn't corrupt the result and propagate that corruption into the checksum. If you merely try to protect yourself with in-memory checksums, there's a good chance you'll perform a checksum operation on the already corrupted data and compute a valid checksum of bad data unless you're doing some really fancy stuff with calculations that carry their own checksums (and if you're that serious about error correction, you're probably using ECC regardless). Anyway, after completing the analysis, we found that memtest couldn't detect any problems, but that replacing the RAM on the bad machines caused a one to two order of magnitude reduction in error rate. Most services don't have this kind of checksumming we had; those services will simply silently write corrupt data to persistent storage and never notice problems until a customer complains.

3. Due to advances in hardware manufacturing, errors are very rare

Jeff says

I do seriously question whether ECC is as operationally critical as we have been led to believe [for servers], and I think the data shows modern, non-ECC RAM is already extremely reliable ... Modern commodity computer parts from reputable vendors are amazingly reliable. And their trends show from 2012 onward essential PC parts have gotten more reliable, not less. (I can also vouch for the improvement in SSD reliability as we have had zero server SSD failures in 3 years across our 12 servers with 24+ drives ...

and quotes a study.

The data in the post isn't sufficient to support this assertion. Note that since RAM usage has been increasing and continues to increase at a fast exponential rate, RAM failures would have to decrease at a greater exponential rate to actually reduce the incidence of data corruption. Furthermore, as chips continue to shrink, features get smaller, making the kind of wearout issues discussed in “2” more common. For example, at 20nm, a DRAM capacitor might hold something like 50 electrons, and that number will get smaller for next-generation DRAM as things continue to shrink.

The 2012 study that Atwood quoted has this graph on corrected errors (a subset of all errors) on ten randomly selected failing nodes (6% of nodes had at least one failure):

We're talking between 10 and 10k errors for a typical node that has a failure, and that's a cherry-picked study from a post that's arguing that you don't need ECC. Note that the nodes here only have 16GB of RAM, which is an order of magnitude less than modern servers often have, and that this was on an older process node that was less vulnerable to noise than we are now. For anyone who's used to dealing with reliability issues and just wants to know the FIT rate, the study finds a FIT rate of between 0.057 and 0.071 faults per Mbit (which, contra Atwood's assertion, is not a shockingly low number). If you take the most optimistic FIT rate, .057, and do the calculation for a server without much RAM (here, I'm using 128GB, since the servers I see nowadays typically have between 128GB and 1.5TB of RAM), you get an expected value of .057 * 1000 * 1000 * 8760 / 1000000000 = .5 faults per year per server (FIT rates are per billion device-hours, 128GB is roughly a million Mbit, and there are 8760 hours in a year). Note that this is for faults, not errors. From the graph above, we can see that a fault can easily cause hundreds or thousands of errors per month. Another thing to note is that there are multiple nodes that don't have errors at the start of the study but develop errors later on. So, in fact, the cherry-picked study that Jeff links contradicts Jeff's claim about reliability.

Sun/Oracle famously ran into this a number of decades ago. Transistors and DRAM capacitors were getting smaller, much as they are now, and memory usage and caches were growing, much as they are now. Between having smaller transistors that were less resilient to transient upset as well as more difficult to manufacture, and having more on-chip cache, the vast majority of server vendors decided to add ECC to their caches. Sun decided to save a few dollars and skip the ECC. The direct result was that a number of Sun customers reported sporadic data corruption. It took Sun multiple years to spin a new architecture with ECC cache, and Sun made customers sign an NDA to get replacement chips. Of course there's no way to cover up this sort of thing forever, and when it came up, Sun's reputation for producing reliable servers took a permanent hit, much like the time they tried to cover up poor performance results by introducing a clause into their terms of services disallowing benchmarking.

Another thing to note here is that when you're paying for ECC, you're not just paying for ECC, you're paying for parts (CPUs, boards) that have been qual'd more thoroughly. You can easily see this with disk failure rates, and I've seen many people observe this in their own private datasets. In terms of public data, I believe Andrea and Remzi's group had a SIGMETRICS paper a few years back that showed that SATA drives were 4x more likely than SCSI drives to have disk read failures, and 10x more likely to have silent data corruption. This relationship held true even with drives from the same manufacturer. There's no particular reason to think that the SCSI interface should be more reliable than the SATA interface, but it's not about the interface. It's about buying a high-reliability server part vs. a consumer part. Maybe you don't care about disk reliability in particular because you checksum everything and can easily detect disk corruption, but there are some kinds of corruption that are harder to detect.

[2024 update, almost a decade later]: looking at this retrospectively, we can see that Jeff's assertion that commodity parts are reliable, "modern commodity computer parts from reputable vendors are amazingly reliable", is still not true. Looking at real-world user data from Firefox, Gabriele Svelto estimated that approximately 10% to 20% of all Firefox crashes were due to memory corruption. Various game companies that track this kind of thing also report that a significant fraction of user crashes appear to be due to data corruption, although I don't have an estimate from any of those companies handy. A more direct argument is that if you talk to folks at big companies that run a lot of ECC memory and look at the rate of ECC errors, there are quite a few errors detected by ECC memory despite ECC memory typically having a lower error rate than random non-ECC memory. This kind of argument is frequently made (it was detailed above a decade ago, and when I looked at this fairly recently while working at Twitter, there had been no revolution in memory technology that reduced the need for ECC relative to the rates discussed in papers a decade ago), but it often doesn't resonate with folks who say things like "well, those bits probably didn't matter anyway", "most memory ends up not getting read", etc. Looking at real-world crashes and noting that the amount of silent data corruption should be expected to be much higher than the rate of crashes seems to resonate with people who aren't excited by looking at raw FIT rates in datacenters.

4. If ECC were actually important, it would be used everywhere and not just servers.

One way to rephrase this is as a kind of cocktail party efficient markets hypothesis. This can't be important, because if it was, we would have it. Of course this is incorrect, and there are many things that would be beneficial to consumers that we don't have, such as cars that are designed to be safe instead of just getting the maximum score in crash tests. Looking at this with respect to the server and consumer markets, this argument can be rephrased as “If this feature were actually important for servers, it would be used in non-servers”, which is incorrect. A primary driver of what's available in servers vs. non-servers is what can be added that buyers of servers will pay a lot for, to allow for price discrimination between server and non-server parts. This is actually one of the more obnoxious problems facing large cloud vendors -- hardware vendors are able to jack up the price on parts that have server features because the features are much more valuable in server applications than in desktop applications. Most home users don't mind, giving hardware vendors a mechanism to extract more money out of people who buy servers while still providing cheap parts for consumers.

Cloud vendors often have enough negotiating leverage to get parts at cost, but that only works where there's more than one viable vendor. Some of the few areas where there aren't any viable competitors include CPUs and GPUs. There have been a number of attempts by other CPU vendors to get into the server market, but each attempt so far has been fatally flawed in a way that made it obvious from an early stage that the attempt was doomed (and these are often 5 year projects, so that's a lot of time to spend on a doomed project). The Qualcomm effort has been getting a lot of hype, but when I talk to folks I know at Qualcomm they all tell me that the current chip is basically for practice, since Qualcomm needed to learn how to build a server chip from all the folks they poached from IBM, and that the next chip is the first chip that has any hope of being competitive. I have high hopes for Qualcomm, as well as an ARM effort to build good server parts, but those efforts are still a ways away from bearing fruit.

The near total unsuitability of current ARM (and POWER) options (not including hypothetical variants of Apple's impressive ARM chip) for most server workloads in terms of performance per TCO dollar is a bit of a tangent, so I'll leave that for another post, but the point is that Intel has the market power to make people pay extra for server features, and they do so. Additionally, some features are genuinely more important for servers than for mobile devices with a few GB of RAM and a power budget of a few watts that are expected to randomly crash and reboot periodically anyway.

Conclusion

Should you buy ECC RAM? That depends. For servers, it's probably a good bet considering the cost, although it's hard to really do a cost/benefit analysis because it's really hard to figure out the cost of silent data corruption, or the cost of having some risk of burning half a year of developer time tracking down intermittent failures only to find that they were caused by using non-ECC memory.

For normal desktop use, I'm pro-ECC, but if you don't have regular backups set up, doing backups probably has a better ROI than ECC. But once you have the absolute basics set up, there's a fairly strong case for ECC for consumer machines. For example, if you have backups without ECC, you can easily write corrupt data into your primary store and replicate that corrupt data into backup. But speaking more generally, big companies running datacenters are probably better set up to detect data corruption and more likely to have error correction at higher levels that allow them to recover from data corruption than consumers, so the case for consumers is arguably stronger than it is for servers, where the case is strong enough that it's generally considered a no-brainer. A major reason consumers don't generally use ECC isn't that it isn't worth it for them, it's that they just have no idea how to attribute crashes and data corruption when they happen. Once you start doing this, as Google and other large companies do, it's immediately obvious that ECC is worth the cost even when you have multiple levels of error correction operating at higher levels. This is analogous to what we see with files, where big tech companies write software for their datacenters that's much better at dealing with data corruption than big tech companies that write consumer software (and this is often true within the same company). To the user, the cost of having their web app corrupt their data isn't all that different from when their desktop app corrupts their data; the difference is that when their web app corrupts data, it's clearer that it's the company's fault, which changes the incentives for companies.

Appendix: security

If you allow any sort of code execution, even sandboxed execution, there are attacks like rowhammer which can allow users to cause data corruption and there have been instances where this has allowed for privilege escalation. ECC doesn't completely mitigate the attack, but it makes it much harder.

Thanks to Prabhakar Ragde, Tom Murphy, Jay Weisskopf, Leah Hanson, Joe Wilder, and Ralph Corderoy for discussion/comments/corrections. Also, thanks (or maybe anti-thanks) to Leah for convincing me that I should write up this off the cuff verbal comment as a blog post. Apologies for any errors, the lack of references, and the stilted prose; this is basically a transcription of half of a conversation and I haven't explained terms, provided references, or checked facts in the level of detail that I normally do.


  1. One of the funnier examples of this that I can think of, at least to me, is the magical self-healing fuse. Although there are many implementations, you can think of a fuse on a chip as basically a resistor. If you run some current through it, you should get a connection. If you run a lot of current through it, you'll heat up the resistor and eventually destroy it. This is commonly used to fuse off features on chips, or to do things like set the clock rate, with the idea being that once a fuse is blown, there's no way to unblow the fuse.

    Once upon a time, there was a semiconductor manufacturer that rushed their manufacturing process a bit and cut the tolerances a bit too fine in one particular process generation. After a few months (or years), the connection between the two ends of the fuse could regrow and cause the fuse to unblow. If you're lucky, the fuse will be something like the high-order bit of the clock multiplier, which will basically brick the chip if changed. If you're not lucky, it will be something that results in silent data corruption.

    I heard about problems in that particular process generation from that manufacturer from multiple people at different companies, so this wasn't an isolated thing. When I say this is funny, I mean that it's funny when you hear this story at a bar. It's maybe less funny when you discover, after a year of testing, that some of your chips are failing because their fuse settings are nonsensical, and you have to respin your chip and delay the release for 3 months. BTW, this fuse regrowth thing is another example of a class of error that can be mitigated with ECC.

    This is not the issue that Google had; I only mention this because a lot of people I talk to are surprised by the ways in which hardware can fail.

    [return]
  2. In case you don't want to dig through the whole book, most of the relevant passage is:

    In a system that can tolerate a number of failures at the software level, the minimum requirement made to the hardware layer is that its faults are always detected and reported to software in a timely enough manner as to allow the software infrastructure to contain it and take appropriate recovery actions. It is not necessarily required that hardware transparently corrects all faults. This does not mean that hardware for such systems should be designed without error correction capabilities. Whenever error correction functionality can be offered within a reasonable cost or complexity, it often pays to support it. It means that if hardware error correction would be exceedingly expensive, the system would have the option of using a less expensive version that provided detection capabilities only. Modern DRAM systems are a good example of a case in which powerful error correction can be provided at a very low additional cost. Relaxing the requirement that hardware errors be detected, however, would be much more difficult because it means that every software component would be burdened with the need to check its own correct execution. At one early point in its history, Google had to deal with servers that had DRAM lacking even parity checking. Producing a Web search index consists essentially of a very large shuffle/merge sort operation, using several machines over a long period. In 2000, one of the then monthly updates to Google's Web index failed prerelease checks when a subset of tested queries was found to return seemingly random documents. After some investigation a pattern was found in the new index files that corresponded to a bit being stuck at zero at a consistent place in the data structures; a bad side effect of streaming a lot of data through a faulty DRAM chip. Consistency checks were added to the index data structures to minimize the likelihood of this problem recurring, and no further problems of this nature were reported. Note, however, that this workaround did not guarantee 100% error detection in the indexing pass because not all memory positions were being checked—instructions, for example, were not. It worked because index data structures were so much larger than all other data involved in the computation, that having those self-checking data structures made it very likely that machines with defective DRAM would be identified and excluded from the cluster. The following machine generation at Google did include memory parity detection, and once the price of memory with ECC dropped to competitive levels, all subsequent generations have used ECC DRAM.

    [return]

What's worked in Computer Science: 1999 v. 2015

2015-11-23 08:00:00

In 1999, Butler Lampson gave a talk about the past and future of “computer systems research”. Here are his opinions from 1999 on "what worked".

Yes                  Maybe                No
Virtual memory       Parallelism          Capabilities
Address spaces       RISC                 Fancy type systems
Packet nets          Garbage collection   Functional programming
Objects / subtypes   Reuse                Formal methods
RDB and SQL                               Software engineering
Transactions                              RPC
Bitmaps and GUIs                          Distributed computing
Web                                       Security
Algorithms


Basically everything that was a Yes in 1999 is still important today. Looking at the Maybe category, we have:

Parallelism

This is, unfortunately, still a Maybe. Between the end of Dennard scaling and the continued demand for compute, chips now expose plenty of parallelism to the programmer. Concurrency has gotten much easier to deal with, but really extracting anything close to the full performance available isn't much easier than it was in 1999.

In 2009, Erik Meijer and Butler Lampson talked about this, and Lampson's comment was that when they came up with threads, locks, and condition variables at PARC, they thought they were creating something that programmers could use to take advantage of parallelism, but that they now have decades of evidence that they were wrong. Lampson further remarks that to do parallel programming, what you need to do is put all your parallelism into a little box and then have a wizard go write the code in that box. Not much has changed since 2009.

Also, note that I'm using the same criteria to judge all of these. Whenever you say something doesn't work, someone will drop in and say that, no wait, here's a PhD thesis that demonstrates that someone has once done this thing, or here are nine programs that demonstrate that Idris is, in fact, widely used in large scale production systems. I take Lampson's view, which is that if the vast majority of programmers are literally incapable of using a certain class of technologies, that class of technologies has probably not succeeded.

On recent advancements in parallelism, Intel recently added features that make it easier to take advantage of trivial parallelism by co-scheduling multiple applications on the same machine without interference, but outside of a couple big companies, no one's really taking advantage of this yet. They also added hardware support for STM recently, but it's still not clear how much STM helps with usability when designing large scale systems.

RISC

If this was a Maybe in 1999 it's certainly a No now. In the 80s and 90s a lot of folks, probably the majority of folks, believed RISC was going to take over the world and x86 was doomed. In 1991, Apple, IBM, and Motorola got together to create PowerPC (PPC) chips that were going to demolish Intel in the consumer market. They opened the Somerset facility for chip design, and collected a lot of their best folks for what was going to be a world changing effort. At the upper end of the market, DEC's Alpha chips were getting twice the performance of Intel's, and their threat to the workstation market was serious enough that Microsoft ported Windows NT to the Alpha. DEC started a project to do dynamic translation from x86 to Alpha; at the time the project started, the projected performance of x86 basically running in emulation on Alpha was substantially better than native x86 on Intel chips.

In 1995, Intel released the Pentium Pro. At the time, it had better workstation integer performance than anything else out there, including much more expensive chips targeted at workstations, and its floating point performance was within a factor of 2 of high-end chips. That immediately destroyed the viability of the mainstream Apple/IBM/Moto PPC chips, and in 1998 IBM pulled out of the Somerset venture1 and everyone gave up on really trying to produce desktop class PPC chips. Apple continued to sell PPC chips for a while, but they had to cook up bogus benchmarks to make the chips look even remotely competitive. By the time DEC finished their dynamic translation efforts, x86 in translation was barely faster than native x86 in floating point code, and substantially slower in integer code. While that was a very impressive technical feat, it wasn't enough to convince people to switch from x86 to Alpha, which killed DEC's attempts to move into the low-end workstation and high-end PC market.

In 1999, high-end workstations were still mostly RISC machines, and supercomputers were a mix of custom chips, RISC chips, and x86 chips. Today, Intel dominates the workstation market with x86, and the supercomputer market has also moved towards x86. Other than POWER, RISC ISAs were mostly wiped out (like PA-RISC) or managed to survive by moving to the low-margin embedded market (like MIPS), which wasn't profitable enough for Intel to pursue with any vigor. You can see a kind of instruction set arbitrage that MIPS and ARM have been able to take advantage of because of this. Cavium and ARM will sell you a network card that offloads a lot of processing to the NIC, which have a bunch of cheap MIPS and ARM processors, respectively, on board. The low-end processors aren't inherently better at processing packets than Intel CPUs; they're just priced low enough that Intel won't compete on price because they don't want to cannibalize their higher margin chips with sales of lower margin chips. MIPS and ARM have no such concerns because MIPS flunked out of the high-end processor market and ARM has yet to get there. If the best thing you can say about RISC chips is that they manage to exist in areas where the profit margins are too low for Intel to care, that's not exactly great evidence of a RISC victory. That Intel ceded the low end of the market might seem ironic considering Intel's origins, but they've always been aggressive about moving upmarket (they did the same thing when they transitioned from DRAM to SRAM to flash, ceding the barely profitable DRAM market to their competitors).

If there's any threat to x86, it's ARM, and it's their business model that's a threat, not their ISA. And as for their ISA, ARM's biggest inroads into mobile and personal computing came with ARMv7 and earlier ISAs, which aren't really more RISC-like than x862. In the area in which they dominated, their "modern" RISC-y ISA, ARMv8, is hopeless and will continue to be hopeless for years, and they'll continue to dominate with their non-RISC ISAs.

In retrospect, the reason RISC chips looked so good in the 80s was that you could fit a complete high-performance RISC microprocessor onto a single chip, which wasn't true of x86 chips at the time. But as we got more transistors, this mattered less.

It's possible to nitpick RISC being a no by saying that modern processors translate x86 ops into RISC micro-ops internally, but if you listened to talk at the time, people thought that having an external RISC ISA would be so much lower overhead that RISC would win, which has clearly not happened. Moreover, modern chips also do micro-op fusion in order to fuse operations into decidedly un-RISC-y operations. A clean RISC ISA is a beautiful thing. I sometimes re-read Dick Sites's explanation of the Alpha design just to admire it, but it turns out beauty isn't really critical for the commercial success of an ISA.

Garbage collection

This is a huge Yes now. Every language that's become successful since 1999 has GC and is designed so that normal users rely on it to manage all memory. In five years, Rust or D might make that last sentence untrue, but even if that happens, GC will still be in the Yes category.

Reuse

Yes, I think, although I'm not 100% sure what Lampson was referring to here. Lampson said that reuse was a maybe because it sometimes works (for UNIX filters, OS, DB, browser) but was also flaky (for OLE/COM). There are now widely used substitutes for OLE; service oriented architectures also seem to fit his definition of re-use.

Looking at the No category, we have:

Capabilities

Yes. Widely used on mobile operating systems.

Fancy type systems

It depends on what qualifies as a fancy type system, but if “fancy” means something at least as fancy as Scala or Haskell, this is a No. That's even true if you relax the standard to an ML-like type system. Boy, would I love to be able to do everyday programming in an ML (F# seems particularly nice to me), but we're pretty far from that.

In 1999, C and C++ were mainstream, along with maybe Visual Basic and Pascal, with Java on the rise. And maybe Perl, but at the time most people thought of it as a scripting language, not something you'd use for "real" development. PHP, Python, Ruby, and JavaScript all existed, but were mostly used in small niches. Back then, Tcl was one of the most widely used scripting languages, and it wasn't exactly widely used. Now, PHP, Python, Ruby, and JavaScript are not only more mainstream than Tcl, but more mainstream than C and C++. C# is probably the only other language in the same league as those languages in terms of popularity, and Go looks like the only language that's growing fast enough to catch up in the foreseeable future. Since 1999, we have a bunch of dynamic languages, and a few languages with type systems that are specifically designed not to be fancy.

Maybe I'll get to use F# for non-hobby projects in another 16 years, but things don't look promising.

Functional programming

I'd lean towards Maybe on this one, although this is arguably a No. Functional languages are still quite niche, but functional programming ideas are now mainstream, at least for the HN/reddit/twitter crowd.

You might say that I'm being too generous to functional programming here because I have a soft spot for immutability. That's fair. In 1982, James Morris wrote:

Functional languages are unnatural to use; but so are knives and forks, diplomatic protocols, double-entry bookkeeping, and a host of other things modern civilization has found useful. Any discipline is unnatural, in that it takes a while to master, and can break down in extreme situations. That is no reason to reject a particular discipline. The important question is whether functional programming is unnatural the way Haiku is unnatural or the way Karate is unnatural.

Haiku is a rigid form of poetry in which each poem must have precisely three lines and seventeen syllables. As with poetry, writing a purely functional program often gives one a feeling of great aesthetic pleasure. It is often very enlightening to read or write such a program. These are undoubted benefits, but real programmers are more results-oriented and are not interested in laboring over a program that already works.

They will not accept a language discipline unless it can be used to write programs to solve problems the first time -- just as Karate is occasionally used to deal with real problems as they present themselves. A person who has learned the discipline of Karate finds it directly applicable even in bar-room brawls where no one else knows Karate. Can the same be said of the functional programmer in today's computing environments? No.

Many people would make the same case today. I don't agree, but that's a matter of opinion, not a matter of fact.

Formal methods

Maybe? Formal methods have had high impact in a few areas. Model checking is omnipresent in chip design. Microsoft's driver verification tool has probably had more impact than all formal chip design tools combined, clang now has a fair amount of static analysis built in, and so on and so forth. But, formal methods are still quite niche, and the vast majority of developers don't apply formal methods.

Software engineering

No. In 1995, David Parnas gave a talk at ICSE (the premier software engineering conference) about the fact that even the ICSE papers that won their “most influential paper award” (including two of Parnas's papers) had very little impact on industry.

Basically all of Parnas's criticisms are still true today. One of his suggestions, that there should be distinct conferences for researchers and for practitioners, has been taken up, but there's not much cross-pollination between academic conferences like ICSE and FSE and practitioner-focused conferences like StrangeLoop and PyCon.

RPC

Yes. In fact, RPCs are now so widely used that I've seen multiple “RPCs considered harmful” talks.

Distributed systems

Yes. These are so ubiquitous that startups with zero distributed systems expertise regularly use distributed systems provided by Amazon or Microsoft, and it's totally fine. The systems aren't perfect and there are some infamous downtime incidents, but if you compare the bit error rate of random storage from 1999 to something like EBS or Azure Blob Storage, distributed systems don't look so bad.

Security

Maybe? As with formal methods, a handful of projects with very high real world impact get a lot of mileage out of security research. But security still isn't a first class concern for most programmers.

Conclusion

What's worked in computer systems research?

Topic                     1999   2015
Virtual memory            Yes    Yes
Address spaces            Yes    Yes
Packet nets               Yes    Yes
Objects / subtypes        Yes    Yes
RDB and SQL               Yes    Yes
Transactions              Yes    Yes
Bitmaps and GUIs          Yes    Yes
Web                       Yes    Yes
Algorithms                Yes    Yes
Parallelism               Maybe  Maybe
RISC                      Maybe  No
Garbage collection        Maybe  Yes
Reuse                     Maybe  Yes
Capabilities              No     Yes
Fancy type systems        No     No
Functional programming    No     Maybe
Formal methods            No     Maybe
Software engineering      No     No
RPC                       No     Yes
Distributed computing     No     Yes
Security                  No     Maybe


Not only is every Yes from 1999 still Yes today, eight of the Maybes and Nos were upgraded, and only one was downgraded. And on top of that, there are a lot of topics like neural networks that weren't even worth adding to the list as a No that are an unambiguous Yes today.

In 1999, I was taking the SATs and applying to colleges. Today, I'm not really all that far into my career, and the landscape has changed substantially; many previously impractical academic topics are now widely used in industry. I probably have twice again as much time until the end of my career and things are changing faster now than they were in 1999. After reviewing Lampson's 1999 talk, I'm much more optimistic about research areas that haven't yielded much real-world impact (yet), like capability based computing and fancy type systems. It seems basically impossible to predict what areas will become valuable over the next thirty years.

Correction

This post originally had Capabilities as a No in 2015. In retrospect, I think that was a mistake and it should have been a Yes due to use on mobile.

Thanks to Seth Holloway, Leah Hanson, Ian Whitlock, Lindsey Kuper, Chris Ball, Steven McCarthy, Joe Wilder, David Wragg, Sophia Wisdom, and Alex Clemmer for comments/discussion.


  1. I know a fair number of folks who were relocated to Somerset from the east coast by IBM because they later ended up working at a company I worked at. It's interesting to me that software companies don't have the same kind of power over employees, and can't just insist that employees move to a new facility they're creating in some arbitrary location. [return]
  2. I once worked for a company that implemented both x86 and ARM decoders (I'm guessing it was the first company to do so for desktop class chips), and we found that our ARM decoder was physically larger and more complex than our x86 decoder. From talking to other people who've also implemented both ARM and x86 frontends, this doesn't seem to be unusual for high performance implementations. [return]

Infinite disk

2015-11-01 08:00:00

Hardware performance “obviously” affects software performance and affects how software is optimized. For example, the fact that caches are multiple orders of magnitude faster than RAM means that blocked array accesses give better performance than repeatedly striding through an array.
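To make the blocked-access idea concrete, here's a small sketch (in Python, which isn't what you'd use for this in practice) of the same column-sum computation written with a strided access pattern and with a blocked one. Interpreter overhead will swamp any cache effect here, so treat it as a picture of the access pattern rather than a benchmark; in a compiled language the blocked version is the one that keeps cache lines resident while they're being reused.

# Illustrative sketch (not a benchmark): the same column-by-column sum over a
# row-major matrix, written as a plain strided traversal and as a blocked
# ("tiled") traversal.

N = 512
BLOCK = 64  # hypothetical tile size; in practice you'd tune this to the cache
matrix = [[1.0] * N for _ in range(N)]  # row-major: matrix[row][col]

def strided_column_sums(m):
    # Walks down each column, striding across rows: every access lands on a
    # different cache line, so lines tend to be evicted before they're reused.
    sums = [0.0] * N
    for col in range(N):
        for row in range(N):
            sums[col] += m[row][col]
    return sums

def blocked_column_sums(m):
    # Visits the matrix in BLOCK x BLOCK tiles, so each cache line fetched for
    # a row is reused for a whole run of consecutive columns before moving on.
    sums = [0.0] * N
    for row0 in range(0, N, BLOCK):
        for col0 in range(0, N, BLOCK):
            for row in range(row0, min(row0 + BLOCK, N)):
                for col in range(col0, min(col0 + BLOCK, N)):
                    sums[col] += m[row][col]
    return sums

assert strided_column_sums(matrix) == blocked_column_sums(matrix)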

Something that's occasionally overlooked is that hardware performance also has profound implications for system design and architecture. Let's look at this table of latencies that's been passed around since 2012:

Operation                                Latency (ns)     (ms)
L1 cache reference                            0.5 ns
Branch mispredict                             5   ns
L2 cache reference                            7   ns
Mutex lock/unlock                            25   ns
Main memory reference                       100   ns
Compress 1K bytes with Zippy              3,000   ns
Send 1K bytes over 1 Gbps network        10,000   ns    0.01 ms
Read 4K randomly from SSD               150,000   ns    0.15 ms
Read 1 MB sequentially from memory      250,000   ns    0.25 ms
Round trip within same datacenter       500,000   ns    0.5  ms
Read 1 MB sequentially from SSD       1,000,000   ns    1    ms
Disk seek                            10,000,000   ns   10    ms
Read 1 MB sequentially from disk     20,000,000   ns   20    ms
Send packet CA->Netherlands->CA     150,000,000   ns  150    ms

Consider the latency of a disk seek (10ms) vs. the latency of a round-trip within the same datacenter (.5ms). The round-trip latency is so much lower than the seek time of a disk that we can disaggregate storage and distribute it anywhere in the datacenter without noticeable performance degradation, giving applications the appearance of having infinite disk space without any appreciable change in performance. This fact was behind the rise of distributed filesystems like GFS within the datacenter over the past two decades, and various network attached storage schemes long before.

However, doing the same thing on a 2012-era commodity network with SSDs doesn't work. The time to read a page on an SSD is 150us, vs. a 500us round-trip time on the network. That's still a noticeable performance improvement over spinning metal disk, but it's over 4x slower than local SSD.
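As a quick sanity check on those ratios, using the numbers from the latency table above (values in microseconds):

dc_round_trip_us = 500
disk_seek_us = 10_000
ssd_4k_read_us = 150

# Remote spinning disk: the network hop is noise next to the seek itself.
remote_disk = dc_round_trip_us + disk_seek_us        # 10,500 us
print(remote_disk / disk_seek_us)                    # ~1.05x local disk

# Remote 2012-era SSD: the network hop dominates the read.
remote_ssd = dc_round_trip_us + ssd_4k_read_us       # 650 us
print(remote_ssd / ssd_4k_read_us)                   # ~4.3x local SSD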

But here we are in 2015. Things have changed. Disks have gotten substantially faster. Enterprise NVRAM drives can do a 4k random read in around 15us, an order of magnitude faster than 2012 SSDs. Networks have improved even more. It's now relatively common to employ a low-latency user-mode networking stack, which drives round-trip latencies for a 4k transfer down to 10s of microseconds. That's fast enough to disaggregate SSD and give applications access to infinite SSD. It's not quite fast enough to disaggregate high-end NVRAM, but RDMA can handle that.

RDMA drives latencies down another order of magnitude, putting network latencies below NVRAM access latencies by enough that we can disaggregate NVRAM. Note that these numbers are for an unloaded network with no congestion -- these numbers will get substantially worse under load, but they're illustrative of what's possible. This isn't exactly new technology: HPC folks have been using RDMA over InfiniBand for years, but InfiniBand networks are expensive enough that they haven't seen a lot of uptake in datacenters. Something that's new in the past few years is the ability to run RDMA over Ethernet. This turns out to be non-trivial; both Microsoft and Google have papers in this year's SIGCOMM on how to do this without running into the numerous problems that occur when trying to scale this beyond a couple nodes. But it's possible, and we're approaching the point where companies that aren't ridiculously large are going to be able to deploy this technology at scale1.

However, while it's easy to say that we should use disaggregated disk because the ratio of network latency to disk latency has changed, it's not as easy as just taking any old system and throwing it on a fast network. If we take a 2005-era distributed filesystem or distributed database and throw it on top of a fast network, it won't really take advantage of the network. That 2005 system is going to have assumptions like the idea that it's fine for an operation to take 500ns, because how much can 500ns matter? But it matters a lot when your round-trip network latency is only a few times more than that, and applications written in a higher-latency era are often full of "careless" operations that burn hundreds of nanoseconds at a time. Worse yet, designs that are optimal at higher latencies create overhead as latency decreases. For example, with 1ms latency, adding local caching is a huge win and 2005-era high-performance distributed applications will often rely heavily on local caching. But when latency drops below 1us, the caching that was a huge win in 2005 is often not just pointless, but actually counter-productive overhead.

Latency hasn't just gone down in the datacenter. Today, I get about 2ms to 3ms latency to YouTube. YouTube, Netflix, and a lot of other services put a very large number of boxes close to consumers to provide high-bandwidth low-latency connections. A side effect of this is that any company that owns one of these services has the capability of providing consumers with infinite disk that's only slightly slower than normal disk. There are a variety of reasons this hasn't happened yet, but it's basically inevitable that this will eventually happen. If you look at what major cloud providers are paying for storage, their COGS of providing safely replicated storage is or will become lower than the retail cost to me of un-backed-up unreplicated local disk on my home machine.

It might seem odd that cloud storage can be cheaper than local storage, but large cloud vendors have a lot of leverage. The price for the median component they buy that isn't an Intel CPU or an Nvidia GPU is staggeringly low compared to the retail price. Furthermore, most people don't access the vast majority of their files most of the time, and if you look at the throughput of large HDs nowadays, it wouldn't even be possible to do so quickly. A typical consumer 3TB HD has an average throughput of 155MB/s, making the time to read the entire drive 3e12 / 155e6 seconds = 1.9e4 seconds = 5 hours and 22 minutes. And people don't even access their disks at all most of the time! And when they do, their access patterns result in much lower throughput than you get when reading the entire disk linearly. This means that the vast majority of disaggregated storage can live in cheap cold storage. For a neat example of this, the Balakrishnan et al. Pelican OSDI 2014 paper demonstrates that if you build out cold storage racks such that only 8% of the disks can be accessed at any given time, you can get a substantial cost savings. A tiny fraction of storage will have to live at the edge, for the same reason that a tiny fraction of YouTube videos are cached at the edge. In some sense, the economics are worse than for YouTube, since any particular chunk of data is very unlikely to be shared, but at the rate that edge compute/storage is scaling up, that's unlikely to be a serious objection in a decade.

The most common counter argument to disaggregated disk, both inside and outside of the datacenter, is bandwidth costs. But bandwidth costs have been declining exponentially for decades and continue to do so. Since 1995, we've seen datacenter NIC speeds go from 10Mb to 40Gb, with 50Gb and 100Gb just around the corner. This increase has been so rapid that, outside of huge companies, almost no one has re-architected their applications to properly take advantage of the available bandwidth. Most applications can't saturate a 10Gb NIC, let alone a 40Gb NIC. There's literally more bandwidth than people know what to do with. The situation outside the datacenter hasn't evolved quite as quickly, but even so, I'm paying $60/month for 100Mb, and if the trend of the last two decades continues, we should see another 50x increase in bandwidth per dollar over the next decade. It's not clear if the cost structure makes cloud-provided disaggregated disk for consumers viable today, but the current trends of implacably decreasing bandwidth cost mean that it's inevitable within the next five years.
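As a rough check on that 50x figure, using the NIC numbers above (10Mb in 1995 to 40Gb in 2015 is a 4000x increase over 20 years); consumer bandwidth per dollar hasn't grown quite as fast as datacenter NIC speeds, so treat this as an order-of-magnitude estimate, not a precise one:

growth_per_year = 4000 ** (1 / 20)      # ~1.51, i.e. roughly 50% per year
next_decade = growth_per_year ** 10     # ~63x if the same rate continues
print(growth_per_year, next_decade)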

One thing to be careful about is that just because we can disaggregate something, it doesn't mean that we should. There was a fascinating paper by Lim et al. at HPCA 2012 on disaggregated RAM where they build out disaggregated RAM by connecting RAM through the backplane. While we have the technology to do this, and it has the dual advantages of allowing us to provision RAM at a lower per-unit cost and of getting better utilization out of provisioned RAM, it doesn't seem to provide a performance-per-dollar savings at an acceptable level of performance, at least so far2.

The change in relative performance of different components causes fundamental changes in how applications should be designed. It's not sufficient to just profile our applications and eliminate the hot spots. To get good performance (or good performance per dollar), we sometimes have to step back, re-examine our assumptions, and rewrite our systems. There's a lot of talk about how hardware improvements are slowing down, which usually refers to improvements in CPU performance. That's true, but there are plenty of other areas that are undergoing rapid change, which means that applications that care about either performance or cost efficiency need to change. GPUs, hardware accelerators, storage, and networking are all evolving more rapidly than ever.

Update

Microsoft seems to disagree with me on this one. OneDrive has been moving in the opposite direction. They got rid of infinite disk, lowered quotas for non-infinite storage tiers, and changed their sync model in a way that makes this less natural. I spent maybe an hour writing this post. They probably have a team of Harvard MBAs who've spent 100x that much time discussing the move away from infinite disk. I wonder what I'm missing here. Average utilization was 5GB per user, which is practically free. A few users had a lot of data, but if someone uploads, say, 100TB, you can put most of that on tape. Access times on tape are glacial -- seconds for the arm to get the cartridge and put it in the right place, and tens of seconds to seek to the right place on the tape. But someone who uploads 100TB is basically using it as archival storage anyway, and you can mask most of that latency for the most common use cases (uploading libraries of movies or other media). If the first part of the file doesn't live on tape, and the user starts playing a movie that lives on tape, the movie can easily play for a couple minutes off of warmer storage while the tape access gets queued up. You might say that it's not worth it to spend the time it would take to build a system like that (perhaps two engineers working for six months), but you're already going to want a system that can mask the latency to disk-based cold storage for large files. Adding another tier on top of that isn't much additional work.

Update 2

It's happening. In April 2016, Dropbox announced that they're offering "Dropbox Infinite", which lets you access your entire Dropbox regardless of the amount of local disk you have available. The inevitable trend happened, although I'm a bit surprised that it wasn't Google that did it first since they have better edge infrastructure and almost certainly pay less for storage. In retrospect, maybe that's not surprising, though -- Google, Microsoft, and Amazon all treat providing user-friendly storage as a second class citizen, while Dropbox is all-in on user friendliness.

Thanks to Leah Hanson, bbrazil, Kamal Marhubi, mjn, Steve Reinhardt, Joe Wilder, and Jesse Luehrs for comments/corrections/additions that resulted in edits to this.


  1. If you notice that when you try to reproduce the Google result, you get instability, you're not alone. The paper leaves out the special sauce required to reproduce the result. [return]
  2. If your goal is to get better utilization, the poor man's solution today is to give applications access to unused RAM via RDMA on a best effort basis, in a way that's vaguely kinda sorta analogous to Google's Heracles work. You might say, wait a second: you could make that same argument for disk, but in fact the cheapest way to build out disk is to build out very dense storage blades full of disks, not to just use RDMA to access the normal disks attached to standard server blades; why shouldn't that be true for RAM? For an example of what it looks like when disks, I/O, and RAM are underprovisioned compared to CPUs, see this article where a Mozilla employee claims that it's fine to have 6% CPU utilization because those machines are busy doing I/O. Sure, it's fine, if you don't mind paying for CPUs you're not using instead of building out blades that have the correct ratio of disk to CPU, but those idle CPUs aren't free. [return]

    If the ratio of RAM to CPU we needed were analogous to the ratio of disk to CPU that we need, it might be cheaper to disaggregate RAM. But, while the need for RAM is growing faster than the need for compute, we're still not yet at the point where datacenters have a large number of cores sitting idle due to lack of RAM, the same way we would have cores sitting idle due to lack of disk if we used standard server blades for storage. A Xeon-EX can handle 1.5TB of RAM per socket. It's common to put two sockets in a 1/2U blade nowadays, and for the vast majority of workloads, it would be pretty unusual to try to cram more than 6TB of RAM into the 4 sockets you can comfortably fit into 1U.

    That being said, the issue of disaggregated RAM is still an open question, and some folks are a lot more confident about its near-term viability than others.

    [return]

Why Intel added cache partitioning

2015-10-04 08:00:00

Typical server utilization is between 10% and 50%. Google has demonstrated 90% utilization without impacting latency SLAs. Xkcd estimated that Google owns 2 million machines. If you estimate an amortized total cost of $4k per machine per year, that's $8 billion per year. With numbers like that, even small improvements have a large impact, and this isn't a small improvement.

How is it possible to get 2x to 9x better utilization on the same hardware? The low end of those typical utilization numbers comes from having a service with variable demand and fixed machine allocations. Say you have 100 machines dedicated to Jenkins. Those machines might be very busy when devs are active, but they might also have 2% utilization at 3am. Dynamic allocation (switching the machines to other work when they're not needed) can get a typical latency-sensitive service up to somewhere in the 30%-70% range. To do better than that across a wide variety of latency-sensitive workloads with tight SLAs, we need some way to schedule low priority work on the same machines, without affecting the latency of the high priority work.

It's not obvious that this is possible. If both high and low priority workloads need to monopolize some shared resources like the last-level cache (LLC), memory bandwidth, disk bandwidth, or network bandwidth, then we're out of luck. With the exception of some specialized services, it's rare to max out disk or network. But what about caches and memory? It turns out that Ferdman et al. looked at this back in 2012 and found that typical server workloads don't benefit from having more than 4MB - 6MB of LLC, despite modern server chips having much larger caches.

For this graph, scale-out workloads are things like distributed key-value stores, MapReduce-like computations, web search, web serving, etc. SPECint(mcf) is a traditional workstation benchmark. “server” is old-school server benchmarks like SPECweb and TPC. We can see that going from 4MB to 11MB of LLC has a small effect on typical datacenter workloads, but a significant effect on this traditional workstation benchmark.

Datacenter workloads operate on such large data sets that it's often impossible to fit the dataset in RAM on a single machine, let alone in cache, making a larger LLC not particularly useful. This result was confirmed by Kanev et al.'s ISCA 2015 paper where they looked at workloads at Google. They also showed that memory bandwidth utilization is, on average, quite low.

You might think that the low bandwidth utilization is because the workloads are compute bound and don't have many memory accesses. However, when the authors looked at what the cores were doing, they found that a lot of time was spent stalled, waiting for cache/memory.

Each row is a Google workload. When running these typical workloads, cores spend somewhere between 46% and 61% of their time blocked on cache/memory. It's curious that we have low cache hit rates, a lot of time stalled on cache/memory, and low bandwidth utilization. This is suggestive of workloads spending a lot of time waiting on memory accesses that have some kind of dependencies that prevent them from being executed independently.

LLCs for high-end server chips are between 12MB and 30MB, even though we only need 4MB to get 90% of the performance, and the 90%-ile utilization of bandwidth is 31%. This seems like a waste of resources. We have a lot of resources sitting idle, or not being used effectively. The good news is that, since we get such low utilization out of the shared resources on our chips, we should be able to schedule multiple tasks on one machine without degrading performance.

Great! What happens when we schedule multiple tasks on one machine? The Lo et al. Heracles paper at ISCA this year explores this in great detail. The goal of Heracles is to get better utilization on machines by co-locating multiple tasks on the same machine.

The figure above shows three latency sensitive (LC) workloads with strict SLAs. websearch is the query serving service in Google search, ml_cluster is real-time text clustering, and memkeyval is a key-value store analogous to memcached. The values are latencies as a percent of maximum allowed by the SLA. The columns indicate the load on the service, and the rows indicate different types of interference. LLC, DRAM, and Network are exactly what they sound like; custom tasks designed to compete only for that resource. HyperThread means that the interfering task is a spinloop running in the other hyperthread on the same core (running in the same hyperthread isn't even considered since OS context switches are too expensive). CPU power is a task that's designed to use a lot of power and induce thermal throttling. Brain is deep learning. All of the interference tasks are run in a container with low priority.

There's a lot going on in this figure, but we can immediately see that the best effort (BE) task we'd like to schedule can't co-exist with any of the LC tasks when only container priorities are used -- all of the brain rows are red, and even at low utilization (the leftmost columns), latency is way above 100% of the SLA latency. It's also clear that the different LC tasks have different profiles and can handle different types of interference. For example, websearch and ml_cluster are neither network nor compute intensive, so they can handle network and power interference well. However, since memkeyval is both network and compute intensive, it can't handle either network or power interference. The paper goes into a lot more detail about what you can infer from the details of the table. I find this to be one of the most interesting parts of the paper; I'm going to skip over it, but I recommend reading the paper if you're interested in this kind of thing.

A simplifying assumption the authors make is that these types of interference are basically independent. This means that independent mechanisms that isolate the LC task from “too much” of each individual type of resource should be sufficient to prevent overall interference. That is, we can set some cap for each type of resource usage, and just stay below each cap. However, this assumption isn't exactly true -- for example, the authors show this figure that relates LLC size to the number of cores allocated to an LC task.

The vertical axis is the max load the LC task can handle before violating its SLA when allocated some specific LLC and number of cores. We can see that it's possible to trade off cache vs cores, which means that we can actually go above a resource cap in one dimension and maintain our SLA by using less of another resource. In the general case, we might also be able to trade off other resources. However, the assumption that we can deal with each resource independently reduces a complex optimization problem to something that's relatively straightforward.

Now, let's look at each type of shared resource interference and how Heracles allocates resources to prevent SLA-violating interference.

Core

Pinning the LC and BE tasks to different cores is sufficient to prevent same-core context switching interference and hyperthreading interference. For this, Heracles used cpuset. Cpuset allows you to limit a process (and its children) to only run on a limited set of CPUs.
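As a minimal sketch of what core pinning looks like from code (Heracles used cgroup cpusets; sched_setaffinity, shown below, is a simpler Linux mechanism with a similar effect for a single process and its children, and the core numbers are made up for the example):

import os

LC_CORES = {0, 1, 2, 3, 4, 5}   # latency-critical task gets most of the cores
BE_CORES = {6, 7}               # best-effort task is confined to the rest

pid = os.fork()
if pid == 0:
    os.sched_setaffinity(0, BE_CORES)   # child: best-effort work only runs here
    # ... run the best-effort workload ...
    os._exit(0)
else:
    os.sched_setaffinity(0, LC_CORES)   # parent: latency-critical work
    # ... run the latency-critical workload ...
    os.waitpid(pid, 0)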

Network

On the local machines, Heracles used qdisc to enforce quotas. For more on cpuset, qdisc, and other quota/partitioning mechanisms, this LWN series on cgroups by Neil Brown is a good place to start. Cgroups are used by a lot of widely used software now (Docker, Kubernetes, Mesos, etc.); they're probably worth learning about even if you don't care about this particular application.

Power

Heracles uses Intel's running average power limit (RAPL) to estimate power. This is a feature on Sandy Bridge (2011) and newer processors that uses some on-chip monitoring hardware to estimate power usage fairly precisely. Per-core dynamic voltage and frequency scaling is used to limit power usage by specific cores to keep them from going over budget.
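For a rough idea of what reading those counters looks like, here's a sketch that reads the RAPL package energy counter through Linux's powercap sysfs interface and turns it into watts. Heracles read RAPL directly, so this is an illustration of the same underlying counters rather than their method, and the sysfs path is an assumption about how a given system exposes them:

import time

# Assumed path: package 0's cumulative energy counter, exposed by the
# intel_rapl driver via the powercap framework (may require root to read).
RAPL_ENERGY = "/sys/class/powercap/intel-rapl:0/energy_uj"

def read_energy_uj():
    with open(RAPL_ENERGY) as f:
        return int(f.read())

e0, t0 = read_energy_uj(), time.time()
time.sleep(1.0)
e1, t1 = read_energy_uj(), time.time()

# Microjoules -> joules, divided by elapsed time (ignoring counter wraparound).
watts = (e1 - e0) / 1e6 / (t1 - t0)
print(f"package 0 power: {watts:.1f} W")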

Cache

The previous isolation mechanisms have been around for a while, but this one is new to Broadwell chips (released in 2015). The problem here is that if the BE task needs 1MB of LLC and the LC task needs 4MB of LLC, a single large allocation from the BE task will scribble all over the LLC, which is shared, wiping out the 4MB of cached data the LC task needs.

Intel's “Cache Allocation Technology” (CAT) allows you to limit which parts of the LLC each core can allocate into. Since we often want to pin performance sensitive tasks to cores anyway, this allows us to divide up the cache on a per-task basis.
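On current Linux kernels (newer than this post), CAT is exposed through the resctrl filesystem, so a rough sketch of carving out a small slice of the LLC for a best-effort task looks something like the following; the mask, group name, and PID are made up for the example, and at the time of the Heracles work you would have programmed the MSRs directly:

import os

# Assumes a kernel with resctrl support, mounted via:
#   mount -t resctrl resctrl /sys/fs/resctrl
RESCTRL = "/sys/fs/resctrl"
group = os.path.join(RESCTRL, "best_effort")
os.makedirs(group, exist_ok=True)   # creating a directory creates a new group

# Restrict this group to 2 ways of L3 cache id 0: capacity bitmask 0b11.
# (Single-socket example; on some systems you may need to list every cache
# domain on the line.) The latency-critical work keeps the default group.
with open(os.path.join(group, "schemata"), "w") as f:
    f.write("L3:0=3\n")

# Move the best-effort task into the group; 12345 is a placeholder PID.
with open(os.path.join(group, "tasks"), "w") as f:
    f.write("12345\n")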

Intel's April 2015 whitepaper on what they call Cache Allocation Technology (CAT) has some simple benchmarks comparing CAT vs. no-CAT. In this example, they measure the latency to respond to PCIe interrupts while another application has heavy CPU-to-memory traffic, with CAT on and off.

Condition   Min    Max     Avg
no CAT      1.66   30.91   4.53
CAT         1.22   20.98   1.62

With CAT, average latency is 36% of latency without CAT. Tail latency doesn't improve as much, but there's also a substantial improvement there. That's interesting, but to me the more interesting question is how effective this is on real workloads, which we'll see when we put all of these mechanisms together.

Another use of CAT that I'm not going to discuss at all is to prevent timing attacks, like this attack, which can recover RSA keys across VMs via LLC interference.

DRAM bandwidth

Broadwell and newer Intel chips have memory bandwidth monitoring, but no control mechanism. To work around this, Heracles drops the number of cores allocated to the BE task if it's interfering with the LC task by using too much bandwidth. The coarse grained monitoring and control for this is inefficient in a number of ways that are detailed in the paper, but this still works despite the inefficiencies. However, having per-core bandwidth limiting would give better results with less effort.
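A toy version of that kind of coarse-grained control loop might look like the sketch below. This is not the actual Heracles controller; the bandwidth budget is made up, and measure_dram_bandwidth_gbps and set_be_cores stand in for platform-specific monitoring (e.g. the memory bandwidth counters) and cpuset updates:

import time

BW_LIMIT_GBPS = 40.0      # made-up per-socket budget
MAX_BE_CORES = 8

def control_loop(measure_dram_bandwidth_gbps, set_be_cores, interval_s=2.0):
    be_cores = MAX_BE_CORES
    while True:
        bw = measure_dram_bandwidth_gbps()
        if bw > BW_LIMIT_GBPS and be_cores > 0:
            be_cores -= 1                  # back off: BE task is interfering
        elif bw < 0.8 * BW_LIMIT_GBPS and be_cores < MAX_BE_CORES:
            be_cores += 1                  # slack: let the BE task grow again
        set_be_cores(be_cores)
        time.sleep(interval_s)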

Putting it all together

This graph shows the effective utilization of LC websearch with other BE tasks scheduled with enough slack that the SLA for websearch isn't violated.

From barroom conversations with folks at other companies, the baseline (in red) here already looks pretty good: 80% utilization during peak times with a 7 hour trough when utilization is below 50%. With Heracles, the worst case utilization is 80%, and the average is 90%. This is amazing.

Note that effective utilization can be greater than 100% since it's measured as throughput for the LC task on a single machine at 100% load plus throughput for the BE task on a single machine at 100% load. For example, if one task needs 100% of the DRAM bandwidth and 0% of the network bandwidth, and the other task needs the opposite, the two tasks would be able to co-locate on the same machine and achieve 200% effective utilization.

In the real world, we might “only” get 90% average utilization out of a system like Heracles. Recalling our operating cost estimate of $4 billion for a large company, if the company already had a quite-good average utilization of 75%, using a standard model for datacenter operating costs, we'd expect 15% more throughput per dollar, or $600 million in free compute. From talking to smaller companies that are on their way to becoming large (companies that spend in the range of $10 million to $100 million a year on compute), they often have utilization that's in the 20% range. Using the same total cost model again, they'd expect to get a 300% increase in compute per dollar, or $30 million to $300 million a year in free compute, depending on their size1.

Other observations

All of the papers we've looked at have a lot of interesting gems. I'm not going to go into all of them here, but there are a few that jumped out at me.

ARM / Atom servers

It's been known for a long time that datacenter machines spend approximately half their time stalled, waiting on memory. In addition, the average number of instructions per clock that server chips are able to execute on real workloads is quite low.

The top rows (with horizontal bars) are internal Google workloads and the bottom rows (with green dots) are workstation benchmarks from SPEC, a standard benchmark suite. We can see that Google workloads are lucky to average .5 instructions per clock. We also previously saw that these workloads cause cores to be stalled on memory at least half the time.

Despite spending most of their time waiting for memory and averaging something like half an instruction per clock cycle, high-end server chips do much better than Atom or ARM chips on real workloads (Reddi et al., ToCS 2011). This sounds a bit paradoxical -- if chips are just waiting on memory, why should you need a high-performance chip? A tiny ARM chip can wait just as effectively. In fact, it might even be better at waiting since having more cores waiting means it can use more bandwidth. But it turns out that servers also spend a lot of their time exploiting instruction-level parallelism, executing multiple instructions at the same time.

This is a graph of how many execution units are busy at the same time. Almost a third of the time is spent with 3+ execution units busy. In between long stalls waiting on memory, high-end chips are able to get more computation done and start waiting for the next stall earlier. Something else that's curious is that server workloads have much higher instruction cache miss rates than traditional workstation workloads.

Code and Data Prioritization Technology

Once again, the top rows (with horizontal bars) are internal Google workloads and the bottom rows (with green dots) are workstation benchmarks from SPEC, a standard benchmark suite. The authors attribute this increase in instruction misses to two factors. First, that it's normal to deploy large binaries (100MB) that overwhelm instruction caches. And second, that instructions have to compete with much larger data streams for space in the cache, which causes a lot of instructions to get evicted.

In order to address this problem, Intel introduced what they call “Code and Data Prioritization Technology” (CDP). This is an extension of CAT that allows cores to separately limit which subsets of the LLC instructions and data can occupy. Since it's targeted at the last-level cache, it doesn't directly address the graph above, which shows L2 cache miss rates. However, the cost of an L2 cache miss that hits in the LLC is something like 26ns on Broadwell vs. 86ns for an L2 miss that also misses the LLC and has to go to main memory, which is a substantial difference.

Kanev et al. propose going a step further and having a split icache/dcache hierarchy. This isn't exactly a radical idea -- L1 caches are already split, so why not everything else? My guess is that Intel and other major chip vendors have simulation results showing that this doesn't improve performance per dollar, but who knows? Maybe we'll see split L2 caches soon.

SPEC

A more general observation is that SPEC is basically irrelevant as a benchmark now. It's somewhat dated as a workstation benchmark, and completely inappropriate as a benchmark for servers, office machines, gaming machines, dumb terminals, laptops, and mobile devices2. The market for which SPEC is designed is getting smaller every year, and SPEC hasn't even been really representative of that market for at least a decade. And yet, among chip folks, it's still the most widely used benchmark around.

This is what a search query looks like at Google. A query comes in, a wide fanout set of RPCs are issued to a set of machines (the first row). Each of those machines also does a set of RPCs (the second row), those do more RPCs (the third row), and there's a fourth row that's not shown because the graph has so much going on that it looks like noise. This is one quite normal type of workload for a datacenter, and there's nothing in SPEC that looks like this.

There are a lot more fun tidbits in all of these papers, and I recommend reading them if you thought anything in this post was interesting. If you liked this post, you'll probably also like this talk by Dick Sites on various performance and profiling related topics, this post on Intel's new CLWB and PCOMMIT instructions, and this post on other "new" CPU features.

Thanks to Leah Hanson, David Kanter, Joe Wilder, Nico Erfurth, and Jason Davies for comments/corrections on this.


  1. I often hear people ask, why is company X so big? You could do that with 1/10th as many engineers! That's often not true. But even when it's true, it's usually the case that doing so would leave a lot of money on the table. As companies scale up, smaller and smaller optimizations are worthwhile. For a company with enough scale, something a small startup wouldn't spend 10 minutes on can pay itself back tenfold even if it takes a team of five people a year. [return]
  2. When I did my last set of interviews, I asked a number of mobile chip vendors how they measure things I care about on my phone, like responsiveness. Do they have a full end-to-end test with a fake finger and a camera that lets you see the actual response time to a click? Or maybe they have some tracing framework that can fake a click to see the response time? As far as I can tell, no one except Apple has a handle on this at all, which might explain why a two-generation-old iPhone smokes my state-of-the-art Android phone in actual tasks, even though the Android phone crushes workstation benchmarks like SPEC, and benchmarks of what people did in the early 80s, like Dhrystone (both of which are used by multiple mobile processor vendors). I don't know if I can convince anyone who doesn't already believe this, but choosing good benchmarks is extremely important. I use an Android phone because I got it for free. The next time I buy a phone, I'm buying one that does tasks I actually do quickly, not one that runs academic benchmarks well. [return]

Slowlock

2015-09-30 08:00:00

Every once in a while, you hear a story like “there was a case of a 1-Gbps NIC card on a machine that suddenly was transmitting only at 1 Kbps, which then caused a chain reaction upstream in such a way that the performance of the entire workload of a 100-node cluster was crawling at a snail's pace, effectively making the system unavailable for all practical purposes”. The stories are interesting and the postmortems are fun to read, but it's not really clear how vulnerable systems are to this kind of failure or how prevalent these failures are.

The situation reminds me of distributed systems failures before Jepsen. There are lots of anecdotal horror stories, but a common response to those is “works for me”, even when talking about systems that are now known to be fantastically broken. A handful of companies that are really serious about correctness have good tests and metrics, but they mostly don't talk about them publicly, and the general public has no easy way of figuring out if the systems they're running are sound.

Thanh Do et al. have tried to look at this systematically -- what's the effect of hardware that's been crippled but not killed, and how often does this happen in practice? It turns out that a lot of commonly used systems aren't robust against “limping” hardware, but that the incidence of these types of failures is rare (at least until you have unreasonably large scale).

The effect of a single slow node can be quite dramatic:

The effect of a single slow NIC on an entire cluster

The job completion rate slowed down from 172 jobs per hour to 1 job per hour, effectively killing the entire cluster. Facebook has mechanisms to deal with dead machines, but they apparently didn't have any way to deal with slow machines at the time.

When Do et al. looked at widely used open source software (HDFS, Hadoop, ZooKeeper, Cassandra, and HBase), they found similar problems.

Each subgraph is a different failure condition. F is HDFS, H is Hadoop, Z is Zookeeper, C is Cassandra, and B is HBase. The leftmost (white) bar is the baseline no-failure case. Going to the right, the next is a crash, and the subsequent bars are results for a single piece of hardware that's degraded but not crashed (further right means slower). In most (but not all) cases, having degraded hardware affected performance a lot more than having failed hardware. Note that these graphs are all log scale; going up one increment is a 10x difference in performance!

Curiously, a failed disk can cause some operations to speed up. That's because there are operations that have less replication overhead if a replica fails. It seems a bit weird to me that there isn't more overhead, because the system has to both find a replacement replica and replicate data, but what do I know?

Anyway, why is a slow node so much worse than a dead node? The authors define three failure modes and explain what causes each one. There's operation limplock, when an operation is slow because some subpart of the operation is slow (e.g., a disk read is slow because the disk is degraded), node limplock, when a node is slow even for seemingly unrelated operations (e.g., a read from RAM is slow because a disk is degraded), and cluster limplock, where the entire cluster is slow (e.g., a single degraded disk makes an entire 1000 machine cluster slow).

How do these happen?

Operation Limplock

This one is the simplest. If you try to read from disk, and your disk is slow, your disk read will be slow. In the real world, we'll see this when operations have a single point of failure, and when monitoring is designed to handle total failure and not degraded performance. For example, an HBase access to a region goes through the server responsible for that region. The data is replicated on HDFS, but this doesn't help you if the node that owns the data is limping. Speaking of HDFS, it has a 60s timeout and reads are in 64K chunks, which means your reads can slow down to almost 1KB/s before HDFS will fail over to a healthy node.
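The arithmetic behind that "almost 1KB/s" number is just the chunk size divided by the timeout:

chunk_bytes = 64 * 1024
timeout_s = 60
slowest_tolerated = chunk_bytes / timeout_s   # ~1092 bytes/s, i.e. about 1 KB/s
print(slowest_tolerated)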

Node Limplock

How can it be the case that (for example) a slow disk causes memory reads to be slow? Looking at HDFS again, it uses a thread pool. If every thread is busy very slowly completing a disk read, memory reads will block until a thread gets free.
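Here's a toy illustration of that failure mode (not HDFS's actual threading model): a shared pool whose workers are all tied up by slow "disk" requests, so a request that needs no disk at all still waits behind them.

import time
from concurrent.futures import ThreadPoolExecutor

def slow_disk_read():
    time.sleep(5)          # stand-in for a read from a limping disk
    return "disk data"

def memory_read():
    return "memory data"   # should be effectively instant

pool = ThreadPoolExecutor(max_workers=4)

# Fill every worker with slow disk reads...
disk_futures = [pool.submit(slow_disk_read) for _ in range(4)]

# ...then ask for something that needs no disk at all.
start = time.time()
mem_future = pool.submit(memory_read)
mem_future.result()
print(f"memory read took {time.time() - start:.1f}s")  # ~5s, not ~0s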

This isn't only an issue when using limited thread pools or other bounded abstractions -- the reality is that machines have finite resources, and unbounded abstractions will run into machine limits if they aren't carefully designed to avoid the possibility. For example, Zookeeper keeps a queue of operations, and a slow follower can cause the leader's queue to exhaust physical memory.

Cluster Limplock

An entire cluster can easily become unhealthy if it relies on a single primary and the primary is limping. Cascading failures can also cause this -- the first graph, where a cluster goes from completing 172 jobs an hour to 1 job an hour, is actually a Facebook workload on Hadoop. The thing that's surprising to me here is that Hadoop is supposed to be tail tolerant -- individual slow tasks aren't supposed to have a large impact on the completion of the entire job. So what happened? Unhealthy nodes infect healthy nodes and eventually lock up the whole cluster.

An unhealthy node infects an entire cluster

Hadoop's tail tolerance comes from kicking off speculative computation when results are coming in slowly. In particular, when stragglers come in unusually slowly compared to other results. This works fine when a reduce node is limping (subgraph H2), but when a map node limps (subgraph H1), it can slow down all reducers in the same job, which defeats Hadoop's tail-tolerance mechanisms.

A single bad map node effectively deadlocks hadoop

To see why, we have to look at Hadoop's speculation algorithm. Each task has a progress score, which is a number between 0 and 1 (inclusive). For a map, the score is the fraction of input data read. For a reduce, each of three phases (copying data from mappers, sorting, and reducing) gets 1/3 of the score. A speculative copy of a task will get run if the task has run for at least one minute and has a progress score that's less than the average for its category minus 0.2.

When a map node's NIC is limping, the map phase completes normally since results end up written to local disk. But when reduce nodes try to fetch data from the limping map node, they all stall, pulling down the average score for the category, which prevents speculative jobs from being run. Looking at the big picture, each Hadoop node has a limited number of map and reduce tasks. If those fill up with limping tasks, the entire node will lock up. Since Hadoop isn't designed to avoid cascading failures, this eventually causes the entire cluster to lock up.
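A toy version of that speculation rule makes the failure mode easy to see: when one straggler is slow among healthy peers it falls far below the category average and gets speculated, but when a limping mapper starves every reducer, the average sinks with them and nobody qualifies. The numbers below are made up:

def should_speculate(score, category_scores, runtime_s):
    avg = sum(category_scores) / len(category_scores)
    return runtime_s >= 60 and score < avg - 0.2

# Healthy case: one straggling reducer among mostly-finished peers.
healthy = [0.9, 0.95, 0.85, 0.9, 0.1]
print(should_speculate(0.1, healthy, runtime_s=300))   # True: gets speculated

# Limping mapper: every reducer is stuck early in the copy phase, so the
# category average is just as low and none of them fall 0.2 below it.
starved = [0.11, 0.12, 0.10, 0.11, 0.10]
print(should_speculate(0.10, starved, runtime_s=300))  # False: no speculation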

One thing I find interesting is that this exact cause of failures was described in the original MapReduce paper, published in 2004. They even explicitly called out slow disk and network as causes of stragglers, which motivated their speculative execution algorithm. However, they didn't provide the details of the algorithm. The open source clone of MapReduce, Hadoop, attempted to avoid the same problem. Hadoop was initially released in 2008. Five years later, when the paper we're reading was published, its built-in mechanism for straggler detection not only failed to prevent multiple types of stragglers, it also failed to prevent stragglers from effectively deadlocking the entire cluster.

Conclusion

I'm not going to go into details of how each system fared under testing. That's detailed quite nicely in the paper, which I recommend reading if you're curious. To summarize, Cassandra does quite well, whereas HDFS, Hadoop, and HBase don't.

Cassandra seems to do well for two reasons. First, this patch from 2009 prevents queue overflows from infecting healthy nodes, which prevents a major failure mode that causes cluster-wide failures in other systems. Second, the architecture used (SEDA) decouples different types of operations, which lets good operations continue to execute even when some operations are limping.

My big questions after reading this paper are: how often do these kinds of failures happen, how can we catch them before they go into production, and shouldn't reasonable metrics/reporting catch this sort of thing anyway?

For the answer to the first question, many of the same authors also have a paper where they looked at 3000 failures in Cassandra, Flume, HDFS, and ZooKeeper and determined which failures were hardware related and what the hardware failure was.

14 cases of degraded performance vs. 410 other hardware failures. In their sample, that's 3% of failures; rare, but not so rare that we can ignore the issue.

If we can't ignore these kinds of errors, how can we catch them before they go into production? The paper uses the Emulab testbed, which is really cool. Unfortunately, the Emulab page reads “Emulab is a public facility, available without charge to most researchers worldwide. If you are unsure if you qualify for use, please see our policies document, or ask us. If you think you qualify, you can apply to start a new project.”. That's understandable, but that means it's probably not a great solution for most of us.

The vast majority of limping hardware is due to network or disk slowness. Why couldn't a modified version of Jepsen, or something like it, simulate disk or network slowness? A naive implementation wouldn't get anywhere near the precision of Emulab, but since we're talking about order-of-magnitude slowdowns, having 10% (or even 2x) variance should be ok for testing the robustness of systems against degraded hardware. There are a number of ways you could imagine that working. For example, to simulate a slow network on Linux, you could try throttling via qdisc, hooking syscalls via ptrace, etc. For a slow CPU, you can rate-limit via cgroups and cpu.shares, or just map the process to UC memory (or maybe WT or WC if that's a bit too slow), and so on and so forth for disk and other failure modes.
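As a sketch of how simple the network half of this could be, here's a toy TCP proxy that injects a fixed delay into every chunk it forwards; point a client at the proxy instead of the real service and you have a crude "limping NIC". The addresses and delay are placeholders, and this is nowhere near Emulab-grade precision, but for order-of-magnitude slowdowns it doesn't need to be:

import socket
import threading
import time

LISTEN = ("127.0.0.1", 9000)       # clients connect here
UPSTREAM = ("127.0.0.1", 9001)     # the real service (placeholder address)
DELAY_S = 0.2                      # injected per-chunk delay

def pump(src, dst):
    # Copy bytes in one direction, sleeping before each forwarded chunk.
    while True:
        data = src.recv(4096)
        if not data:
            break
        time.sleep(DELAY_S)        # the "limping" part
        dst.sendall(data)

def handle(client):
    upstream = socket.create_connection(UPSTREAM)
    threading.Thread(target=pump, args=(client, upstream), daemon=True).start()
    threading.Thread(target=pump, args=(upstream, client), daemon=True).start()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(LISTEN)
server.listen()
while True:
    conn, _ = server.accept()
    handle(conn)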

That leaves my last question, shouldn't systems already catch these sorts of failures even if they're not concerned about them in particular? As we saw above, systems with cripplingly slow hardware are rare enough that we can just treat them as dead without significantly impacting our total compute resources. And systems with crippled hardware can be detected pretty straightforwardly. Moreover, multi-tenant systems have to do continuous monitoring of their own performance to get good utilization anyway.

So why should we care about designing systems that are robust against limping hardware? One part of the answer is defense in depth. Of course we should have monitoring, but we should also have systems that are robust when our monitoring fails, as it inevitably will. Another part of the answer is that by making systems more tolerant to limping hardware, we'll also make them more tolerant to interference from other workloads in a multi-tenant environment. That last bit is a somewhat speculative empirical question -- it's possible that it's more efficient to design systems that aren't particularly robust against interference from competing work on the same machine, while using better partitioning to avoid interference.



Thanks to Leah Hanson, Hari Angepat, Laura Lindzey, Julia Evans, and James M. Lee for comments/discussion.

Steve Yegge's prediction record

2015-08-31 08:00:00

I try to avoid making predictions1. It's a no-win proposition: if you're right, hindsight bias makes it look like you're pointing out the obvious. And most predictions are wrong. Every once in a while, someone does a review of predictions from pundits, and the pundits turn out to be wrong at least as often as you'd expect from random chance; then hindsight bias makes each prediction look hilariously bad.

But, occasionally, you run into someone who makes pretty solid non-obvious predictions. I was re-reading some of Steve Yegge's old stuff and it turns out that he's one of those people.

His most famous prediction is probably the rise of JavaScript. This now seems incredibly obvious in hindsight, so much so that the future laid out in Gary Bernhardt's Birth and Death of JavaScript seems at least a little plausible. But you can see how non-obvious Steve's prediction was at the time by reading both the comments on his blog, and comments from HN, reddit, and the other usual suspects.

Steve was also crazy-brave enough to post ten predictions about the future in 2004. He says “Most of them are probably wrong. The point of the exercise is the exercise itself, not in what results.”, but the predictions are actually pretty reasonable.

Prediction #1: XML databases will surpass relational databases in popularity by 2011

2011 might have been slightly too early and JSON isn't exactly XML, but NoSQL databases have done really well for pretty much the reason given in the prediction, “Nobody likes to do O/R mapping; everyone just wants a solution.”. Sure, Mongo may lose your data, but it's easy to set up and use.

Prediction #2: Someone will make a lot of money by hosting open-source web applications

This depends on what you mean by “a lot”, but this seems basically correct.

We're rapidly entering the age of hosted web services, and big companies are taking advantage of their scalable infrastructure to host data and computing for companies without that expertise.

For reasons that seem baffling in retrospect, Amazon understood this long before any of its major competitors and was able to get a huge head start on everybody else. Azure didn't get started until 2009, and Google didn't get serious about public cloud hosting until even later.

Now that everyone's realized what Steve predicted in 2004, it seems like every company is trying to spin up a public cloud offering, but the market is really competitive and hiring has become extremely difficult. Despite giving out a large number of offers at an integer multiple of market rates, Alibaba still hasn't managed to put together a team that's been able to assemble a competitive public cloud, and companies that are trying to get into the game now without as much cash to burn as Alibaba are having an even harder time.

For both bug databases and source-control systems, the obstacle to outsourcing them is trust. I think most companies would love it if they didn't have to pay someone to administer Bugzilla, Subversion, Twiki, etc. Heck, they'd probably like someone to outsource their email, too.

A lot of companies have moved both issue tracking and source-control to GitHub or one of its competitors, and even more have moved if you just count source-control. Hosting your own email is also a thing of the past for all but the most paranoid (or most bogged down in legal compliance issues).

Prediction #3: Multi-threaded programming will fall out of favor by 2012

Hard to say if this is right or not. Depends on who you ask. This seems basically right for applications that don't need the absolute best levels of performance, though.

In the past, oh, 20 years since they invented threads, lots of new, safer models have arrived on the scene. Since 98% of programmers consider safety to be unmanly, the alternative models (e.g. CSP, fork/join tasks and lightweight threads, coroutines, Erlang-style message-passing, and other event-based programming models) have largely been ignored by the masses, including me.

Shared memory concurrency is still where it's at for really high performance programs, but Go has popularized CSP; actors and futures are both “popular” on the JVM; etc.

Prediction #4: Java's "market share" on the JVM will drop below 50% by 2010

I don't think this was right in 2010, or even now, although we're moving in the right direction. There's a massive amount of dark matter -- programmers who do business logic and don't blog or give talks -- that makes this prediction unlikely to come true in the near future.

It's impossible to accurately measure market share, but basically every language ranking you can find will put Java in the top 3, with Scala and Clojure not even in the top 10. Given the near power-law distribution of measured language usage, Java must still be above 90% share (and that's probably a gross underestimate).

Prediction #5: Lisp will be in the top 10 most popular programming languages by 2010

Not even close. Depending on how you measure this, Clojure might be in the top 20 (it is if you believe the Redmonk rankings), but it's hard to see it making it into the top 10 in this decade. As with the previous prediction, there's just way too much inertia here. Breaking into the top 10 means joining the ranks of Java, JS, PHP, Python, Ruby, C, C++, and C#. Clojure just isn't boring enough. C# was able to sneak in by pretending to be boring, but Clojure's got no hope of doing that and there isn't really another Dylan on the horizon.

Prediction #6: A new internet community-hangout will appear. One that you and I will frequent

This seems basically right, at least for most values of “you”.

Wikis, newsgroups, mailing lists, bulletin boards, forums, commentable blogs — they're all bullshit. Home pages are bullshit. People want to socialize, and create content, and compete lightly with each other at different things, and learn things, and be entertained: all in the same place, all from their couch. Whoever solves this — i.e. whoever creates AOL for real people, or whatever the heck this thing turns out to be — is going to be really, really rich.

Facebook was founded the year that was written. Zuckerberg is indeed really, really rich.

Prediction #7: The mobile/wireless/handheld market is still at least 5 years out

Five years from Steve's prediction would have been 2009. Although the iPhone was released in 2007, it was a while before sales really took off. In 2009, the majority of phones were feature phones, and Android was barely off the ground.

Symbian is in the lead until Q4 2010!

Note that this graph only runs until 2013; if you graph things up to 2015 on a linear scale, sales are so low in 2009 that you basically can't even see what's going on.

Prediction #8: Someday I will voluntarily pay Google for one of their services

It's hard to tell if this is correct (Steve, feel free to let me know), but it seems true in spirit. Google has more and more services that they charge for, and they're even experimenting with letting people pay to avoid seeing ads.

Prediction #9: Apple's laptop sales will exceed those of HP/Compaq, IBM, Dell and Gateway combined by 2010

If you include tablets, Apple hit #1 in the market by 2010, but I don't think they do better than all of the old workhorses combined. Again, this seems to underestimate the effect of dark matter, in this case, people buying laptops for boring reasons, e.g., corporate buyers and normal folks who want something under Apple's price range.

Prediction #10: In five years' time, most programmers will still be average

More of a throwaway witticism than a prediction, but sure.

That's a pretty good set of predictions for 2004. With the exception of the bit about Lisp, all of the predictions seem directionally correct; the misses are mostly caused by underestimating the sheer amount of inertia a young/new solution has to overcome before it can take over.

Steve also has a number of posts that aren't explicitly about predictions that, nevertheless, make pretty solid predictions about how things are today, written way back in 2004. There's It's Not Software, which was years ahead of its time about how people write “software”, how writing server apps is really different from writing shrinkwrap software in a way that obsoletes a lot of previously solid advice, like Joel's dictum against rewrites, as well as how service oriented architectures look; the Google at Delphi (again from 2004) correctly predicts the importance of ML and AI as well as Google's very heavy investment in ML; an old interview where he predicts "web application programming is gradually going to become the most important client-side programming out there. I think it will mostly obsolete all other client-side toolkits: GTK, Java Swing/SWT, Qt, and of course all the platform-specific ones like Cocoa and Win32/MFC/"; etc. A number of Steve's internal Google blog posts also make interesting predictions, but AFAIK those are confidential. Of course all these things seem obvious in retrospect, but that's just part of Steve's plan to pass as a normal human being.

In a relatively recent post, Steve throws Jeff Bezos under the bus, exposing him as one of a number of “hyper-intelligent aliens with a tangential interest in human affairs”. While the crowd focuses on Jeff, Steve is able to sneak out the back. But we're onto you, Steve.

Thanks to Leah Hanson, Chris Ball, Mindy Preston, and Paul Gross for comments/corrections.


  1. When asked about a past prediction of his, Peter Thiel commented that writing is dangerous and mentioned that a professor once told him that writing a book is more dangerous than having a child -- you can always disown a child, but there's nothing you can do to disown a book.

    The only prediction I can recall publicly making is that I've been on the record for at least five years saying that, despite the hype, ARM isn't going to completely crush Intel in the near future, but that seems so obvious that it's not even worth calling it a prediction. Then again, this was a minority opinion up until pretty recently, so maybe it's not that obvious.

    I've also correctly predicted the failure of a number of chip startups, but since the vast majority of startups fail, that's expected. Predicting successes is much more interesting, and my record there is decidedly mixed. Based purely on who was involved, I thought that SiByte, Alchemy, and PA Semi were good bets. Of those, SiByte was a solid success, Alchemy didn't work out, and PA Semi was maybe break-even.

    [return]

Reading postmortems

2015-08-20 08:00:00

I love reading postmortems. They're educational, but unlike most educational docs, they tell an entertaining story. I've spent a decent chunk of time reading postmortems at both Google and Microsoft. I haven't done any kind of formal analysis on the most common causes of bad failures (yet), but there are a handful of postmortem patterns that I keep seeing over and over again.

Error Handling

Proper error handling code is hard. Bugs in error handling code are a major cause of bad problems. This means that the probability of having sequential bugs, where an error causes buggy error handling code to run, isn't just the independent probabilities of the individual errors multiplied. It's common to have cascading failures cause a serious outage. There's a sense in which this is obvious -- error handling is generally regarded as being hard. If I mention this to people they'll tell me how obvious it is that a disproportionate number of serious postmortems come out of bad error handling and cascading failures where errors are repeatedly not handled correctly. But despite this being “obvious”, it's not so obvious that sufficient test and static analysis effort are devoted to making sure that error handling works.

For more on this, Ding Yuan et al. have a great paper and talk: Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems. The paper is basically what it says on the tin. The authors define a critical failure as something that can take down a whole cluster or cause data corruption, and then look at a couple hundred bugs in Cassandra, HBase, HDFS, MapReduce, and Redis, to find 48 critical failures. They then look at the causes of those failures and find that most bugs were due to bad error handling. 92% of those failures are actually from errors that are handled incorrectly.

Graphic of previous paragraph

Drilling down further, 25% of bugs are from simply ignoring an error, 8% are from catching the wrong exception, 2% are from incomplete TODOs, and another 23% are "easily detectable", which are defined as cases where “the error handling logic of a non-fatal error was so wrong that any statement coverage testing or more careful code reviews by the developers would have caught the bugs”. By the way, this is one reason I don't mind Go style error handling, despite the common complaint that the error checking code is cluttering up the main code path. If you care about building robust systems, the error checking code is the main code!
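
To make that concrete, here's a minimal sketch of what "the error checking code is the main code" looks like in Go. The function and paths are hypothetical, but the shape is typical: most of the lines are checks, and each one is the difference between reporting a failure and silently pretending a write succeeded.

package main

import (
	"fmt"
	"os"
)

// writeCheckpoint writes data to path and only reports success if the bytes
// actually made it to disk. Most of the lines are error handling; that's the
// code that matters when something goes wrong.
func writeCheckpoint(path string, data []byte) error {
	f, err := os.CreateTemp("", "checkpoint-*")
	if err != nil {
		return fmt.Errorf("create temp file: %w", err)
	}
	tmp := f.Name()
	defer os.Remove(tmp) // best-effort cleanup if we bail out early

	if _, err := f.Write(data); err != nil {
		f.Close()
		return fmt.Errorf("write %s: %w", tmp, err)
	}
	// An ignored Sync or Close error is a classic source of silent corruption.
	if err := f.Sync(); err != nil {
		f.Close()
		return fmt.Errorf("sync %s: %w", tmp, err)
	}
	if err := f.Close(); err != nil {
		return fmt.Errorf("close %s: %w", tmp, err)
	}
	if err := os.Rename(tmp, path); err != nil {
		return fmt.Errorf("rename %s to %s: %w", tmp, path, err)
	}
	return nil
}

func main() {
	if err := writeCheckpoint("/tmp/example-checkpoint", []byte("state")); err != nil {
		fmt.Fprintln(os.Stderr, "checkpoint failed:", err)
		os.Exit(1)
	}
}

Deleting any one of those checks doesn't change what happens on the happy path, which is why bugs like this sail through tests that never exercise the error paths.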

The full paper has a lot of gems that I mostly won't describe here. For example, they explain the unreasonable effectiveness of Jepsen (98% of critical failures can be reproduced in a 3 node cluster). They also dig into what percentage of failures are non-deterministic (26% of their sample), as well as the causes of non-determinism, and create a static analysis tool that can catch many common error-caused failures.

Configuration

Configuration bugs, not code bugs, are the most common cause I've seen of really bad outages. When I looked at publicly available postmortems, searching for “global outage postmortem” returned results where about 50% of the outages were caused by configuration changes. Publicly available postmortems aren't a representative sample of all outages, but a random sampling of postmortem databases also reveals that config changes are responsible for a disproportionate fraction of extremely bad outages. As with error handling, I'm often told that it's obvious that config changes are scary, but it's apparently not obvious enough to get most companies to test and stage config changes the way they test and stage code changes.

Except in extreme emergencies, risky code changes are basically never simultaneously pushed out to all machines because of the risk of taking down a service company-wide. But it seems that every company has to learn the hard way that seemingly benign config changes can also cause a company-wide service outage. For example, this was the cause of the infamous November 2014 Azure outage. I don't mean to pick on MS here; their major competitors have also had serious outages for similar reasons, and they've all put processes into place to reduce the risk of that sort of outage happening again.
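
As a concrete illustration of what staging a config change can look like, here's a hypothetical sketch (not any particular company's system) that ramps a new config value out to a deterministic percentage of hosts instead of flipping it everywhere at once:

package main

import (
	"fmt"
	"hash/fnv"
)

// rolloutBucket deterministically assigns a host to a bucket in [0, 100).
func rolloutBucket(hostname, configKey string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(hostname + "/" + configKey))
	return h.Sum32() % 100
}

// effectiveValue returns newValue only for hosts inside the rollout
// percentage; everyone else keeps the old, known-good value.
func effectiveValue(hostname, configKey, oldValue, newValue string, rolloutPercent uint32) string {
	if rolloutBucket(hostname, configKey) < rolloutPercent {
		return newValue
	}
	return oldValue
}

func main() {
	// Ramp: 1% -> 10% -> 50% -> 100%, with monitoring between each step.
	for _, pct := range []uint32{1, 10, 50, 100} {
		n := 0
		for i := 0; i < 1000; i++ {
			host := fmt.Sprintf("web-%04d", i)
			if effectiveValue(host, "max_conns", "100", "1000", pct) == "1000" {
				n++
			}
		}
		fmt.Printf("at %3d%% rollout, %4d/1000 hosts see the new value\n", pct, n)
	}
}

The exact mechanism matters much less than the property that a bad config value hits 1% of hosts, and your monitoring, before it hits 100% of them.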

I don't mean to pick on large cloud companies, either. If anything, the situation there is better than at most startups, even very well funded ones. Most of the “unicorn” startups that I know of don't have a proper testing/staging environment that lets them test risky config changes. I can understand why -- it's often hard to set up a good QA environment that mirrors prod well enough that config changes can get tested, and like driving without a seatbelt, nothing bad happens the vast majority of the time. If I had to make my own seatbelt before driving my car, I might not drive with a seatbelt either. Then again, if driving without a seatbelt were as scary as making config changes, I might consider it.

Back in 1985, Jim Gray observed that "operator actions, system configuration, and system maintenance was the main source of failures -- 42%". Since then, there have been a variety of studies that have found similar results. For example, Rabkin and Katz found the following causes for failures:

Causes in decreasing order: misconfig, bug, operational, system, user, install, hardware

Hardware

Basically every part of a machine can fail. Many components can also cause data corruption, often at rates that are much higher than advertised. For example, Schroeder, Pinheiro, and Weber found DRAM error rates were more than an order of magnitude worse than advertised. The number of silent errors is staggering, and this actually caused problems for Google back before they switched to ECC RAM. Even with error detecting hardware, things can go wrong; relying on Ethernet checksums to protect against errors is unsafe and I've personally seen malformed packets get passed through as valid packets. At scale, you can run into more undetected errors than you expect, if you expect hardware checks to catch hardware data corruption.

Failover from bad components can also fail. This AWS failure tells a typical story. Despite taking reasonable sounding measures to regularly test the generator power failover process, a substantial fraction of AWS East went down when a storm took out power and a set of backup generators failed to correctly provide power when loaded.

Humans

This section should probably be called process error and not human error since I consider having humans in a position where they can accidentally cause a catastrophic failure to be a process bug. It's generally accepted that, if you're running large scale systems, you have to have systems that are robust to hardware failures. If you do the math on how often machines die, it's obvious that systems that aren't robust to hardware failure cannot be reliable. But humans are even more error prone than machines. Don't get me wrong, I like humans. Some of my best friends are human. But if you repeatedly put a human in a position where they can cause a catastrophic failure, you'll eventually get a catastrophe. And yet, the following pattern is still quite common:

Oh, we're about to do a risky thing! Ok, let's have humans be VERY CAREFUL about executing the risky operation. Oops! We now have a global outage.

Postmortems that start with “Because this was a high risk operation, foobar high risk protocol was used” are ubiquitous enough that I now think of extra human-operated steps that are done to mitigate human risk as an ops smell. Some common protocols are having multiple people watch or confirm the operation, or having ops people standing by in case of disaster. Those are reasonable things to do, and they mitigate risk to some extent, but in many postmortems I've read, automation could have reduced the risk a lot more or removed it entirely. There are a lot of cases where the outage happened because a human was expected to flawlessly execute a series of instructions and failed to do so. That's exactly the kind of thing that programs are good at! In other cases, a human is expected to perform manual error checking. That's sometimes harder to automate, and a less obvious win (since a human might catch an error case that the program misses), but in most cases I've seen it's still a net win to automate that sort of thing.

Causes in decreasing order: human error, system failure, out of IPs, natural disaster

In an IDC survey, respondents voted human error as the most troublesome cause of problems in the datacenter.

One thing I find interesting is how underrepresented human error seems to be in public postmortems. As far as I can tell, Google and MS both have substantially more automation than most companies, so I'd expect their postmortem databases to contain proportionally fewer human error caused outages than I see in public postmortems, but in fact it's the opposite. My guess is that's because companies are less likely to write up public postmortems when the root cause was human error enabled by risky manual procedures. A prima facie plausible alternate reason is that improved technology actually increases the fraction of problems caused by humans, which is true in some industries, like flying. I suspect that's not the case here due to the sheer number of manual operations done at a lot of companies, but there's no way to tell for sure without getting access to the postmortem databases at multiple companies. If any company wants to enable this analysis (and others) to be done (possibly anonymized), please get in touch.

Monitoring / Alerting

The lack of proper monitoring is never the sole cause of a problem, but it's often a serious contributing factor. As is the case for human errors, these seem underrepresented in public postmortems. When I talk to folks at other companies about their worst near disasters, a large fraction of them come from not having the right sort of alerting set up. They're often saved from having a disaster bad enough to require a public postmortem by some sort of ops heroism, but heroism isn't a scalable solution.

Sometimes, those near disasters are caused by subtle coding bugs, which is understandable. But more often, it's due to blatant process bugs, like not having a clear escalation path for an entire class of failures, causing the wrong team to debug an issue for half a day, or not having a backup on-call, causing a system to lose or corrupt data for hours before anyone notices when (inevitably) the on-call person doesn't notice that something's going wrong.

The Northeast blackout of 2003 is a great example of this. It could have been a minor outage, or even just a minor service degradation, but (among other things) a series of missed alerts caused it to become one of the worst power outages ever.

Not a Conclusion

This is where the conclusion's supposed to be, but I'd really like to do some serious data analysis before writing some kind of conclusion or call to action. What should I look for? What other major classes of common errors should I consider? These aren't rhetorical questions and I'm genuinely interested in hearing about other categories I should think about. Feel free to ping me here. I'm also trying to collect public postmortems here.

One day, I'll get around to the serious analysis, but even without going through and classifying thousands of postmortems, I'll probably do a few things differently as a result of having read a bunch of these. I'll spend relatively more time during my code reviews on errors and error handling code, and relatively less time on the happy path. I'll also spend more time checking for and trying to convince people to fix “obvious” process bugs.

One of the things I find curious about these failure modes is that when I talked with other folks about what I found, at least one person told me that each process issue I found was obvious. But these “obvious” things still cause a lot of failures. In one case, someone told me that what I was telling them was obvious at pretty much the same time their company was having a global outage of a multi-billion dollar service, caused by the exact thing we were talking about. Just because something is obvious doesn't mean it's being done.

Elsewhere

Richard Cook's How Complex Systems Fail takes a more general approach; his work inspired The Checklist Manifesto, which has saved lives.

Allspaw and Robbins's Web Operations: Keeping the Data on Time talks about this sort of thing in the context of web apps. Allspaw also has a nice post about some related literature from other fields.

In areas that are a bit closer to what I'm used to, there's a long history of studying the causes of failures. Some highlights include Jim Gray's Why Do Computers Stop and What Can Be Done About It? (1985), Oppenheimer et al.'s Why Do Internet Services Fail, and What Can Be Done About It? (2003), Nagaraja et al.'s Understanding and Dealing with Operator Mistakes in Internet Services (2004), part of Barroso et al.'s The Datacenter as a Computer (2009), Rabkin and Katz's How Hadoop Clusters Break (2013), and Xu et al.'s Do Not Blame Users for Misconfigurations.

There's also a long history of trying to understand aircraft reliability, and the story of how processes have changed over the decades is fascinating, although I'm not sure how to generalize those lessons.

Just as an aside, I find it interesting how hard it's been to eke out extra uptime and reliability. In 1974, Ritchie and Thompson wrote about a system "costing as little as $40,000" with 98% uptime. A decade later, Jim Gray uses 99.6% uptime as a reasonably good benchmark. We can do much better than that now, but the level of complexity required to do it is staggering.

Acknowledgments

Thanks to Leah Hanson, Anonymous, Marek Majkowski, Nat Welch, Joe Wilder, and Julia Hansbrough for providing comments on a draft of this. Anonymous, if you prefer to not be anonymous, send me a message on Zulip. For anyone keeping score, that's three folks from Google, one person from Cloudflare, and one anonymous commenter. I'm always open to comments/criticism, but I'd be especially interested in comments from folks who work at companies with less scale. Do my impressions generalize?

Thanks to gwern and Dan Reif for taking me up on this and finding some bugs in this post.

Slashdot and Sourceforge

2015-05-31 08:00:00

If you've followed any tech news aggregator in the past week (the week of the 24th of May, 2015), you've probably seen the story about how SourceForge is taking over admin accounts for existing projects and injecting adware in installers for packages like GIMP. For anyone not following the story, SourceForge has a long history of adware laden installers, but they used to be opt-in. It appears that the process is now mandatory for many projects.

People have been wary of SourceForge ever since they added a feature to allow projects to opt-in to adware bundling, but you could at least claim that projects are doing it by choice. But now that SourceForge is clearly being malicious, they've wiped out all of the user trust that was built up over sixteen years of operating. No clueful person is going to ever download something from SourceForge again. If search engines start penalizing SourceForge for distributing adware, they won't even get traffic from people who haven't seen this story, wiping out basically all of their value.

Whenever I hear about a story like this, I'm amazed at how quickly it's possible to destroy user trust, and how much easier it is to destroy a brand than to create one. In that vein, it's funny to see Slashdot (which is owned by the same company as SourceForge) also attempting to destroy their own brand. They're the only major tech news aggregator which hasn't had a story on this, and that's because they've buried every story that someone submits. This has prompted people to start submitting comments about this on other stories.

A highly upvoted comment about SourceForge on Slashdot

I find this to be pretty incredible. How is it possible that someone, somewhere, thinks that censoring SourceForge's adware bundling on Slashdot is a net positive for Slashdot Media, the holding company that owns Slashdot and SourceForge? A quick search on either Google or Google News shows that the story has already made it to a number of major tech publications, making the value of suppressing the story nearly zero in the best case. And in the worst case, this censorship will create another Digg moment1, where readers stop trusting the moderators and move on to sites that aren't as heavily censored. There's basically no upside here and a substantial downside risk.

I can see why DHI, the holding company that owns Slashdot Media, would want to do something. Their last earnings report indicated that Slashdot Media isn't doing well, and the last thing they need is bad publicity driving people away from Slashdot:

Corporate & Other segment revenues decreased 6% to $4.5 million for the quarter ended March 31, 2015, reflecting a decline in certain revenue streams at Slashdot Media.

Compare that to their post-acquisition revenue from Q4 2012, which is the first quarter after DHI purchased Slashdot Media:

Revenues totaled $52.7 . . . including $4.7 million from the Slashdot Media acquisition

“Corporate & Other” seems to encompass more than just Slashdot Media. And despite that, as well as milking SourceForge for all of the short-term revenue they can get, all of “Corporate & Other” is doing worse than Slashdot Media alone in 20122. Their original stated plan for SourceForge and Slashdot was "to keep them pretty much the same as they are [because we] are very sensitive to not disrupting how users use them . . .", but it didn't take long for them to realize that wasn't working; here's a snippet from their 2013 earnings report:

advertising revenue has declined over the past year and there is no improvement expected in the future financial performance of Slashdot Media's underlying advertising business. Therefore, $7.2 million of intangible assets and $6.3 million of goodwill related to Slashdot Media were reduced to zero.

I believe it was shortly afterwards that SourceForge started experimenting with adware/malware bundlers for projects that opted in, which somehow led us to where we are today.

I can understand the desire to do something to help Slashdot Media, but it's hard to see how permanently damaging Slashdot's reputation is going to help. As far as I can tell, they've fallen back to this classic syllogism: “We must do something. This is something. We must do this.”

Update: The Sourceforge/GIMP story is now on Slashdot, the week after it appeared everywhere else and a day after this was written, with a note about how the editor just got back from the weekend to people "freaking out that we're 'burying' this story", playing things down to make it sound like this would have been posted if it wasn't the weekend. That's not a very convincing excuse when tens of stories were posted by various editors, including the one who ended up making the Sourceforge/GIMP post, since the Sourceforge/GIMP story broke last Wednesday. The "weekend" excuse seems especially flimsy since when the Sourceforge/nmap story broke on the next Wednesday and Slashdot was under strict scrutiny for the previous delay, they were able to publish that story almost immediately on the same day, despite it having been the start of the "weekend" the last time a story broke on a Wednesday. Moreover, the Slashdot story is very careful to use terms like "modified binary" and "present third party offers" instead of "malware" or "adware".

Of course this could all just be an innocent misunderstanding, and I doubt we'll ever have enough information to know for sure either way. But Slashdot's posted excuse certainly isn't very confidence inspiring.


  1. Ironically, if you follow the link, you'll see that Slashdot's founder, CmdrTaco, is against “content getting removed for being critical of sponsors”. It's not that Slashdot wasn't biased back then; Slashdot used to be notorious for their pro-Linux pro-open source anti-MS anti-commercial bias. If you read through the comments in that link, you'll see that a lot of people lost their voting abilities after upvoting a viewpoint that runs against Slashdot's inherent bias. But it's Slashdot's bias that makes the omission of this story so remarkable. This is exactly the kind of thing Slashdot readers and moderators normally make hay about. But CmdrTaco has been gone for years, as has the old Slashdot. [return]
  2. If you want to compare YoY results, Slashdot Media pulled in $4M in Q1 2013. [return]

The googlebot monopoly

2015-05-27 08:00:00

TIL that Bell Labs and a whole lot of other websites block archive.org, not to mention most search engines. Turns out I have a broken website link in a GitHub repo, caused by the deletion of an old webpage. When I tried to pull the original from archive.org, I found that it's not available because Bell Labs blocks the archive.org crawler in their robots.txt:

User-agent: Googlebot
User-agent: msnbot
User-agent: LSgsa-crawler
Disallow: /RealAudio/
Disallow: /bl-traces/
Disallow: /fast-os/
Disallow: /hidden/
Disallow: /historic/
Disallow: /incoming/
Disallow: /inferno/
Disallow: /magic/
Disallow: /netlib.depend/
Disallow: /netlib/
Disallow: /p9trace/
Disallow: /plan9/sources/
Disallow: /sources/
Disallow: /tmp/
Disallow: /tripwire/
Visit-time: 0700-1200
Request-rate: 1/5
Crawl-delay: 5

User-agent: *
Disallow: /

In fact, Bell Labs not only blocks the Internet Archiver bot, it blocks all bots except for Googlebot, msnbot, and their own corporate bot. And msnbot was superseded by bingbot five years ago!
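
To see why that robots.txt locks everyone else out, it helps to trace how a crawler evaluates it. Below is a hypothetical, heavily simplified checker (real parsers also handle Allow lines, wildcards, and longest-prefix matching); the key behavior is that a bot with no group of its own falls through to the User-agent: * group, which disallows everything:

package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseGroups maps each user-agent token to the Disallow prefixes that apply
// to it. Consecutive User-agent lines share the rules that follow them.
func parseGroups(robots string) map[string][]string {
	groups := map[string][]string{}
	var agents []string
	sawRule := false
	scanner := bufio.NewScanner(strings.NewReader(robots))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			continue
		}
		field := strings.ToLower(strings.TrimSpace(parts[0]))
		value := strings.TrimSpace(parts[1])
		switch field {
		case "user-agent":
			if sawRule { // a rule line has been seen, so this starts a new group
				agents, sawRule = nil, false
			}
			agents = append(agents, strings.ToLower(value))
			if _, ok := groups[strings.ToLower(value)]; !ok {
				groups[strings.ToLower(value)] = []string{}
			}
		case "disallow":
			sawRule = true
			if value == "" {
				continue // an empty Disallow allows everything
			}
			for _, a := range agents {
				groups[a] = append(groups[a], value)
			}
		default:
			sawRule = true // Crawl-delay, Request-rate, etc. also end the agent list
		}
	}
	return groups
}

// allowed reports whether bot may fetch path: a bot obeys its own group if
// one exists and otherwise falls back to the "*" group.
func allowed(groups map[string][]string, bot, path string) bool {
	rules, ok := groups[strings.ToLower(bot)]
	if !ok {
		if rules, ok = groups["*"]; !ok {
			return true // no applicable group: everything is allowed
		}
	}
	for _, prefix := range rules {
		if strings.HasPrefix(path, prefix) {
			return false
		}
	}
	return true
}

func main() {
	robots := "User-agent: Googlebot\nDisallow: /tmp/\n\nUser-agent: *\nDisallow: /\n"
	groups := parseGroups(robots)
	fmt.Println(allowed(groups, "Googlebot", "/plan9/"))   // true
	fmt.Println(allowed(groups, "ia_archiver", "/plan9/")) // false: caught by "User-agent: *"
}

The catch-all group is also why being absent from the allow-list is equivalent to being banned: under a User-agent: * / Disallow: / policy, there's no way for a new crawler to be neutral.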

A quick search using a term that's only found at Bell Labs1, e.g., “This is a start at making available some of the material from the Tenth Edition Research Unix manual.”, reveals that Bing indexes the page; either bingbot follows some msnbot rules, or msnbot still runs independently and indexes sites like Bell Labs, which ban bingbot but not msnbot. Luckily, in this case, a lot of search engines (like Yahoo and DDG) use Bing results, so Bell Labs hasn't disappeared from the non-Google internet, but you're out of luck if you're one of the 55% of Russians who use Yandex.

And all that is a relatively good case, where one non-Google crawler is allowed to operate. It's not uncommon to see robots.txt files that ban everything but Googlebot. Running a competing search engine and preventing a Google monopoly is hard enough without having sites ban non-Google bots. We don't need to make it even harder, nor do we need to accidentally2 ban the Internet Archive bot.

P.S. While you're checking that your robots.txt doesn't ban everyone but Google, consider looking at your CPUID checks to make sure that you're using feature flags instead of banning everyone but Intel and AMD.

BTW, I do think there can be legitimate reasons to block crawlers, including archive.org, but I don't think that the common default many web devs have, of blocking everything but googlebot, is really intended to block competing search engines as well as archive.org.

2021 Update: since this post was first published, archive.org started ignoring robots.txt and now archives pages even on sites that block its crawler. I've heard that some competing search engines do the same thing, so this mis-use of robots.txt, where sites ban everything but googlebot, is slowly making robots.txt effectively useless, much like browsers identifying themselves as every browser in user-agent strings to work around sites that incorrectly block browsers they don't think are compatible.

A related thing is that sites will sometimes ban competing search engines, like Bing, in a fit of pique, which they wouldn't do to Google since Google provides too much traffic for them to be able to get away with that, e.g., Discourse banned Bing because they were upset that Bing was crawling Discourse at 0.46 QPS.


  1. At least until this page gets indexed. Google has a turnaround time of minutes to hours on updates to this page, which I find pretty amazing. I actually find that more impressive than seeing stuff on CNN reflected in seconds to minutes. Of course search engines are going to want to update CNN in real time. But a blog like mine? If they're crawling a niche site like mine every hour, they must also be crawling millions or tens of millions of other sites on an hourly basis and updating their index appropriately. Either that or they pull updates off of RSS, but even that requires millions or tens of millions of index updates per hour for sites with my level of traffic. [return]
  2. I don't object, in principle, to a robots.txt that prevents archive.org from archiving sites -- although the common opinion among programmers seems to be that it's a sin to block archive.org, I believe it's fine to do that if you don't want old versions of your site floating around. But it should be an informed decision, not an accident. [return]

A defense of boring languages

2015-05-25 08:00:00

Boring languages are underrated. Many appear to be rated quite highly, at least if you look at market share. But even so, they're underrated. Despite the popularity of Dan McKinley's "choose boring technology" essay, boring languages are widely panned. People who use them are too (e.g., they're a target of essays by Paul Graham and Joel Spolsky, and other people have picked up a similar attitude).

A commonly used pitch for interesting languages goes something like "Sure, you can get by with writing blub for boring work, which almost all programmers do, but if you did interesting work, then you'd want to use an interesting language". My feeling is that this has it backwards. When I'm doing boring work that's basically bottlenecked on the speed at which I can write boilerplate, it feels much nicer to use an interesting language (like F#), which lets me cut down on the amount of time spent writing boilerplate. But when I'm doing interesting work, the boilerplate is a rounding error and I don't mind using a boring language like Java, even if that means a huge fraction of the code I'm writing is boilerplate.

Another common pitch, similar to the above, is that learning interesting languages will teach you new ways to think that will make you a much more effective programmer1. I can't speak for anyone else, but I found that line of reasoning compelling when I was early in my career and learned ACL2 (a Lisp), Forth, F#, etc.; enough of it stuck that I still love F#. But, despite taking the advice that "learning a wide variety of languages that support different programming paradigms will change how you think" seriously, my experience has been that the things I've learned mostly let me crank through boilerplate more efficiently. While that's pretty great when I have a boilerplate-constrained problem, when I have a hard problem, I spend so little time on that kind of stuff that the skills I learned from writing a wide variety of languages don't really help me; instead, what helps me is having domain knowledge that gives me a good lever with which I can solve the hard problem. This explains something I'd wondered about when I finished grad school and arrived in the real world: why is it that the programmers who build the systems I find most impressive typically have deep domain knowledge rather than interesting language knowledge?

Another perspective on this is Sutton's response when asked why he robbed banks, "because that's where the money is". Why do I work in boring languages? Because that's what the people I want to work with use, and what the systems I want to work on are written in. The vast majority of the systems I'm interested in are written in boring languages. Although that technically doesn't imply that the vast majority of people I want to work with primarily use and have their language expertise in boring languages, that also turns out to be the case in practice. That means that, for greenfield work, it's also likely that the best choice will be a boring language. I think F# is great, but I wouldn't choose it over working with the people I want to work with on the problems that I want to work on.

If I look at the list of things I'm personally impressed with (things like Spanner, BigTable, Colossus, etc.), it's basically all C++, with almost all of the knockoffs in Java. When I think for a minute, the list of software written in C, C++, and Java is really pretty long. Among the transitive closure of things I use and the libraries and infrastructure used by those things, those three languages are ahead by a country mile, with PHP, Ruby, and Python rounding out the top 6. Javascript should be in there somewhere if I throw in front-end stuff, but it's so ubiquitous that making a list seems a bit pointless.

Below are some lists of software written in boring languages. These lists are long enough that I’m going to break them down into some arbitrary sublists. As is often the case, these aren’t really nice orthogonal categories and should be tags, but here we are. In the lists below, apps are categorized under “Backend” based on the main language used on the backend of a webapp. The other categories are pretty straightforward, even if their definitions are a bit idiosyncratic and perhaps overly broad.

C

Operating Systems

Linux, including variants like KindleOS
BSD
Darwin (with C++)
Plan 9
Windows (kernel in C, with some C++ elsewhere)

Platforms/Infrastructure

Memcached
SQLite
nginx
Apache
DB2
PostgreSQL
Redis
Varnish
HAProxy
AWS Lambda workers (with most of the surrounding infrastructure written in Java), according to @jayachdee

Desktop Apps

git
Gimp (with perl)
VLC
Qemu
OpenGL
FFmpeg
Most GNU userland tools
Most BSD userland tools
AFL
Emacs
Vim

C++

Operating Systems

BeOS/Haiku

Platforms/Infrastructure

GFS
Colossus
Ceph
Dremel
Chubby
BigTable
Spanner
MySQL
ZeroMQ
ScyllaDB
MongoDB
Mesos
JVM
.NET

Backend Apps

Google Search
PayPal
Figma (front-end written in C++ and cross-compiled to JS)

Desktop Apps

Chrome
MS Office
LibreOffice (with Java)
Evernote (originally in C#, converted to C++)
Firefox
Opera
Visual Studio (with C#)
Photoshop, Illustrator, InDesign, etc.
gcc
llvm/clang
Winamp
Z3
Most AAA games
Most pro audio and video production apps

Elsewhere

Also see this list and some of the links here.

Java

Platforms/Infrastructure

Hadoop
HDFS
Zookeeper
Presto
Cassandra
Elasticsearch
Lucene
Tomcat
Jetty

Backend Apps

Gmail
LinkedIn
Ebay
Most of Netflix
A large fraction of Amazon services

Desktop Apps

Eclipse
JetBrains IDEs
SmartGit
Minecraft

VHDL/Verilog

I'm not even going to make a list because basically every major microprocessor, NIC, switch, etc. is made in either VHDL or Verilog. For existing projects, you might say that this is because you have a large team that's familiar with some boring language, but I've worked on greenfield hardware/software co-design for deep learning and networking virtualization, both with teams that are hired from scratch for the project, and we still used Verilog, despite one of the teams having one of the larger collections of bluespec proficient hardware engineers anywhere outside of Arvind's group at MIT.

Please suggest other software that you think belongs on this list; it doesn't have to be software that I personally use. Also, does anyone know what EC2, S3, and Redshift are written in? I suspect C++, but I couldn't find a solid citation for that. This post was last updated 2021-08.

Appendix: meta

One thing I find interesting is that, in personal conversations with people, the vast majority of experienced developers I know think that most mainstream languages are basically fine, modulo performance constraints, and this is even more true among people who've built systems that are really impressive to me. Online discussion of what someone might want to learn is very different, with learning interesting/fancy languages being generally high up on people's lists. When I talk to new programmers, they're often pretty influenced by this (e.g., at Recurse Center, before ML became trendy, learning fancy languages was the most popular way people tried to become better as a programmer, and I'd say that's now #2 behind ML). While I think learning a fancy language does work for some people, I'd say that's overrated in that there are many other techniques that seem to click with at least the same proportion of people who try it that are much less popular.

A question I have is: why is online discussion about this topic so one-sided when the discussions I've had in real life are one-sided in the opposite direction? Of course, neither people who are loud on the internet nor people I personally know are representative samples of programmers, but I still find it interesting.

Thanks to Leah Hanson, James Porter, Waldemar Q, Nat Welch, Arjun Sreedharan, Rafa Escalante, @matt_dz, Bartlomiej Filipek, Josiah Irwin, @jayachdee, Larry Ogrondek, Miodrag Milic, Presto, Matt Godbolt, Noah Haasis, Lifan Zeng, and @[email protected] for comments/corrections/discussion.


  1. a variant of this argument goes beyond teaching you techniques and says that the languages you know determine what you think via the Sapir-Whorf hypothesis. I don't personally find this compelling since, when I'm solving hard problems, I don't think about things in a programming language. YMMV if you think in a programming language, but I think of an abstract solution and then translate the solution to a language, so having another language in my toolbox can, at most, help me think of better translations and save on translation. [return]

Advantages of monorepos

2015-05-17 08:00:00

Here's a conversation I keep having:

Someone: Did you hear that Facebook/Google uses a giant monorepo? WTF!
Me: Yeah! It's really convenient, don't you think?
Someone: That's THE MOST RIDICULOUS THING I've ever heard. Don't FB and Google know what a terrible idea it is to put all your code in a single repo?
Me: I think engineers at FB and Google are probably familiar with using smaller repos (doesn't Junio Hamano work at Google?), and they still prefer a single huge repo for [reasons].
Someone: Oh that does sound pretty nice. I still think it's weird but I could see why someone would want that.

“[reasons]” is pretty long, so I'm writing this down in order to avoid repeating the same conversation over and over again.

Simplified organization

With multiple repos, you typically either have one project per repo, or an umbrella of related projects per repo, but that forces you to define what a “project” is for your particular team or company, and it sometimes forces you to split and merge repos for reasons that are pure overhead. For example, having to split a project because it's too big or has too much history for your VCS is not optimal.

With a monorepo, projects can be organized and grouped together in whatever way you find to be most logically consistent, and not just because your version control system forces you to organize things in a particular way. Using a single repo also reduces overhead from managing dependencies.

A side effect of the simplified organization is that it's easier to navigate projects. The monorepos I've used let you essentially navigate as if everything is on a networked file system, re-using the idiom that's used to navigate within projects. Multi repo setups usually have two separate levels of navigation -- the filesystem idiom that's used inside projects, and then a meta-level for navigating between projects.

A side effect of that side effect is that, with monorepos, it's often the case that it's very easy to get a dev environment set up to run builds and tests. If you expect to be able to navigate between projects with the equivalent of cd, you also expect to be able to do cd; make. Since it seems weird for that to not work, it usually works, and whatever tooling effort is necessary to make it work gets done1. While it's technically possible to get that kind of ease in multiple repos, it's not as natural, which means that the necessary work isn't done as often.

Simplified dependencies

This probably goes without saying, but with multiple repos, you need to have some way of specifying and versioning dependencies between them. That sounds like it ought to be straightforward, but in practice, most solutions are cumbersome and involve a lot of overhead.

With a monorepo, it's easy to have one universal version number for all projects. Since atomic cross-project commits are possible (though these tend to split into many parts for practical reasons at large companies), the repository can always be in a consistent state -- at commit #X, all project builds should work. Dependencies still need to be specified in the build system, but whether that's make Makefiles or bazel BUILD files, those can be checked into version control like everything else. And since there's just one version number, the Makefiles or BUILD files or whatever you choose don't need to specify version numbers.

Tooling

The simplification of navigation and dependencies makes it much easier to write tools. Instead of having tools that must understand relationships between repositories, as well as the nature of files within repositories, tools basically just need to be able to read files (including some file format that specifies dependencies between units within the repo).

This sounds like a trivial thing, but take this example by Christopher Van Arsdale of how easy builds can become:

The build system inside of Google makes it incredibly easy to build software using large modular blocks of code. You want a crawler? Add a few lines here. You need an RSS parser? Add a few more lines. A large distributed, fault tolerant datastore? Sure, add a few more lines. These are building blocks and services that are shared by many projects, and easy to integrate. … This sort of Lego-like development process does not happen as cleanly in the open source world. … As a result of this state of affairs (more speculation), there is a complexity barrier in open source that has not changed significantly in the last few years. This creates a gap between what is easily obtainable at a company like Google versus a[n] open sourced project.

The system that Arsdale is referring to is so convenient that, before it was open sourced, ex-Google engineers at Facebook and Twitter wrote their own versions of bazel in order to get the same benefits.

It's theoretically possible to create a build system that makes building anything, with any dependencies, simple without having a monorepo, but it's more effort, enough effort that I've never seen a system that does it seamlessly. Maven and sbt are pretty nice, in a way, but it's not uncommon to lose a lot of time tracking down and fixing version dependency issues. Systems like rbenv and virtualenv try to sidestep the problem, but they result in a proliferation of development environments. Using a monorepo where HEAD always points to a consistent and valid version removes the problem of tracking multiple repo versions entirely2.

Build systems aren't the only thing that benefits from running on a monorepo. Just for example, static analysis can run across project boundaries without any extra work. Many other things, like cross-project integration testing and code search, are also greatly simplified.

Cross-project changes

With lots of repos, making cross-repo changes is painful. It typically involves tedious manual coordination across each repo or hack-y scripts. And even if the scripts work, there's the overhead of correctly updating cross-repo version dependencies. Refactoring an API that's used across tens of active internal projects will probably take a good chunk of a day. Refactoring an API that's used across thousands of active internal projects is hopeless.

With a monorepo, you just refactor the API and all of its callers in one commit. That's not always trivial, but it's much easier than it would be with lots of small repos. I've seen APIs with thousands of usages across hundreds of projects get refactored, and with a monorepo setup it's so easy that no one even thinks twice.

Most people now consider it absurd to use a version control system like CVS, RCS, or ClearCase, where it's impossible to do a single atomic commit across multiple files, forcing people to either manually look at timestamps and commit messages or keep meta information around to determine if some particular set of cross-file changes are “really” atomic. SVN, hg, git, etc solve the problem of atomic cross-file changes; monorepos solve the same problem across projects.

This isn't just useful for large-scale API refactorings. David Turner, who worked on Twitter's migration from many repos to a monorepo, gives this example of a small cross-cutting change and the overhead of having to do releases for those:

I needed to update [Project A], but to do that, I needed my colleague to fix one of its dependencies, [Project B]. The colleague, in turn, needed to fix [Project C]. If I had had to wait for C to do a release, and then B, before I could fix and deploy A, I might still be waiting. But since everything's in one repo, my colleague could make his change and commit, and then I could immediately make my change.

I guess I could do that if everything were linked by git versions, but my colleague would still have had to do two commits. And there's always the temptation to just pick a version and "stabilize" (meaning, stagnate). That's fine if you just have one project, but when you have a web of projects with interdependencies, it's not so good.

[In the other direction,] Forcing dependees to update is actually another benefit of a monorepo.

It's not just that making cross-project changes is easier, tracking them is easier, too. To do the equivalent of git bisect across multiple repos, you must be disciplined about using another tool to track meta information, and most projects simply don't do that. Even if they do, you now have two really different tools where one would have sufficed.

Ironically, there's a sense in which this benefit decreases as the company gets larger. At Twitter, which isn't exactly small, David Turner got a lot of value out of being able to ship cross-project changes. But at a Google-sized company, large commits can be large enough that it makes sense to split them into many smaller commits for a variety of reasons, which necessitates tooling that can effectively split up large conceptually atomic changes into many non-atomic commits.

Mercurial and git are awesome; it's true

The most common response I've gotten to these points is that switching to either git or hg from either CVS or SVN is a huge productivity win. That's true. But a lot of that is because git and hg are superior in multiple respects (e.g., better merging), not because having small repos is better per se.

In fact, Twitter has been patching git and Facebook has been patching Mercurial in order to support giant monorepos.

Downsides

Of course, there are downsides to using a monorepo. I'm not going to discuss them because the downsides are already widely discussed. Monorepos aren't strictly superior to manyrepos. They're not strictly worse, either. My point isn't that you should definitely switch to a monorepo; it's merely that using a monorepo isn't totally unreasonable, that folks at places like Google, Facebook, Twitter, Digital Ocean, and Etsy might have good reasons for preferring a monorepo over hundreds or thousands or tens of thousands of smaller repos.

Other discussion

Gregory Szorc. Facebook. Benjamin Pollack (one of the co-creators of Kiln). Benjamin Eberlei. Simon Stewart. Digital Ocean. Google. Twitter. thedufer. Paul Hammant.

Thanks to Kamal Marhubi, David Turner, Leah Hanson, Mindy Preston, Chris Ball, Daniel Espeset, Joe Wilder, Nicolas Grilly, Giovanni Gherdovich, Paul Hammant, Juho Snellman, and Simon Thulbourn for comments/corrections/discussion.


  1. This was even true at a hardware company I worked at which created a monorepo by versioning things in RCS over NFS. Of course you can't let people live edit files in the central repository so someone wrote a number of scripts that basically turned this into perforce. I don't recommend this system, but even with an incredibly hacktastic monorepo, you still get a lot of the upsides of a monorepo. [return]
  2. At least as long as you have some mechanism for vendoring upstream dependencies. While this works great for Google because Google writes a large fraction of the code it relies on, and has enough employees that tossing all external dependencies into the monorepo has a low cost amortized across all employees, I could imagine this advantage being too expensive to take advantage of for smaller companies. [return]

We used to build steel mills near cheap power. Now that's where we build datacenters

2015-05-04 08:00:00

Why are people so concerned with hardware power consumption nowadays? Some common answers to this question are that power is critically important for phones, tablets, and laptops and that we can put more silicon on a modern chip than we can effectively use. In 2001 Patrick Gelsinger observed that if scaling continued at then-current rates, chips would have the power density of a nuclear reactor by 2005, a rocket nozzle by 2010, and the surface of the sun by 2015, implying that power density couldn't continue on its then-current path. Although this was already fairly obvious at the time, now that it's 2015, we can be extra sure that power density didn't continue to grow at unbounded rates. Anyway, portables and scaling limits are both valid and important reasons, but since they're widely discussed, I'm going to talk about an underrated reason.

People often focus on the portable market because it's cannibalizing the desktop market, but that's not the only growth market -- servers are also becoming more important than desktops, and power is really important for servers. To see why power is important for servers, let's look at some calculations from Hennessy & Patterson about what it costs to run a datacenter.

One of the issues is that you pay for power multiple times. Some power is lost at the substation, although we might not have to pay for that directly. Then we lose more storing energy in a UPS. The figure below states 6%, but smaller scale datacenters can easily lose twice that. After that, we lose more power stepping down the power to a voltage that a server can accept. That's over a 10% loss for a setup that's pretty efficient.

After that, we lose more power in the server's power supply, stepping down the voltage to levels that are useful inside a computer, which is often about another 10% loss (not pictured in the figure below).

Power conversion figure from Hennessy & Patterson, which reproduced the figure from Hamilton

And then once we get the power into servers, it gets turned into waste heat. To keep the servers from melting, we have to pay for power to cool them. Barroso and Holzle estimated that 30%-50% of the power drawn by a datacenter is used for chillers, and that an additional 10%-20% is for the CRAC (air circulation). That means for every watt of power used in the server, we pay for another 1-2 watts of support power.
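
Here's a rough back-of-the-envelope sketch, in code, of how those losses compound. The split of the distribution loss between the UPS and the step-down stages is my assumption (chosen to roughly match the numbers above), and where exactly the cooling overhead is counted is simplified; the point is just that the factors multiply:

package main

import "fmt"

func main() {
	const (
		upsEff      = 0.94 // ~6% lost storing/conditioning power in the UPS
		stepDownEff = 0.95 // assumed split: UPS plus step-down is a bit over 10% total
		psuEff      = 0.90 // ~10% lost in the server's own power supply
	)

	// Watts the server draws at its plug for each watt that reaches the
	// components inside it.
	serverWattPerUseful := 1.0 / psuEff

	// Watts that have to enter the facility for each watt delivered to a
	// server's plug.
	facilityWattPerServer := 1.0 / (upsEff * stepDownEff)

	fmt.Printf("server draw per useful W: %.2f W\n", serverWattPerUseful)
	fmt.Printf("facility draw per server W: %.2f W\n", facilityWattPerServer)

	// The estimate above: another 1-2 W of chiller/CRAC power per watt used
	// in the server.
	for _, cooling := range []float64{1.0, 2.0} {
		total := serverWattPerUseful * (facilityWattPerServer + cooling)
		fmt.Printf("total with %.0f W cooling per server W: %.2f W per useful W\n", cooling, total)
	}
}

So each useful watt inside a server ends up costing very roughly 2.4 to 3.5 watts at the facility, before counting the capital cost of the power and cooling infrastructure itself.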

And to actually get all this power, we have to pay for the infrastructure required to get the power into and throughout the datacenter. Hennessy & Patterson estimate that of the $90M cost of an example datacenter (just the facilities -- not the servers), 82% is associated with power and cooling1. The servers in the datacenter are estimated to only cost $70M. It's not fair to compare those numbers directly since servers need to get replaced more often than datacenters; once you take into account the cost over the entire lifetime of the datacenter, the amortized cost of power and cooling comes out to be 33% of the total cost, when servers have a 3 year lifetime and infrastructure has a 10-15 year lifetime.

If we look at all the costs, the breakdown is:

category: %
server machines: 53
power & cooling infra: 20
power use: 13
networking: 8
other infra: 4
humans: 2

Power use and people are the cost of operating the datacenter (OPEX), whereas server machines, networking, power & cooling infra, and other infra are capital expenditures that are amortized across the lifetime of the datacenter (CAPEX).

Computation uses a lot of power. We used to build steel mills near cheap sources of power, but now that's where we build datacenters. As companies start considering the full cost of applications, we're seeing a lot more power optimized solutions2. Unfortunately, this is really hard. On the software side, with the exceptions of toy microbenchmark examples, best practices for writing power efficient code still aren't well understood. On the hardware side, Intel recently released a new generation of chips with significantly improved performance per watt that doesn't have much better absolute performance than the previous generation. On the hardware accelerator front, some large companies are building dedicated power-efficient hardware for specific computations. But with existing tools, hardware accelerators are costly enough that dedicated hardware only makes sense for the largest companies. There isn't an easy answer to this problem.

If you liked this post, you'd probably like chapter 6 of Hennessy & Patterson, which walks through not only the cost of power, but a number of related back of the envelope calculations relating to datacenter performance and cost.

Apologies for the quickly scribbled down post. I jotted this down shortly before signing an NDA for an interview where I expected to learn some related information and I wanted to make sure I had my thoughts written down before there was any possibility of being contaminated with information that's under NDA.

Thanks to Justin Blank for comments/corrections/discussion.


  1. Although this figure is widely cited, I'm unsure about the original source. This is probably the most suspicious figure in this entire post. Hennessy & Patterson cite “Hamilton 2010”, which appears to be a reference to this presentation. That presentation doesn't make the source of the number obvious, although this post by Hamilton does cite a reference for that figure, but the citation points to this post, which seems to be about putting datacenters in tents, not the fraction of infrastructure that's dedicated to power and cooling.

    Some other works, such as this one cite this article. However, that article doesn't directly state 82% anywhere, and it makes a number of estimates that the authors acknowledge are very rough, with qualifiers like “While, admittedly, the authors state that there is a large error band around this equation, it is very useful in capturing the magnitude of infrastructure cost.”

    [return]
  2. That being said, power isn't everything -- Reddi et al. looked at replacing conventional chips with low-power chips for a real workload (MS Bing) and found that while they got an improvement in power use per query, tail latency increased significantly, especially when servers were heavily loaded. Since Bing has a mechanism that causes query-related computations to terminate early if latency thresholds are hit, the result is both higher latency and degraded search quality. [return]

Reading citations is easier than most people think

2015-03-29 08:00:00

It's really common to see claims that some meme is backed by “studies” or “science”. But when I look at the actual studies, it usually turns out that the data are opposed to the claim. Here are the last few instances of this that I've run across.

Dunning-Kruger

A pop-sci version of Dunning-Kruger, the most common one I see cited, is that the less someone knows about a subject, the more they think they know. Another pop-sci version is that people who know little about something overestimate their expertise because their lack of knowledge fools them into thinking that they know more than they do. The actual claim Dunning and Kruger make is much weaker than the first pop-sci claim and, IMO, the evidence is weaker than the second claim. The original paper isn't much longer than most of the incorrect pop-sci treatments of the paper, and we can get a pretty good idea of the claims by looking at the four figures included in the paper. In the graphs below, “perceived ability” is a subjective self rating, and “actual ability” is the result of a test.

Dunning-Kruger graph Dunning-Kruger graph Dunning-Kruger graph Dunning-Kruger graph

In two of the four cases, there's an obvious positive correlation between perceived skill and actual skill, which is the opposite of the first pop-sci conception of Dunning-Kruger that we discussed. As for the second, we can see that people at the top end also don't rate themselves correctly, so the explanation that the results are driven by people who don't know much about a subject being unable to judge their own ability (an easy interpretation to have of the study, given its title, Unskilled and Unaware of It: How Difficulties in Recognizing One's Own Incompetence Lead to Inflated Self-Assessments) is insufficient, because it doesn't explain why people at the top of the charts have what appears to be, at least under the conditions of the study, a symmetrically incorrect guess about their skill level. One could argue that there's a completely different effect that just happens to cause the same, roughly linear, slope in perceived ability that people who are "unskilled and unaware of it" have. But if there's a plausible simpler explanation, positing two separate effects seems overly complicated without additional evidence (which, if any exists, is not provided in the paper)1.

A plausible explanation of why perceived skill is compressed, especially at the low end, is that few people want to rate themselves as below average or as the absolute best, shrinking the scale but keeping a roughly linear fit. The crossing point of the scales is above the median, indicating that people, on average, overestimate themselves, but that's not surprising given the population tested (more on this later). In the other two cases, the correlation is very close to zero. It could be that the effect is different for different tasks, or it could just be that the sample size is small and that the differences between the tasks are noise. It could also be that the effect comes from the specific population sampled (students at Cornell, who are probably actually above average in many respects). If you look up Dunning-Kruger on Wikipedia, it claims that a replication of Dunning-Kruger on East Asians shows the opposite result (perceived skill is lower than actual skill, and the greater the skill, the greater the difference), and that the effect is possibly just an artifact of American culture, but the citation is actually a link to an editorial which mentions a meta-analysis on East Asian confidence, so that might be another example of a false citation. Or maybe it's just a link to the wrong source. In any case, the effect certainly isn't that the more people know, the less they think they know.

Income & Happiness

It's become common knowledge that money doesn't make people happy. As of this writing, a Google search for happiness income returns a knowledge card that making more than $75k/year has no impact on happiness. Other top search results claim the happiness ceiling occurs at $10k/year, $30k/year, $40k/year and $75k/year.

Google knowledge card says that $75k should be enough for anyone

Not only is that wrong, the wrongness is robust across every country studied, too.

People with more income are happier

That happiness is correlated with income doesn't come from cherry picking one study. That result holds across five iterations of the World Values Survey (1981-1984, 1989-1993, 1994-1999, 2000-2004, and 2005-2009), three iterations of the Pew Global Attitudes Survey (2002, 2007, 2010), five iterations of the International Social Survey Program (1991, 1998, 2001, 2007, 2008), and a large scale Gallup survey.

The graph above has income on a log scale; if you pick a country and graph the results on a linear scale, you get something like this.

Best fit log to happiness vs. income

As with all graphs of a log function, it looks like the graph is about to level off, which results in interpretations like the following:

Distorted log graph

That's an actual graph from an article that claims that income doesn't make people happy. These vaguely log-like graphs that level off are really common. If you want to see more of these, try an image search for “happiness income”. My favorite is the one where people who make enough money literally hit the top of the scale. Apparently, there's a dollar value which not only makes you happy, it makes you as happy as it is possible for humans to be.

As with Dunning-Kruger, you can look at the graphs in the papers to see what's going on. It's a little easier to see why people would pass along the wrong story here, since it's easy to misinterpret the data when it's plotted against a linear scale, but it's still pretty easy to see what's going on by taking a peek at the actual studies.
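To make that concrete with a quick back-of-the-envelope sketch (a and b here are placeholder coefficients, not fitted values from any of the studies above): if the best-fit relationship is roughly

\[ \mathrm{happiness} \approx a + b \ln(\mathrm{income}) \]

then each doubling of income adds the same fixed increment, b ln 2, and the slope, b/income, keeps shrinking but never reaches zero. Plotted against a linear income axis, that always looks like a curve that's about to flatten out, even though there's no income level at which additional income stops being associated with additional happiness.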

Hedonic Adaptation & Happiness

The idea that people bounce back from setbacks (as well as positive events) and return to a fixed level of happiness entered the popular consciousness after Daniel Gilbert wrote about it in a popular book.

But even without looking at the literature on adaptation to adverse events, the previous section on wealth should cast some doubt on this. If people rebound from both bad events and good, how is it that making more money causes people to be happier?

Turns out, the idea that people adapt to negative events and return to their previous set-point is a myth. Although the exact effects vary depending on the bad event, disability2, divorce3, loss of a partner4, and unemployment5 all have long-term negative effects on happiness. Unemployment is the one event that can be undone relatively easily, but the effects persist even after people become reemployed. I'm only citing four studies here, but a meta analysis of the literature shows that the results are robust across existing studies.

The same thing applies to positive events. While it's “common knowledge” that winning the lottery doesn't make people happier, it turns out that isn't true, either.

In both cases, early cross-sectional results indicated that it's plausible that extreme events, like winning the lottery or becoming disabled, don't have long term effects on happiness. But the longitudinal studies that follow individuals and measure the happiness of the same person over time as events happen show the opposite result -- events do, in fact, affect happiness. For the most part, these aren't new results (some of the initial results predate Daniel Gilbert's book), but the older results based on less rigorous studies continue to propagate faster than the corrections.

Chess position memorization

I frequently see citations claiming that, while experts can memorize chess positions better than non-experts, the advantage completely goes away when positions are randomized. When people refer to a specific citation, it's generally Chase and Simon's 1973 paper Perception in Chess, a "classic" that has been cited a whopping 7449 times in the literature and says:

De Groat did, however, find an intriguing difference between masters and weaker players in his short-term memory experiments. Masters showed a remarkable ability to reconstruct a chess position almost perfectly after viewing it for only 5 sec. There was a sharp dropoff in this ability for players below the master level. This result could not be attributed to the masters’ generally superior memory ability, for when chess positions were constructed by placing the same numbers of pieces randomly on the board, the masters could then do no better in reconstructing them than weaker players, Hence, the masters appear to be constrained by the same severe short-term memory limits as everyone else ( Miller, 1956), and their superior performance with "meaningful" positions must lie in their ability to perceive structure in such positions and encode them in chunks. Specifically, if a chess master can remember the location of 20 or more pieces on the board, but has space for only about five chunks in short-term memory, then each chunk must be composed of four or five pieces, organized in a single relational structure.

The paper then runs an experiment which "proves" that master-level players actually do worse than beginners when memorizing random mid-game positions even though they do much better memorizing real mid-game positions (and, in end-game positions, they do about the same as beginners when positions are randomized). Unfortunately, the paper used an absurdly small sample size of one chess player at each skill level.

A quick search indicates that this result does not reproduce with larger sample sizes, e.g., Gobet and Simon, in "Recall of rapidly presented random chess positions is a function of skill", say

A widely cited result asserts that experts’ superiority over novices in recalling meaningful material from their domain of expertise vanishes when they are confronted with random material. A review of recent chess experiments in which random positions served as control material (presentation time between 3 and 10 sec) shows, however, that strong players generally maintain some superiority over weak players even with random positions, although the relative difference between skill levels is much smaller than with game positions. The implications of this finding for expertise in chess are discussed and the question of the recall of random material in other domains is raised.

They find this scales with skill level and, e.g., for "real" positions, 2350+ ELO players memorized ~2.2x the number of correct pieces that 1600-2000 ELO players did, but the difference was ~1.6x for random positions (these ratios are from eyeballing a graph and may be a bit off). 1.6x is smaller than 2.2x, but it's certainly not the claimed 1.0.

I've also seen this result cited to claim that it applies to other fields, but in a quick search for attempts to apply it to other fields, the results either show something similar (a smaller but still observable difference on randomized material) or don't reproduce. For example, McKeithen did this for programmers and found that, when memorizing a "normal" program, experts were ~2.5x better than beginners on the first trial and 3x better by the 6th trial, whereas on the "scrambled" program, experts were 3x better on the first trial but only ~1.5x better by the 6th trial. Despite this result contradicting Chase and Simon, I've seen people cite it to support the same claim as Chase and Simon, presumably because they didn't read what McKeithen actually wrote.

Type Systems

Unfortunately, false claims about studies and evidence aren't limited to pop-sci memes; they're everywhere in both software and hardware development. For example, see this comment from a Scala/FP "thought leader":

Tweet claiming that any doubt that type systems are helpful is equivalent to being an anti-vaxxer

I see something like this at least once a week. I'm picking this example not because it's particularly egregious, but because it's typical. If you follow a few of the big-time FP proponents on twitter, you'll regularly see claims that there's very strong empirical evidence and extensive studies backing up the effectiveness of type systems.

However, a review of the empirical evidence shows that the evidence is mostly incomplete, and that it's equivocal where it's not incomplete. Of all the false memes, I find this one to be the hardest to understand. In the other cases, I can see a plausible mechanism by which results could be misinterpreted. “Relationship is weaker than expected” can turn into “relationship is opposite of expected”, log can look a lot like an asymptotic function, and preliminary results using inferior methods can spread faster than better conducted follow-up studies. But I'm not sure what the connection between the evidence and the beliefs is in this case.

Is this preventable?

I can see why false memes might spread quickly, even when they directly contradict reliable sources. Reading papers sounds like a lot of work. It sometimes is. But it's often not. Reading a pure math paper is usually a lot of work. Reading an empirical paper to determine if the methodology is sound can be a lot of work. For example, biostatistics and econometrics papers tend to apply completely different methods, and it's a lot of work to get familiar enough with the set of methods used in any particular field to understand precisely when they're applicable and what holes they have. But reading empirical papers just to see what claims they make is usually pretty easy.

If you read the abstract and conclusion, and then skim the paper for interesting bits (graphs, tables, telling flaws in the methodology, etc.), that's enough to see if popular claims about the paper are true in most cases. In my ideal world, you could get that out of just reading the abstract, but it's not uncommon for papers to make claims in the abstract that are much stronger than the claims made in the body of the paper, so you need to at least skim the paper.

Maybe I'm being naive here, but I think a major reason behind false memes is that checking sources sounds much harder and more intimidating than it actually is. A striking example of this is when Quartz published its article on how there isn't a gender gap in tech salaries, which cited multiple sources that showed the exact opposite. Twitter was abuzz with people proclaiming that the gender gap has disappeared. When I published a post which did nothing but quote the actual cited studies, many of the same people then proclaimed that their original proclamation was mistaken. It's great that they were willing to tweet a correction6, but as far as I can tell no one actually went and read the source data, even though the graphs and tables make it immediately obvious that the author of the original Quartz article was pushing an agenda, not even with cherry picked citations, but citations that showed the opposite of their thesis.

Unfortunately, it's in the best interests of non-altruistic people who do read studies to make it seem like reading studies is difficult. For example, when I talked to the founder of a widely used pay-walled site that reviews evidence on supplements and nutrition, he claimed that it was ridiculous to think that "normal people" could interpret studies correctly and that experts are needed to read and summarize studies for the masses. But he's just a serial entrepreneur who realized that you can make a lot of money by reading studies and summarizing the results! A more general example is how people sometimes try to maintain an authoritative air by saying that you need certain credentials or markers of prestige to really read or interpret studies.

There are certainly fields where you need some background to properly interpret a study, but even then, the amount of knowledge that a degree contains is quite small and can be picked up by anyone. For example, excluding lab work (none of which contained critical knowledge for interpreting results), I was within a small constant factor of spending one hour of time per credit hour in school. At that conversion rate, an engineering degree from my alma mater costs a bit more than 100 hours and almost all non-engineering degrees land at less than 40 hours, with a large amount of overlap between them because a lot of degrees will require the same classes (e.g., calculus). Gatekeeping the reading and interpretation of studies on whether or not someone has a credential like a degree is absurd when someone can spend a week's worth of time gaining the knowledge that a degree offers.

If you liked this post, you'll probably enjoy this post on odd discontinuities, this post on how the effect of markets on discrimination is more nuanced than it's usually made out to be, and this other post discussing some common misconceptions.

2021 update

In retrospect, I think the mystery of the "type systems" example is simple: it's a different kind of fake citation than the others. In the first three examples, a clever, contrarian, but actually wrong idea got passed around. This makes sense because people love clever, contrarian ideas and don't care very much if they're wrong, so clever, contrarian ideas frequently go viral out of proportion to their correctness.

For the type systems example, it's just that people commonly fabricate evidence and then appeal to authority to support their position. In the post, I was confused because I couldn't see how anyone could look at the evidence and then make the claims that type system advocates do but, after reading thousands of discussions from people advocating for their pet tool/language/practice, I can see that it was naive of me to think that these advocates would even consider looking for evidence as opposed to just pretending that evidence exists without ever having looked.

Thanks to Leah Hanson, Lindsey Kuper, Jay Weisskopf, Joe Wilder, Scott Feeney, Noah Ennis, Myk Pono, Heath Borders, Nate Clark, and Mateusz Konieczny for comments/corrections/discussion.

BTW, if you're going to send me a note to tell me that I'm obviously wrong, please make sure that I'm actually wrong. In general, I get great feedback and I've learned a lot from the feedback that I've gotten, but the feedback I've gotten on this post has been unusually poor. Many people have suggested that the studies I've referenced have been debunked by some other study I clearly haven't read, but in every case so far, I've already read the other study.


  1. Dunning and Kruger claim, without what I'd consider strong evidence, that this is because people who perform well overestimate how well other people perform. While that may be true, one could also say that the explanation for people who are "unskilled" is that they underestimate how well other people perform. "Phase 2" attempts to establish that's not the case, but I don't find the argument convincing for a number of reasons. To pick one example, at the end of the section, they say "Despite seeing the superior performances of their peers, bottom-quartile participants continued to hold the mistaken impression that they had performed just fine.", but we don't know that the participants believed that they performed fine, we just know what their perceived percentile is. It's possible to believe that you're performing poorly while also being in a high percentile (and I frequently have this belief for activities I haven't seriously practiced or studied, which seems likely to be the case, with respect to those tasks, for the participants of the Dunning-Kruger study who scored poorly). [return]
  2. Long-term disability is associated with lasting changes in subjective well-being: evidence from two nationally representative longitudinal studies.

    Hedonic adaptation refers to the process by which individuals return to baseline levels of happiness following a change in life circumstances. Two nationally representative panel studies (Study 1: N = 39,987; Study 2: N = 27,406) were used to investigate the extent of adaptation that occurs following the onset of a long-term disability. In Study 1, 679 participants who acquired a disability were followed for an average of 7.18 years before and 7.39 years after onset of the disability. In Study 2, 272 participants were followed for an average of 3.48 years before and 5.31 years after onset. Disability was associated with moderate to large drops in happiness (effect sizes ranged from 0.40 to 1.27 standard deviations), followed by little adaptation over time.

    [return]
  3. Time does not heal all wounds

    Cross-sectional studies show that divorced people report lower levels of life satisfaction than do married people. However, such studies cannot determine whether satisfaction actually changes following divorce. In the current study, data from an 18-year panel study of more than 30,000 Germans were used to examine reaction and adaptation to divorce. Results show that satisfaction drops as one approaches divorce and then gradually rebounds over time. However, the return to baseline is not complete. In addition, prospective analyses show that people who will divorce are less happy than those who stay married, even before either group gets married. Thus, the association between divorce and life satisfaction is due to both preexisting differences and lasting changes following the event.

    [return]
  4. Reexamining adaptation and the set point model of happiness: Reactions to changes in marital status.

    According to adaptation theory, individuals react to events but quickly adapt back to baseline levels of subjective well-being. To test this idea, the authors used data from a 15-year longitudinal study of over 24,000 individuals to examine the effects of marital transitions on life satisfaction. On average, individuals reacted to events and then adapted back toward baseline levels. However, there were substantial individual differences in this tendency. Individuals who initially reacted strongly were still far from baseline years later, and many people exhibited trajectories that were in the opposite direction to that predicted by adaptation theory. Thus, marital transitions can be associated with long-lasting changes in satisfaction, but these changes can be overlooked when only average trends are examined.

    [return]
  5. Unemployment Alters the Set-Point for Life Satisfaction

    According to set-point theories of subjective well-being, people react to events but then return to baseline levels of happiness and satisfaction over time. We tested this idea by examining reaction and adaptation to unemployment in a 15-year longitudinal study of more than 24,000 individuals living in Germany. In accordance with set-point theories, individuals reacted strongly to unemployment and then shifted back toward their baseline levels of life satisfaction. However, on average, individuals did not completely return to their former levels of satisfaction, even after they became reemployed. Furthermore, contrary to expectations from adaptation theories, people who had experienced unemployment in the past did not react any less negatively to a new bout of unemployment than did people who had not been previously unemployed. These results suggest that although life satisfaction is moderately stable over time, life events can have a strong influence on long-term levels of subjective well-being.

    [return]
  6. One thing I think it's interesting to look at is how you can see the opinions of people who are cagey about revealing their true opinions in which links they share. For example, Scott Alexander and Tyler Cowen both linked to the bogus gender gap article as something interesting to read and tend to link to things that have the same view.

    If you naively read their writing, it appears as if they're impartially looking at evidence about how the world works, which they then share with people. But when you observe that they regularly share evidence that supports one narrative, regardless of quality, and don't share evidence that supports the opposite narrative, it would appear that they have a strong opinion on the issue that they reveal via what they link to.

    [return]

Given that we spend little on testing, how should we test software?

2015-03-10 08:00:00

I've been reading a lot about software testing lately. Coming from a hardware background (CPUs and hardware accelerators), it's interesting how different software testing is. Bugs in software are much easier to fix, so it makes sense to spend a lot less effort on testing. Because less effort is spent on testing, methodologies differ; software testing is biased away from methods with high fixed costs, towards methods with high variable costs. But that doesn't explain all of the differences, or even most of the differences. Most of the differences come from a cultural path dependence, which shows how non-optimally test effort is allocated in both hardware and software.

I don't really know anything about software testing, but here are some notes from what I've seen at Google, on a few open source projects, and in a handful of papers and demos. Since I'm looking at software, I'm going to avoid talking about how hardware testing isn't optimal, but I find that interesting, too.

Manual Test Generation

From what I've seen, most test effort on most software projects comes from handwritten tests. On the hardware projects I know of, writing tests by hand consumed somewhere between 1% and 25% of the test effort and was responsible for a much smaller percentage of the actual bugs found. Manual testing is considered ok for sanity checking, and sometimes ok for really dirty corner cases, but it's not scalable and too inefficient to rely on.

It's true that there's some software that's difficult to do automated testing on, but the software projects I've worked on have relied pretty much totally on manual testing despite being in areas that are among the easiest to test with automated testing. As far as I can tell, that's not because someone did a calculation of the tradeoffs and decided that manual testing was the way to go, it's because it didn't occur to people that there were alternatives to manual testing.

So, what are the alternatives?

Random Test Generation

The good news is that random testing is easy to implement. You can spend an hour implementing a random test generator and find tens of bugs, or you can spend more time and find thousands of bugs.

You can start with something that's almost totally random and generates incredibly dumb tests. As you spend more time on it, you can add constraints and generate smarter random tests that find more complex bugs. Some good examples of this are jsfunfuzz, which started out relatively simple and gained smarts as time went on, and Jepsen, which originally checked some relatively simple constraints and can now check linearizability.

While you can generate random tests pretty easily, it still takes some time to write a powerful framework or collection of functions. Luckily, this space is well covered by existing frameworks.

Random Test Generation, Framework

Here's an example of how simple it is to write JavaScript tests using Scott Feeney's gentest, taken from the gentest readme.

You want to test something like

function add(x, y) {
  return x + y;
}

To check that addition commutes, you'd write

var gentest = require('gentest');
var t = gentest.types;

forAll([t.int, t.int], 'addition is commutative', function(x, y) {
  return add(x, y) === add(y, x);
});

Instead of checking the values by hand, or writing the code to generate the values, the framework handles that and generates tests for you once you specify the constraints. QuickCheck-like generative test frameworks tend to be simple enough that they're no harder to learn how to use than any other unit test or mocking framework.
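The same pattern is available out of the box in a lot of languages. For example, here's a minimal sketch of the same commutativity property using Go's standard-library testing/quick package (an illustration of the pattern, not something from the gentest readme):

package add_test

import (
	"testing"
	"testing/quick"
)

func add(x, y int) int { return x + y }

// TestAddCommutes asks testing/quick to generate random pairs of ints and
// checks the property for each generated pair.
func TestAddCommutes(t *testing.T) {
	commutative := func(x, y int) bool {
		return add(x, y) == add(y, x)
	}
	if err := quick.Check(commutative, nil); err != nil {
		t.Error(err)
	}
}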

You'll sometimes hear objections about how random testing can only find shallow bugs because random tests are too dumb to find really complex bugs. For one thing, that assumes that you don't specify constraints that allow the random generator to generate intricate test cases. But even then, this paper analyzed production failures in distributed systems, looking for "critical" bugs, bugs that either took down the entire cluster or caused data corruption, and found that 58% could be caught with very simple tests. Turns out, generating “shallow” random tests is enough to catch most production bugs. And that's on projects that are unusually serious about testing and static analysis, projects that have much better test coverage than the average project.

A specific example of the effectiveness of naive random testing is the story John Hughes tells in this talk. It starts out when some people came to him with a problem.

We know there is a lurking bug somewhere in the dets code. We have got 'bad object' and 'premature eof' every other month the last year. We have not been able to track the bug down since the dets files is repaired automatically next time it is opened.

Stack: Application on top of Mnesia on top of Dets on top of File system

An application that ran on top of Mnesia, a distributed database, was somehow causing errors a layer below the database. There were some guesses as to the cause. Based on when they'd seen the failures, maybe something to do with rehashing something or other in files that are bigger than 1GB? But after more than a month of effort, no one was really sure what was going on.

In less than a day, with QuickCheck, they found five bugs. After fixing those bugs, they never saw the problem again. Each of the five bugs was reproducible on a database with one record, with at most five function calls. It is very common for bugs that have complex real-world manifestations to be reproducible with really simple test cases, if you know where to look.

In terms of developer time, using some kind of framework that generates random tests is a huge win over manually writing tests in a lot of circumstances, and it's so trivially easy to try out that there's basically no reason not to do it. The ROI of using more advanced techniques may or may not be worth the extra investment to learn how to implement and use them.

While dumb random testing works really well in a lot of cases, it has limits. Not all bugs are shallow. I know of a hardware company that's very good at finding deep bugs by having people with years or decades of domain knowledge write custom test generators, which then run on N-thousand machines. That works pretty well, but it requires a lot of test effort, much more than makes sense for almost any software.

The other option is to build more smarts into the program doing the test generation. There are a ridiculously large number of papers on how to do that, but very few of those papers have turned into practical, robust, software tools. The sort of simple coverage-based test generation used in AFL doesn't have that many papers on it, but it seems to be effective.

Random Test Generation, Coverage Based

If you're using an existing framework, coverage-based testing isn't much harder than using any other sort of random testing. In theory, at least. There are often a lot of knobs you can turn to adjust different settings, as well as other complexity.

If you're writing a framework, there are a lot of decisions. Chief among them are what coverage metric to use and how to use that coverage metric to drive test generation.

For the first choice, which coverage metric, there are coverage metrics that are tractable, but too simplistic, like function coverage, or line coverage (a.k.a. basic block coverage). It's easy to track those, but it's also easy to get 100% coverage while missing very serious bugs. And then there are metrics that are great, but intractable, like state coverage or path coverage. Without some kind of magic to collapse equivalent paths or states together, it's impossible to track those for non-trivial programs.
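As a toy illustration of why the simple metrics are too weak (a contrived sketch, not an example from any coverage literature): the function below can get 100% line coverage from two tests, say f(true, false) and f(false, true), without ever triggering its bug.

// f only divides by zero when both a and b are true, but covering each
// branch separately hits every line without ever taking that path.
func f(a, b bool) int {
	x := 1
	if a {
		x = 0
	}
	if b {
		return 10 / x // panics when a && b
	}
	return x
}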

For now, let's assume we're not going to use magic, and use some kind of approximation instead. Coming up with good approximations that work in practice often takes a lot of trial and error. Luckily, Michal Zalewski has experimented with a wide variety of different strategies for AFL, a testing tool that instruments code with some coverage metrics that allow the tool to generate smart tests.

AFL does the following. Each branch gets something like the following injected, which approximates tracking edges between basic blocks, i.e., which branches are taken and how many times:

cur_location = <UNIQUE_COMPILE_TIME_RANDOM_CONSTANT>;
shared_mem[prev_location ^ cur_location]++;
prev_location = cur_location >> 1;

shared_mem happens to be a 64kB array in AFL, but the size is arbitrary.

The non-lossy version of this would be to have shared_mem be a map of (prev_location, cur_location) -> int, and increment that. That would track how often each edge (prev_location, cur_location) is taken in the basic block graph.

Using a fixed sized array and xor'ing prev_location and cur_location provides lossy compression. To keep from getting too much noise out of trivial changes, for example, running a loop 1200 times vs. 1201 times, AFL only considers a bucket to have changed when it crosses one of the following boundaries: 1, 2, 3, 4, 8, 16, 32, or 128. That's one of the two things that AFL tracks to determine coverage.

The other is a global set of all (prev_location, cur_location) tuples, which makes it easy to quickly determine if a tuple/transition is new.

Roughly speaking, AFL keeps a queue of “interesting” test cases it's found and generates mutations of things in the queue to test. If something changes the coverage stat, it gets added to the queue. There's also some logic to avoid adding test cases that are too slow, and to remove test cases that are relatively uninteresting.
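Conceptually, that loop looks something like the sketch below (in Go; runWithCoverage and mutate are hypothetical stand-ins for AFL's instrumentation and mutation strategies, and hit-count bucketing, timeouts, and queue trimming are all omitted, so this is an illustration of the idea rather than how AFL is actually implemented):

package fuzzsketch

import "math/rand"

// runWithCoverage would run the instrumented target on one input and return
// the set of (prev_location, cur_location) edges that were hit. Hypothetical:
// the real instrumentation is the shared_mem array described above.
func runWithCoverage(input []byte) map[uint32]bool { return nil }

// mutate flips one random bit of a copy of the input. AFL's real mutation
// strategies are much richer (bit flips, arithmetic, splicing, dictionaries).
func mutate(input []byte) []byte {
	out := append([]byte(nil), input...)
	if len(out) > 0 {
		out[rand.Intn(len(out))] ^= 1 << uint(rand.Intn(8))
	}
	return out
}

// fuzz keeps a queue of interesting inputs and mutates them, keeping any
// mutant that exercises an edge we haven't seen before.
func fuzz(seed []byte, iterations int) [][]byte {
	queue := [][]byte{seed}
	seen := map[uint32]bool{} // global set of observed edges
	for i := 0; i < iterations; i++ {
		candidate := mutate(queue[rand.Intn(len(queue))])
		interesting := false
		for edge := range runWithCoverage(candidate) {
			if !seen[edge] {
				seen[edge] = true
				interesting = true
			}
		}
		if interesting {
			queue = append(queue, candidate)
		}
	}
	return queue
}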

AFL is about 13k lines of code, so there's clearly a lot more to it than that, but, conceptually, it's pretty simple. Zalewski explains why he's kept AFL so simple here. His comments are short enough that they're worth reading in their entirety if you're at all interested, but I'll excerpt a few bits anyway.

In the past six years or so, I've also seen a fair number of academic papers that dealt with smart fuzzing (focusing chiefly on symbolic execution) and a couple papers that discussed proof-of-concept application of genetic algorithms. I'm unconvinced how practical most of these experiments were … Effortlessly getting comparable results [from AFL] with state-of-the-art symbolic execution in equally complex software still seems fairly unlikely, and hasn't been demonstrated in practice so far.

Test Generation, Other Smarts

While Zalewski is right that it's hard to write a robust and generalizable tool that uses more intelligence, it's possible to get a lot of mileage out of domain specific tools. For example, BloomUnit, a test framework for distributed systems, helps you test non-deterministic systems by generating a subset of valid orderings, using a SAT solver to avoid generating equivalent re-orderings. The authors don't provide benchmark results the same way Zalewski does with AFL, but even without benchmarks it's at least plausible that a SAT solver can be productively applied to test case generation. If nothing else, distributed system tests are often slow enough that you can do a lot of work without severely impacting test throughput.

Zalewski says “If your instrumentation makes it 10x more likely to find a bug, but runs 100x slower, your users [are] getting a bad deal.“, which is a great point -- gains in test smartness have to be balanced against losses in test throughput, but if you're testing with something like Jepsen, where your program under test actually runs on multiple machines that have to communicate with each other, the test is going to be slow enough that you can spend a lot of computation generating smarter tests before getting a 10x or 100x slowdown.

This same effect makes it difficult to port smart hardware test frameworks to software. It's not unusual for a “short” hardware test to take minutes, and for a long test to take hours or days. As a result, spending a massive amount of computation to generate more efficient tests is worth it, but naively porting a smart hardware test framework1 to software is a recipe for overly clever inefficiency.

Why Not Coverage-Based Unit Testing?

QuickCheck and the tens or hundreds of QuickCheck clones are pretty effective for random unit testing, and AFL is really amazing at coverage-based pseudo-random end-to-end test generation to find crashes and security holes. How come there isn't a tool that does coverage-based unit testing?

I often assume that if there isn't an implementation of a straightforward idea, there must be some reason, like maybe it's much harder than it sounds, but Mindy convinced me that there's often no reason something hasn't been done before, so I tried making the simplest possible toy implementation.

Before I looked at AFL's internals, I created this really dumb function to test. The function takes an array of arbitrary length as input and is supposed to return a non-zero int.

// Checks that a number has its bottom 16 bits set
func some_filter(x int) bool {
	for i := 0; i < 16; i = i + 1 {
		if !(x&1 == 1) {
			return false
		}
		x >>= 1
	}
	return true
}

// Takes an array and returns a non-zero int
func dut(a []int) int {
	if len(a) != 4 {
		return 1
	}

	if some_filter(a[0]) {
		if some_filter(a[1]) {
			if some_filter(a[2]) {
				if some_filter(a[3]) {
					return 0 // A bug! We failed to return non-zero!
				}
				return 2
			}
			return 3
		}
		return 4
	}
	return 5
}

dut stands for device under test, a commonly used term in the hardware world. This code is deliberately contrived to make it easy for a coverage based test generator to make progress. Since the code does as little work as possible per branch and per loop iteration, the coverage metric changes every time we do a bit of additional work2. It turns out that a lot of software acts like this, despite not being deliberately built this way.

Random testing is going to have a hard time finding cases where dut incorrectly returns 0. Even if you set the correct array length, a total of 64 bits have to be set to particular values, so there's a 1 in 2^64 chance of any particular random input hitting the failure.
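As a sanity check of that claim, here's what a purely random property test for dut might look like with Go's testing/quick (a minimal sketch, assuming the dut function above is in the same package):

package main

import (
	"testing"
	"testing/quick"
)

// TestDutNeverReturnsZero will almost certainly pass despite the bug:
// quick's random []int inputs essentially never produce a length-4 slice
// whose elements all have their bottom 16 bits set.
func TestDutNeverReturnsZero(t *testing.T) {
	property := func(a []int) bool {
		return dut(a) != 0
	}
	if err := quick.Check(property, &quick.Config{MaxCount: 1000000}); err != nil {
		t.Error(err)
	}
}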

But a test generator that uses something like AFL's fuzzing algorithm hits this case almost immediately. Turns out, with reasonable initial inputs, it even finds a failing test case before it really does any coverage-guided test generation because the heuristics AFL uses for generating random tests generate an input that covers this case.

That brings up the question of why QuickCheck and most of its clones don't use heuristics to generate random numbers. The QuickCheck paper mentions that it uses random testing because it's nearly as good as partition testing and much easier to implement. That may be true, but it doesn't mean that generating some values using simple heuristics can't generate better results with the same amount of effort. Since Zalewski has already done the work of figuring out, empirically, what heuristics are likely to exercise more code paths, it seems like a waste to ignore that and just generate totally random values.

Whether or not it's worth it to use coverage guided generation is a bit iffier; it doesn't prove anything that a toy coverage-based unit testing prototype can find a bug in a contrived function that's amenable to coverage based testing. But that wasn't the point. The point was to see if there was some huge barrier that should prevent people from doing coverage-driven unit testing. As far as I can tell, there isn't.

It helps that the Go implementation is very well commented and has good facilities for manipulating Go code, which makes it really easy to modify its coverage tools to generate whatever coverage metrics you want, but most languages have some kind of coverage tools that can be hacked up to provide the appropriate coverage metrics, so it shouldn't be too painful for any mature language. And once you've got the coverage numbers, generating coverage-guided tests isn't much harder than generating random QuickCheck-like tests. There are some cases where it's pretty difficult to generate good coverage-guided tests, like when generating functions to test a function that uses higher-order functions, but even in those cases you're no worse off than you would be with a QuickCheck clone3.
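For a rough sense of how little glue that takes in Go, here's a hedged sketch that drives generation off the standard library's testing.Coverage(), which reports statement coverage when tests are run with -cover. Statement coverage is far coarser than AFL's edge-plus-hit-count metric, so it wouldn't actually guide the search well on the contrived dut example above, but it shows the shape of the loop; the mutateInts helper is hypothetical and dut is assumed to be in the same package.

package main

import (
	"math/rand"
	"testing"
)

// mutateInts makes one small random change to a copy of the input slice
// (a deliberately minimal mutation strategy, for illustration only).
func mutateInts(a []int) []int {
	out := append([]int(nil), a...)
	switch rand.Intn(3) {
	case 0: // grow
		out = append(out, rand.Int())
	case 1: // shrink
		if len(out) > 0 {
			out = out[:len(out)-1]
		}
	default: // flip a low bit of a random element
		if len(out) > 0 {
			out[rand.Intn(len(out))] ^= 1 << uint(rand.Intn(16))
		}
	}
	return out
}

// TestDutCoverageGuided keeps a corpus of inputs and adds a mutant to the
// corpus whenever overall statement coverage goes up. Run with `go test -cover`
// so that testing.Coverage() returns meaningful values.
func TestDutCoverageGuided(t *testing.T) {
	corpus := [][]int{{0, 0, 0, 0}}
	best := testing.Coverage()
	for i := 0; i < 100000; i++ {
		input := mutateInts(corpus[rand.Intn(len(corpus))])
		if dut(input) == 0 {
			t.Fatalf("dut returned 0 for %v", input)
		}
		if c := testing.Coverage(); c > best {
			best = c
			corpus = append(corpus, input)
		}
	}
}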

Test Time

It's possible to run software tests much more quickly than hardware tests. One side effect of that is that it's common to see people proclaim that all tests should run in time bound X, and you're doing it wrong if they don't. I've heard various values of X from 100ms to 5 minutes. Regardless of the validity of those kinds of statements, a side effect of that attitude is that people often think that running a test generator for a few hours is A LOT OF TESTING. I overheard one comment about how a particular random test tool had found basically all the bugs it could find because, after a bunch of bug fixes, it had been run for a few hours without finding any additional bugs.

And then you have hardware companies, which will dedicate thousands of machines to generating and running tests. That probably doesn't make sense for a software company, but considering the relative cost of a single machine compared to the cost of a developer, it's almost certainly worth dedicating at least one machine to generating and running tests. And for companies with their own machines, or dedicated cloud instances, generating tests on idle machines is pretty much free.

Attitude

In "Lessons Learned in Software Testing", the authors mention that QA shouldn't be expected to find all bugs and that QA shouldn't have veto power over releases because it's impossible to catch most important bugs, and thinking that QA will do so leads to sloppiness. That's a pretty common attitude on the software teams I've seen. But on hardware teams, it's expected that all “bad” bugs will be caught before the final release and QA will shoot down a release if it's been inadequately tested. Despite that, devs are pretty serious about making things testable by avoiding unnecessary complexity. If a bad bug ever escapes (e.g., the Pentium FDIV bug or the Haswell STM bug), there's a post-mortem to figure out how the test process could have gone so wrong that a significant bug escaped.

It's hard to say how much of the difference in bug count between hardware and software is attitude, and how much is due to the difference in the amount of effort expended on testing, but I think attitude is a significant factor, in addition to the difference in resources.

It affects everything4, down to what level of tests people write. There's a lot of focus on unit testing in software. In hardware, people use the term unit testing, but it usually refers to what would be called an integration test in software. It's considered too hard to thoroughly test every unit; it's much less total effort to test “units” that lie on clean API boundaries (which can be internal or external), so that's where test effort is concentrated.

This also drives test generation. If you accept that bad bugs will occur frequently, manually writing tests is ok. But if your goal is to never release a chip with a bad bug, there's no way to do that when writing tests by hand, so you'll rely on some combination of random testing, manual testing for tricky edge cases, and formal methods. If you then decide that you don't have the resources to avoid bad bugs all the time, and you have to scale things back, you'll be left with the most efficient bug finding methods, which isn't going to leave a lot of room for writing tests by hand.

Conclusion

A lot of projects could benefit from more automated testing. Basically every language has a QuickCheck-like framework available, but most projects that are amenable to QuickCheck still rely on manual tests. For all but the tiniest companies, dedicating at least one machine for that kind of testing is probably worth it.

I think QuickCheck-like frameworks could benefit from using a coverage driven approach. It's certainly easy to implement for functions that take arrays of ints, but that's also pretty much the easiest possible case for something that uses AFL-like test generation (other than, maybe, an array of bytes). It's possible that this is much harder than I think, but if so, I don't see why.

My background is primarily in hardware, so I could be totally wrong! If you have a software testing background, I'd be really interested in hearing what you think. Also, I haven't talked about the vast majority of the topics that testing covers. For example, figuring out what should be tested is really important! So is figuring out where nasty bugs might be hiding, and having a good regression test setup. But those are pretty similar between hardware and software, so there's not much to compare and contrast.

Resources

Brian Marick on code coverage, and how it can be misused.

If a part of your test suite is weak in a way that coverage can detect, it's likely also weak in a way coverage can't detect.

I'm used to bugs being thought of in the same way -- if a test generator takes a month to catch a bug in an area, there are probably other subtle bugs in the same area, and more work needs to be done on the generator to flush them out.

Lessons Learned in Software Testing: A Context-Driven Approach, by Kaner, Bach, & Pettichord. This book is too long to excerpt, but I find it interesting because it reflects a lot of conventional wisdom.

AFL whitepaper, AFL historical notes, and AFL code tarball. All of it is really readable. One of the reasons I spent so much time looking at AFL is because of how nicely documented it is. Another reason is, of course, that it's been very effective at finding bugs on a wide variety of projects.

Update: Dmitry Vyukov's Go-fuzz, which looks like it was started a month after this post was written, uses the approach from the proof of concept in this post of combining the sort of logic seen in AFL with a QuickCheck-like framework, and has been shown to be quite effective. I believe David R. MacIver is also planning to use this approach in the next version of hypothesis.

And here's some testing related stuff of mine: everything is broken, builds are broken, julia is broken, and automated bug finding using analytics.

Terminology

I use the term random testing a lot, in a way that I'm used to using it among hardware folks. I probably mean something broader than what most software folks mean when they say random testing. For example, here's how sqlite describes their testing. There's one section on fuzz (random) testing, but it's much smaller than the sections on, say, I/O error testing or OOM testing. But as a hardware person, I'd also put I/O error testing or OOM testing under random testing because I'd expect to use randomly generated tests to test those.

Acknowledgments

I've gotten great feedback from a lot of software folks! Thanks to Leah Hanson, Mindy Preston, Allison Kaptur, Lindsey Kuper, Jamie Brandon, John Regehr, David Wragg, and Scott Feeney for providing comments/discussion/feedback.


  1. This footnote is a total tangent about a particular hardware test framework! You may want to skip this!

    SixthSense does a really good job of generating smart tests. It takes as input some unit or collection of units (with assertions), some checks on the outputs, and some constraints on the inputs. If you don't give it any constraints, it assumes that any input is legal.

    Then it runs for a while. For units without “too much” state, it will either find a bug or tell you that it formally proved that there are no bugs. For units with “too much” state, it's still pretty good at finding bugs, using some combination of random simulation and exhaustive search.

    Combination of exhaustive search and random execution

    It can issue formal proofs for units with way too much state to brute force. How does it reduce the state space and determine what it's covered?

    I basically don't know. There are at least thirty-seven papers on SixthSense. Apparently, it uses a combination of combinational rewriting, sequential redundancy removal, min-area retiming, sequential rewriting, input reparameterization, localization, target enlargement, state-transition folding, isomorphic property decomposition, unfolding, semi-formal search, symbolic simulation, SAT solving with BDDs, induction, interpolation, etc.

    My understanding is that SixthSense has had a multi-person team working on it for over a decade. Considering the amount of effort IBM puts into finding hardware bugs, investing tens or hundreds of person years to create a tool like SixthSense is an obvious win for them, but it's not really clear that it makes sense for any software company to make the same investment.

    Furthermore, SixthSense is really slow by software test standards. Because of the massive overhead involved in simulating hardware, SixthSense actually runs faster than a lot of simple hardware tests normally would, but running SixthSense on a single unit can easily take longer than it takes to run all of the tests on most software projects.

    [return]
  2. Among other things, it uses nested if statements instead of && because go's coverage tool doesn't create separate coverage points for && and ||. [return]
  3. Ok, you're slightly worse off due to the overhead of generating and looking at coverage stats, but that's pretty small for most non-trivial programs. [return]
  4. This is another long, skippable, footnote. This difference in attitude also changes how people try to write correct software. I've had "testing is hopelessly inadequate….(it) can be used very effectively to show the presence of bugs but never to show their absence." quoted at me tens of times by software folks, along with an argument that we have to reason our way out of having bugs. But the attitude of most hardware folks is that while the back half of that statement is true, testing (and, to some extent, formal verification) is the least bad way to assure yourself that something is probably free of bad bugs.

    This is even true not just on a macro level, but on a micro level. When I interned at Micron in 2003, I worked on flash memory. I read "the green book", and the handful of papers that were new enough that they weren't in the green book. After all that reading, it was pretty obvious that we (humans) didn't understand all of the mechanisms behind the operation and failure modes of flash memory. There were plausible theories about the details of the exact mechanisms, but proving all of them was still an open problem. Even one single bit of flash memory was beyond human understanding. And yet, we still managed to build reliable flash devices, despite building them out of incompletely understood bits, each of which would eventually fail due to some kind of random (in the quantum sense) mechanism.

    It's pretty common for engineering to advance faster than human understanding of the underlying physics. When you work with devices that aren't understood and assemble them to create products that are too complex for any human to understand or for any known technique to formally verify, there's no choice but to rely on testing. With software, people often have the impression that it's possible to avoid relying on testing because it's possible to just understand the whole thing.

    [return]

What happens when you load a URL?

2015-03-07 08:00:00

I've been hearing this question a lot lately, and when I do, it reminds me how much I don't know. Here are some questions this question brings to mind.

  1. How does a keyboard work? Why can’t you press an arbitrary combination of three keys at once, except on fancy gaming keyboards? That implies something about how key presses are detected/encoded.
  2. How are keys debounced? Is there some analog logic, or is there a microcontroller in the keyboard that does this, or what? How do membrane switches work?
  3. How is the OS notified of the keypress? I could probably answer this for a 286, but nowadays it's somehow done through x2APIC, right? How does that work?
  4. Also, USB, PS/2, and AT keyboards are different, somehow? How does USB work? And what about laptop keyboards? Is that just a USB connection?
  5. How does a USB connector work? You have this connection that can handle 10Gb/s. That surely won't work if there's any gap at all between the physical doodads that are being connected. How do people design connectors that can withstand tens of thousands of insertions and still maintain their tolerances?
  6. How does the OS tell the program something happened? How does it know which program to talk to?
  7. How does the browser know to try to load a webpage? I guess it sees an "http://" or just assumes that anything with no prefix is a URL?
  8. Assume we don't have the webpage cached, so we have to do DNS queries and stuff.
  9. How does DNS work? How does DNS caching work? Let's assume it isn't cached at anywhere nearby and we have to go find some far away DNS server.
  10. TCP? We establish a connection? Do we do that for DNS or does it have to be UDP?
  11. How does the OS decide if an outgoing connection should be allowed? What if there's a software firewall? How does that work?
  12. For TCP, without TLS/SSL, we can just do slow-start followed by some standard congestion protocol, right? Is there some deeper complexity there?
  13. One level down, how does a network card work?
  14. For that matter, how does the network card know what to do? Is there a memory region we write to that the network card can see or does it just monitor bus transactions directly?
  15. Ok, say there's a memory region. How does that work? How do we write memory?
  16. Some things happen in the CPU/SoC! This is one of the few areas where I know something, so, I'll skip over that. A signal eventually comes out on some pins. What's that signal? Nowadays, people use DDR3, but we didn't always use that protocol. Presumably DDR3 lets us go faster than DDR2, which was faster than DDR, and so on, but why?
  17. And then the signal eventually goes into a DRAM module. As with the CPU, I'm going to mostly ignore what's going on inside, but I'm curious whether DRAM modules still use either trench capacitors or stacked capacitors, or has this technology moved on?
  18. Going back to our network card, what happens when the signal goes out on the wire? Why do you need a cat5 and not a cat3 cable for 100Mb Ethernet? Is that purely a signal integrity thing or do the cables actually have different wiring?
  19. One level below that, the wires are surely long enough that they can act like transmission lines / waveguides. How is termination handled? Is twisted pair sufficient to prevent inductive coupling or is there more fancy stuff going on?
  20. Say we have a local Ethernet connection to a cable modem. How do cable modems work? Isn't cable somehow multiplexed between different customers? How is it possible to get so much bandwidth through a single coax cable?
  21. Going back up a level, the cable connection eventually gets to the ISP. How does the ISP know where to route things? How does internet routing work? Some bits in the header decide the route? How do routing tables get adjusted?
  22. Also, the 8.8.8.8 DNS stuff is anycast, right? How is that different from routing "normal" traffic? Ditto for anything served from a Cloudflare CDN. What do they need to do to prevent route flapping and other badness?
  23. What makes anycast hard enough to do that very few companies use it?
  24. IIRC, the Stanford/Coursera algorithms course mentioned that it's basically a distributed Bellman-Ford calculation. But what prevents someone from putting bogus routes up?
  25. If we can figure out where to go, our packets go from our ISP through some edge router, some core routers, another edge router, and then go through their network to get into the “meat” of a datacenter.
  26. What's the difference between core and edge routers?
  27. At some point, our connection ends up going into fiber. How does that happen?
  28. There must be some kind of laser. What kind? How is the signal modulated? Is it WDM or TDM? Is it single-mode or multi-mode fiber?
  29. If it's WDM, how is it muxed/demuxed? It would be pretty weird to have a prism in free space, right? This is the kind of thing an AWG could do. Is that what's actually used?
  30. There must be repeaters between links. How do repeaters work? Do they just boost the signal or do they decode it first to avoid propagating noise? If the latter, there must be DCF between repeaters.
  31. Something that just boosts the signal is the simplest case. How does an EDFA work? Is it basically just running current through doped fiber, or is there something deeper going on there?
  32. Below that level, there's the question of how standard single mode fiber and DCF work.
  33. Why do we need DCF, anyway? I guess it's cheaper to have a combination of standard fiber and DCF than to have fiber with very low dispersion. Why is that?
  34. How does fiber even work? I mean, ok, it's probably a waveguide that uses different dielectrics to keep the light contained, but what's the difference between good fiber and bad fiber?
  35. For example, hasn't fiber changed over the past couple decades to severely reduce PMD? How is that possible? Is that just more precise manufacturing, or is there something else involved?
  36. Before PMD became a problem and was solved, there were decades of work that went into increasing fiber bandwidth, vaguely analogous to the way there were decades of work that went into increasing processor performance but also completely different. What was that work and what were the blockers that work was clearing? You'd have to actually know a good deal about fiber engineering to answer this, and I don't.
  37. Going back up a few levels, we go into a datacenter. What's up there? Our packets go through a switching network to a TOR (top-of-rack) switch to a machine? What's a likely switch topology? Facebook's isn't quite something straight out of Dally and Towles, but it's the kind of thing you could imagine building with that kind of knowledge. It hasn't been long enough since FB published their topology for people to copy them, but is the idea obvious enough that you'd expect it to be independently "copied"?
  38. Wait, is that even right? Should we expect a DNS server to sit somewhere in some datacenter?
  39. In any case, after all this, our DNS query resolves to an IP. We establish a connection, and then what?
  40. HTTP GET? How are HTTP 1.0 and 1.1 different? 2.0?
  41. And then we get some files back and the browser has to render them somehow. There's a request for the HTML and also for the CSS and js, and separate requests for images? This must be complicated, since browsers are complicated. I don't have any idea of the complexity of this, so there must be a lot I'm missing.
  42. After the browser renders something, how does it get to the GPU and what does the GPU do?
  43. For 2d graphics, we probably just notify the OS of... something. How does that work?
  44. And how does the OS talk to the GPU? Is there some memory mapped region where you can just paint pixels, or is it more complicated than that?
  45. How does an LCD display work? How does the connection between the monitor and the GPU work?
  46. VGA is probably the simplest possibility. How does that work?
  47. If it's a static site, I guess we're done?
  48. But if the site has ads, isn't that stuff pretty complicated? How do targeted ads and ad auctions work? A bunch of stuff somehow happens in maybe 200ms?

Where can I get answers to this stuff1? That's not a rhetorical question! I'm really interested in hearing about other resources!

Alex Gaynor set up a GitHub repo that attempts to answer this entire question. It answers some of the questions, and has answers to some questions it didn't even occur to me to ask, but it's missing answers to the vast majority of these questions.

For high-level answers, here's Tali Garsiel and Paul Irish on how a browser works and Jessica McKellar on how the Internet works. For how a simple OS does things, Xv6 has good explanations. For how Linux works, Gustavo Duarte has a series of explanations here. For TTYs, this article by Linus Akesson is a nice supplement to Duarte's blog.

One level down from that, James Marshall has a concise explanation of HTTP 1.0 and 1.1, and SANS has an old but readable guide on SSL and TLS. This isn't exactly smooth prose, but this spec for URLs explains in great detail what a URL is.

Going down another level, MS TechNet has an explanation of TCP, which also includes a short explanation of UDP.

One more level down, Kyle Cassidy has a quick primer on Ethernet, Iljitsch van Beijnum has a lengthier explanation with more history, and Matthew J Castelli has an explanation of LAN switches. And then we have DOCSIS and cable modems. This gives a quick sketch of how long haul fiber is set up, but there must be a better explanation out there somewhere. And here's a quick sketch of modern CPUs. For an answer to the keyboard specific questions, Simon Inns explains keypress decoding and why you can't press an arbitrary combination of keys on a keyboard.

Down one more level, this explains how wires work, Richard A. Steenbergen explains fiber, and Pierret explains transistors.

P.S. As an interview question, this is pretty much the antithesis of the tptacek strategy. From what I've seen, my guess is that tptacek-style interviews are much better filters than open ended questions like this.

Thanks to Marek Majkowski, Allison Kaptur, Mindy Preston, Julia Evans, Marie Clemessy, and Gordon P. Hemsley for providing answers and links to resources with answers! Also, thanks to Julia Evans and Sumana Harihareswara for convincing me to turn these questions into a blog post.


  1. I mostly don't have questions about stuff that happens inside a PC listed, but I'm pretty curious about how modern high-speed busses work and how high-speed chips deal with the massive inductance they must have to deal with getting signals to and from the chip. [return]

Goodhearting IQ, cholesterol, and tail latency

2015-03-05 08:00:00

Most real-world problems are big enough that you can't just head for the end goal, you have to break them down into smaller parts and set up intermediate goals. For that matter, most games are that way too. “Win” is too big a goal in chess, so you might have a subgoal like don't get forked. While creating subgoals makes intractable problems tractable, it also creates the problem of determining the relative priority of different subgoals and whether or not a subgoal is relevant to the ultimate goal at all. In chess, there are libraries worth of books written on just that.

And chess is really simple compared to a lot of real world problems. 64 squares. 32 pieces. Pretty much any analog problem you can think of contains more state than chess, and so do a lot of discrete problems. Chess is also relatively simple because you can directly measure whether or not you succeeded (won). Many real-world problems have the additional problem of not being able to measure your goal directly.

IQ & Early Childhood Education

In 1962, what's now known as the Perry Preschool Study started in Ypsilanti, a blue-collar town near Detroit. It was a randomized trial, resulting in students getting either no preschool or two years of free preschool. After two years, students in the preschool group showed a 15 point bump in IQ scores; other early education studies showed similar results.

In the 60s, these promising early results spurred the creation of Head Start, a large scale preschool program designed to help economically disadvantaged children. Initial results from Head Start were also promising; children in the program got a 10 point IQ boost.

The next set of results was disappointing. By age 10, the difference in test scores and IQ between the trial and control groups wasn't statistically significant. The much larger scale Head Start study showed similar results; the authors of the first major analysis of Head Start concluded that

(1) Summer programs are ineffective in producing lasting gains in affective and cognitive development, (2) full-year programs are ineffective in aiding affective development and only marginally effective in producing lasting cognitive gains, (3) all Head Start children are still considerably below national norms on tests of language development and scholastic achievement, while school readiness at grade one approaches the national norm, and (4) parents of Head Start children voiced strong approval of the program. Thus, while full-year Head Start is somewhat superior to summer Head Start, neither could be described as satisfactory.

Education in the U.S. isn't cheap, and these early negative results caused calls for reductions in funding and even the abolishment of the program. Turns out, it's quite difficult to cut funding for a program designed to help disadvantaged children, and the program lives on despite repeated calls to cripple or kill it.

Well after the initial calls to shut down Head Start, long-term results started coming in from the Perry preschool study. As adults, people in the experimental (preschool) group were less likely to have been arrested, less likely to have spent time in prison, and more likely to have graduated from high school. Unfortunately, due to methodological problems in the study design, it's not 100% clear where these effects come from. Although the goal was to do a randomized trial, the experimental design necessitated home visits for the experimental group. As a result, children in the experimental group whose mothers were employed swapped groups with children in the control group whose mothers were unemployed. The positive effects on the preschool group could have been caused by having at-home mothers. Since the Head Start studies weren't randomized and using instrumental variables (IVs) to tease out causation in “natural experiments” didn't become trendy until relatively recently, it took a long time to get plausible causal results from Head Start.

The goal of analyses with an instrumental variable is to extract causation, the same way you'd be able to in a randomized trial. A classic example is determining the effect of putting kids into school a year earlier or later. Some kids naturally start school a year earlier or later, but there are all sorts of factors that can cause that to happen, which means that a correlation between an increased likelihood of playing college sports in kids who started school a year later could just as easily be from the other factors that caused kids to start a year later as it could be from actually starting school a year later.

However, date of birth can be used as an instrumental variable that isn't correlated with those other factors. For each school district, there's an arbitrary cutoff that causes kids on one side of the cutoff to start school a year later than kids on the other side. With the not-unreasonable assumption that being born one day later doesn't cause kids to be better athletes in college, you can see if starting school a year later seems to have a causal effect on the probability of playing sports in college.
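
As a minimal sketch of the mechanics, here's the simplest form of that estimate (the Wald estimator for a binary instrument) run on simulated data. Every number below is made up purely for illustration; nothing comes from the Perry study or Head Start.

# A minimal sketch of the IV idea on made-up data. The instrument is "born
# after the cutoff"; none of these coefficients come from any real study.
import random

random.seed(0)
n = 100_000
rows = []
for _ in range(n):
    after_cutoff = random.random() < 0.5      # instrument: which side of the cutoff
    pushy_parents = random.random() < 0.3     # confounder we can't observe
    # Treatment (starting school a year later) depends on both.
    starts_late = after_cutoff or (pushy_parents and random.random() < 0.5)
    # Outcome depends on the confounder and, weakly (0.1), on the treatment.
    outcome = 0.1 * starts_late + 0.5 * pushy_parents + random.gauss(0, 1)
    rows.append((after_cutoff, starts_late, outcome))

def mean(xs):
    return sum(xs) / len(xs)

# Naive treated-vs-untreated comparison is confounded (comes out around 0.2).
naive = (mean([y for _, d, y in rows if d]) -
         mean([y for _, d, y in rows if not d]))

# Wald/IV estimate: outcome jump across the instrument divided by the
# treatment jump across the instrument. Recovers roughly the true 0.1.
dy = mean([y for z, _, y in rows if z]) - mean([y for z, _, y in rows if not z])
dd = mean([d for z, d, _ in rows if z]) - mean([d for z, d, _ in rows if not z])
print(naive, dy / dd)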

Now, back to Head Start. One IV analysis used a funding discontinuity across counties to generate a quasi experiment. The idea is that there are discrete jumps in the level of Head Start funding across regions that are caused by variations in a continuous variable, which gives you something like a randomized trial. Moving 20 feet across the county line doesn't change much about kids or families, but it moves kids into an area with a significant change in Head Start funding.

The results of other IV analyses on Head Start are similar. Improvements in test scores faded out over time, but there were significant long-term effects on graduation rate (high school and college), crime rate, health outcomes, and other variables that are more important than test scores.

There's no single piece of incredibly convincing evidence. The randomized trial has methodological problems, and IV analyses nearly always leave some lingering questions, but the weight of the evidence indicates that even though scores on standardized tests, including IQ tests, aren't improved by early education programs, people's lives are substantially improved by early education programs. However, if you look at the early commentary on programs like Head Start, there's no acknowledgment that intermediate targets like IQ scores might not perfectly correlate with life outcomes. Instead you see declarations like “poor children have been so badly damaged in infancy by their lower-class environment that Head Start cannot make much difference”.

The funny thing about all this is that it's well known that IQ doesn't correlate perfectly to outcomes. In the range of environments that you see in typical U.S. families, the correlation to outcomes you might actually care about has an r value in the range of .3 to .4. That's incredibly strong for something in the social sciences, but even that incredibly strong statement is a statement that IQ isn't responsible for "most" of the effect on real outcomes, even ignoring possible confounding factors.
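
To spell out why: squaring the correlation gives the share of variance explained, which is small even at the top of that range.

# r in the .3 to .4 range corresponds to r^2 of roughly .09 to .16,
# i.e., well under half of the variance in the outcome.
for r in (0.3, 0.4):
    print(r, round(r * r, 2))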

Cholesterol & Myocardial Infarction

There's a long history of population studies showing a correlation between cholesterol levels and an increased risk of heart attack. A number of early studies found that lifestyle interventions that made cholesterol levels more favorable also decreased heart attack risk. And then statins were invented. Compared to older drugs, statins make cholesterol levels dramatically better and have a large effect on risk of heart attack.

Prior to the invention of statins, the standard intervention was a combination of diet and pre-statin drugs. There's a lot of literature on this; here's one typical review that finds, in randomized trials, a combination of dietary changes and drugs has a modest effect on both cholesterol levels and heart attack risk.

Given that narrative, it certainly sounds reasonable to try to develop new drugs that improve cholesterol levels, but when Pfizer spent $800 million doing exactly that, developing torcetrapib, they found that they created a drug which substantially increased heart attack risk despite improving cholesterol levels. Hoffman-La Roche's attempt fared a bit better because it improved cholesterol without killing anyone, but it still failed to decrease heart attack risk. Merck and Tricor have also had the same problem.

What happened? Some interventions that affected cholesterol levels also affected real health outcomes, prompting people to develop drugs that affect cholesterol. But it turns out that improving cholesterol isn't an inherent good, and like many intermediate targets, it's possible to improve without affecting the end goal.

99%-ile Latency & Latency

It's pretty common to see latency measurements and benchmarks nowadays. It's well understood that poor latency in applications costs you money, as it causes people to stop using the application. It's also well understood that average latency (mean, median, or mode), by itself, isn't a great metric. It's common to use 99%-ile, 99.9%-ile, 99.99%-ile, etc., in order to capture some information about the distribution and make sure that bad cases aren't too bad.

What happens when you use the 99%-iles as intermediate targets? If you require 99%-ile latency to be under 0.5 milliseconds and 99.99%-ile latency to be under 5 milliseconds, you might get a latency distribution that looks something like this.

Latency graph with kinks at 99%-ile, 99.9%-ile, and 99.99%-ile

This is a graph of an actual application that Gil Tene has been showing off in his talks about latency. If you specify goals in terms of 99%-ile, 99.9%-ile, and 99.99%-ile, you'll optimize your system to barely hit those goals. Those optimizations will often push other latencies around, resulting in a funny looking distribution that has kinks at those points, with latency that's often nearly as bad as possible everywhere else.

It's a bit odd, but there's nothing sinister about this. If you try a series of optimizations while doing nothing but looking at three numbers, you'll choose optimizations that improve those three numbers, even if they make the rest of the distribution much worse. In this case, latency rapidly degrades above the 99.99%-ile because the people optimizing literally had no idea how much worse they were making the 99.991%-ile when making changes. It's like the video game solving AI that presses pause before its character is about to get killed, because pausing the game prevents its health from decreasing. If you have very narrow optimization goals, and your measurements don't give you any visibility into anything else, everything but your optimization goals is going to get thrown out the window.

Since the end goal is usually to improve the user experience and not just optimize three specific points on the distribution, targeting a few points instead of using some kind of weighted integral can easily cause anti-optimizations that degrade the actual user experience, while producing great slideware.
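
As a toy illustration (the numbers below are synthetic and aren't from Gil Tene's example or any real system), two distributions can both squeak under a "99%-ile < 0.5 ms, 99.99%-ile < 5 ms" target while having wildly different tails:

# Synthetic illustration: both distributions pass the same point-percentile
# targets, but one falls off a cliff where the targets don't look.
def pct(xs, p):
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(p / 100 * len(xs)))]   # nearest-rank, good enough here

smooth = [0.4] * 99_001 + [0.45] * 900 + [3.0] * 99       # milliseconds, 100k samples
gamed = [0.49] * 99_001 + [4.9] * 990 + [1000.0] * 9      # barely passes, then 1s stalls

for xs in (smooth, gamed):
    print(pct(xs, 99), pct(xs, 99.99), max(xs))
# Both show 99%-ile < 0.5 and 99.99%-ile < 5; the worst case is 3 ms vs a full second.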

In addition to the problem of optimizing just the 99%-ile to the detriment of everything else, there's the question of how to measure the 99%-ile. One method of measuring latency, used by multiple commonly used benchmarking frameworks, is to do something equivalent to

for (int i = 0; i < NUM; ++i) {
  auto a = get_time();
  do_operation();
  auto b = get_time();
  measurements[i] = b - a;
}

If you optimize the 99%-ile of that measurement, you're optimizing the 99%-ile for when all of your users get together and decide to use your app sequentially, coordinating so that no one requests anything until the previous user is finished.

Consider a contrived case where you measure for 20 seconds. For the first 10 seconds, each response takes 1ms. For the 2nd 10 seconds, the system is stalled, so the last request takes 10 seconds, resulting in 10,000 measurements of 1ms and 1 measurement of 10s. With these measurements, the 99%-ile is 1ms, as is the 99.9%-ile, for that matter. Everything looks great!

But if you consider a “real” system where users just submit requests, uniformly at random, the 75%-ile latency should be >= 5 seconds because if any query comes during the 2nd half, it will get jammed up, for an average of 5 seconds and as much as 10 seconds, in addition to whatever queuing happens because requests get stuck behind other requests.
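
Here's a minimal simulation of that contrived case, using the same made-up numbers and ignoring queueing behind other requests:

# Simulating the contrived example: a 20s window, 1ms responses for the
# first 10s, then a 10s stall. Queueing behind other requests is ignored.
import random

def pct(xs, p):
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(p / 100 * len(xs)))]   # nearest-rank percentile

# Back-to-back measurement, like the loop above: 10,000 fast responses, then one 10s one.
coordinated = [0.001] * 10_000 + [10.0]
print(pct(coordinated, 99), pct(coordinated, 99.9))       # both report 1ms

# Users arriving uniformly at random: anyone who shows up during the stall
# waits until it ends at t=20 before getting their 1ms of service.
random.seed(0)
uncoordinated = []
for _ in range(10_000):
    t = random.uniform(0, 20)
    wait = (20 - t) if t > 10 else 0.0
    uncoordinated.append(wait + 0.001)
print(pct(uncoordinated, 75))                             # roughly 5 seconds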

If this example sounds contrived, it is; if you'd prefer a real world example, see this post by Nitsan Wakart, which shows how YCSB (Yahoo Cloud Serving Benchmark) has this problem, and how different the distributions look before and after the fix.

Order of magnitude latency differences between YCSB's measurement and the truth

The red line is YCSB's claimed latency. The blue line is what the latency looks like after Wakart fixed the coordination problem. There's more than an order of magnitude difference between the original YCSB measurement and Wakart's corrected version.

It's important to not only consider the whole distribution, but also to make sure you're measuring a distribution that's relevant. Real users, which can be anything from a human clicking something on a web app to an app that's waiting for an RPC, aren't going to coordinate to make sure they don't submit overlapping requests; they're not even going to obey a uniform random distribution.

Conclusion

This is the point in a blog post where you're supposed to get the one weird trick that solves your problem. But the only trick is that there is no trick, that you have to constantly check that your map is somehow connected to the territory1.

Resources

1990 HHS Report on Head Start. 2012 Review of Evidence on Head Start.

A short article on instrumental variables. A book on econometrics and instrumental variables.

Aysylu Greenberg video on benchmarking pitfalls; it's not latency specific, but it covers a wide variety of common errors. Gil Tene video on latency; covers many more topics than this post. Nitsan Wakart on measuring latency; has code examples and links to libraries.

Acknowledgments

Thanks to Leah Hanson for extensive comments on this, and to Scott Feeney and Kyle Littler for comments that resulted in minor edits.


  1. Unless you're in school and your professor likes to give problems where the answers are nice, simple, numbers, maybe the weird trick is that you know you're off track if you get an intermediate answer with a 170/23 in front of it. [return]

AI doesn't have to be very good to displace humans

2015-02-15 08:00:00

There's an ongoing debate over whether "AI" will ever be good enough to displace humans and, if so, when it will happen. In this debate, the optimists tend to focus on how much AI is improving and the pessimists point to all the ways AI isn't as good as an ideal human being. I think this misses one very important factor, which is that the most obvious jobs that are on the potential chopping block, such as first-line customer service, customer service for industries that are either low margin or don't care about the customer, etc., tend to be filled by apathetic humans in a poorly designed system, and humans aren't even very good at simple tasks they care a lot about. When we're apathetic, we're absolutely terrible; it's not going to take a nearly-omniscient sci-fi level AI to displace people in most customer service jobs.

For example, here's a not-too-atypical customer service interaction I had last week with a human who was significantly worse than a mediocre AI. I scheduled an appointment for an MRI. The MRI is for a jaw problem which makes it painful to talk. I was hoping that the scheduling would be easy, so I wouldn't have to spend a lot of time talking on the phone. But, as is often the case when dealing with bureaucracy, it wasn't easy.

Here are the steps it took.

  1. Have jaw pain.
  2. See dentist. Get referral for MRI when dentist determines that it's likely to be a joint problem.
  3. Dentist gets referral form from UW Health, faxes it to them according to the instructions on the form, and emails me a copy of the referral.
  4. Call UW Health.
  5. UW Health tries to schedule me for an MRI of my pituitary.
  6. Ask them to make sure there isn't an error.
  7. UW Health looks again and realizes that's a referral for something else. They can't find anything for me.
  8. Ask UW Health to call dentist to work it out. UW Health claims they cannot make phone calls.
  9. Talk to dentist again. Ask dentist to fax form again.
  10. Call UW Health again. Ask them to check again.
  11. UW Health says form is illegally filled out.
  12. Ask them to call dentist to work it out, again.
  13. UW Health says that's impossible.
  14. Ask why.
  15. UW Health says, “for legal reasons”.
  16. Realize that's probably a vague and unfounded fear of HIPAA regulations. Try asking again nicely for them to call my dentist, using different phrasing.
  17. UW Health agrees to call dentist. Hangs up.
  18. Look at referral, realize that it's actually impossible for someone outside of UW Health (like my dentist) to fill out the form legally given the instructions on the form.
  19. Talk to dentist again.
  20. Dentist agrees form is impossible, talks to UW Health to figure things out.
  21. Call UW Health to see if they got the form.
  22. UW Health acknowledges receipt of valid referral.
  23. Ask to schedule earliest possible appointment.
  24. UW Health isn't sure they can accept referrals from dentists. Goes to check.
  25. UW Health determines it is possible to accept a referral from a dentist.
  26. UW Health suggests a time on 2/17.
  27. I point out that I probably can't make it because of a conflicting appointment, also with UW Health, which I know about because I can see it on my profile when I log into the UW Health online system.
  28. UW Health suggests a time on 2/18.
  29. I point out another conflict that is in the UW Health system.
  30. UW Health starts looking for times on later dates.
  31. I ask if there are any other times available on 2/17.
  32. UW Health notices that there are other times available on 2/17 and schedules me later on 2/17.

I present this not because it's a bad case, but because it's a representative one1. In this case, my dentist's office was happy to do whatever was necessary to resolve things, but UW Health refused to talk to them without repeated suggestions that talking to my dentist would be the easiest way to resolve things. Even then, I'm not sure it helped much. This isn't even all that bad, since I was able to convince the intransigent party to cooperate. The bad cases are when both parties refuse to talk to each other and both claim that the situation can only be resolved when the other party contacts them, resulting in a deadlock. The good cases are when both parties are willing to talk to each other and work out whatever problems are necessary. Having a non-AI phone tree or web app that exposes simple scheduling would be far superior to the human customer service experience here. An AI chatbot that's a light wrapper around the API a web app would use would be worse than being able to use a normal website, but still better than human customer service. An AI chatbot that's more than just a light wrapper would blow away the humans who do this job for UW Health.

The case against using computers instead of humans is that computers are bad at handling error conditions, can't adapt to unusual situations, and behave according to mechanical rules, which can often generate ridiculous outcomes, but that's precisely the situation we're in right now with humans. It already feels like dealing with a computer program. Not a modern computer program, but a compiler from the 80s that tells you that there's at least one error, with no other diagnostic information.

UW Health sent a form with impossible instructions to my dentist. That's not great, but it's understandable; mistakes happen. However, when they got the form back and it wasn't correctly filled out, instead of contacting my dentist they just threw it away. Just like an 80s compiler. Error! The second time around, they told me that the form was incorrectly filled out. Error! There was a human on the other end who could have noted that the form was impossible to fill out. But like an 80s compiler, they stopped at the first error and gave it no further thought. This eventually got resolved, but the error messages I got along the way were much worse than I'd expect from a modern program. Clang (and even gcc) give me much better error messages than I got here.

Of course, as we saw with healthcare.gov, outsourcing interaction to computers doesn't guarantee good results. There are some claims that market solutions will automatically fix any problem, but those claims don't always work out.

Seeking AdSense Googler. Need AdSense help. My emails remain unanswered. Are you the special Googler who will help?

That's an ad someone was running for a few months on Facebook in order to try to find a human at Google to help them because every conventional technique they had at their disposal failed. Google has perhaps the most advanced ML in the world, they're as market driven as any other public company, and they've mostly tried to automate away service jobs like first-level support because support doesn't scale. As a result, the most reliable methods of getting support at Google are

  1. Be famous enough that a blog post or tweet will get enough attention to garner a response.
  2. Work at Google or know someone who works at Google and is willing to not only file an internal bug, but to drive it to make sure it gets handled.

If you don't have direct access to one of these methods, running an ad is actually a pretty reasonable solution. (1) and (2) don't always work, but they're more effective than not being famous and hoping a blog post will hit HN, or being a paying customer. The point here isn't to rag on Google, it's just that automated customer service solutions aren't infallible, even when you've got an AI that can beat the strongest go player in the world and multiple buildings full of people applying that same technology to practical problems.

While replacing humans with computers doesn't always create a great experience, good computer based systems for things like scheduling and referrals can already be much better than the average human at a bureaucratic institution2. With the right setup, a computer-based system can be better at escalating thorny problems to someone who's capable of solving them than a human-based system. And computers will only get better at this. There will be bugs. And there will be bad systems. But there are already bugs in human systems. And there are already bad human systems.

I'm not sure if, in my lifetime, technology will advance to the point where computers can be as good as helpful humans in a well designed system. But we're already at the point where computers can be as helpful as apathetic humans in a poorly designed system, which describes a significant fraction of service jobs.

2023 update

When ChatGPT was released in 2022, the debate described above in 2015 happened again, with the same arguments on both sides. People are once again saying that AI (this time, ChatGPT and LLMs) can't replace humans because a great human is better than ChatGPT. They'll often pick a couple of examples of ChatGPT saying something extremely silly, "hallucinating", but if you ask a human to explain something, even a world-class expert, they often hallucinate a totally fake explanation as well.

Many people on the pessimist side argued that it would be decades before LLMs could replace humans, for the exact reasons we noted were false in 2015. Everyone made this argument after multiple industries had massive cuts in the number of humans they needed to employ due to pre-LLM "AI" automation, and many of these people even made this argument after companies had already laid people off and replaced people with LLMs. I commented on this at the time, using the same reasoning I used in this 2015 post, before realizing that I'd already written down this line of reasoning in 2015. But, cut me some slack; I'm just a human, not a computer, so I have a fallible memory.

Now that it's been a year since ChatGPT was released, the AI pessimists who argued that it would be a very long time before LLMs displace human jobs have been proven even more wrong by layoff after layoff in which customer service orgs were cut to the bone and mostly replaced by AI. AI customer service seems quite poor, just like human customer service. But human customer service isn't improving, while AI customer service is. For example, here are some recent customer service interactions I had as a result of bringing my car in to get the oil changed, rotate the tires, and do a third thing (long story).

  1. I call OkTire3 and ask if they can do the three things I want with my car
  2. They say yes
  3. I ask if I can just drop by or if I need to make an appointment
  4. They say I can just drop by
  5. I ask if I can talk to the service manager directly to get some more info
  6. After being transferred to the service manager, I describe what I want again and ask when I can come in
  7. They say that will take a lot of time and I'll need to make an appointment. They can get me in next week. If I had listened to the first guy, I would've had a completely pointless one-hour round trip drive since they couldn't, in fact, do the work I wanted as a drop-in
  8. A week later, I bring the car in and talk to someone at the desk, who asks me what I need done
  9. I describe what I need and notice that he only writes down about 1/3 of what I said, so I follow up and
  10. ask what oil they're going to use
  11. The guy says "we'll use the right oil"
  12. I tell him that I want 0W-20 synthetic because my car has a service bulletin indicating that this is recommended, which is different from the label on the car, so could they please note this.
  13. The guy repeats "we'll use the right oil".
  14. (12) again, with slightly different phrasing
  15. (13) again, with slightly different phrasing
  16. (12) again, with slightly different phrasing
  17. The guy says, "it's all in the computer, the computer has the right oil".
  18. I ask him what oil the computer says to use
  19. Annoyed, the guy walks over to the computer and pulls up my car, telling me that my car should use 5W-30
  20. I tell him that's not right for my vehicle due to the service bulletin and I want 0W-20 synthetic
  21. The guy, looking shocked, says "Oh", and then looks at the computer and says "oh, it says we can also use 0W-20"
  22. The guy writes down 0W-20 on the sheet for my car
  23. I leave, expecting that the third thing I asked for won't be done or won't completely be done since it wasn't really written down
  24. The next day, I pick up my car and they fully didn't do the third thing.

Overall, how does an LLM compare? It's probably significantly better than this dude, who acted like an archetypical stoner who doesn't want to be there and doesn't want to do anything, and the LLM will be cheaper as well. However, the LLM will be worse than a web interface that lets me book the exact work I want and write a note to the tech who's doing the work. For better or for worse, I don't think my local tire / oil change place is going to give me a nice web interface that lets me book the exact work I want any time soon, so this guy is going to be replaced by an LLM and not a simple web app.

Elsewhere

Thanks to Leah Hanson and Josiah Irwin for comments/corrections/discussion.


  1. Representative of my experience in Madison, anyway. The absolute worst case of this I encountered in Austin isn't even as bad as the median case I've seen in Madison. YMMV. [return]
  2. I wonder if a deranged version of the law of one price applies, the law of one level of customer service. However good or bad an organization is at customer service, they will create or purchase automated solutions that are equally good or bad.

    At Costco, the checkout clerks move fast and are helpful, so you don't have much reason to use the automated checkout. But then the self-checkout machines tend to be well-designed; they're physically laid out to reduce the time it takes to feed a large volume of stuff through them, and they rarely get confused and deadlock, so there's not much reason not to use them. At a number of other grocery chains, the checkout clerks are apathetic and move slowly, and will make mistakes unless you remind them of what's happening. It makes sense to use self-checkout at those places, except that the self-checkout machines aren't designed particularly well and are often configured so that they often get confused and require intervention from an overloaded checkout clerk.

    The same thing seems to happen with automated phone trees, as well as both of the examples above. Local Health has an online system to automate customer service, but they went with Epic as the provider, and as a result it's even worse than dealing with their phone support. And it's possible to get a human on the line if you're a customer on some Google products, but that human is often no more helpful than the automated system you'd otherwise deal with.

    [return]
  3. BTW, this isn't a knock against OkTire. I used OkTire because they're actually above average! I've also tried the local dealership, which is fine but super expensive, and a widely recommended independent Volvo specialist, Xc Auto, which was much worse — they did sloppy work and missed important issues and were sloppy elsewhere as well; they literally forgot to order parts for the work they were going to do (a mistake an AI probably wouldn't have made), so I had to come back another day to finish the work on my car! [return]

CPU backdoors

2015-02-03 08:00:00

It's generally accepted that any piece of software could be compromised with a backdoor. Prominent examples include the Sony/BMG installer, which had a backdoor built-in to allow Sony to keep users from copying the CD, which also allowed malicious third-parties to take over any machine with the software installed; the Samsung Galaxy, which has a backdoor that allowed the modem to access the device's filesystem, which also allows anyone running a fake base station to access files on the device; Lotus Notes, which had a backdoor which allowed encryption to be defeated; and Lenovo laptops, which pushed all web traffic through a proxy (including HTTPS, via a trusted root certificate) in order to push ads, which allowed anyone with the correct key (which was distributed on every laptop) to intercept HTTPS traffic.

Despite sightings of backdoors in FPGAs and networking gear, whenever someone brings up the possibility of CPU backdoors, it's still common for people to claim that it's impossible. I'm not going to claim that CPU backdoors exist, but I will claim that the implementation is easy, if you've got the right access.

Let's say you wanted to make a backdoor. How would you do it? There are three parts to this: what could a backdoored CPU do, how could the backdoor be accessed, and what kind of compromise would be required to install the backdoor?

Starting with the first item, what does the backdoor do? There are a lot of possibilities. The simplest is to allow privilege escalation: make the CPU transition from ring3 to ring0 or SMM, giving the running process kernel-level privileges. Since it's the CPU that's doing it, this can punch through both hardware and software virtualization. There are a lot of subtler or more invasive things you could do, but privilege escalation is both simple enough and powerful enough that I'm not going to discuss the other options.

Now that you know what you want the backdoor to do, how should it get triggered? Ideally, it will be something that no one will run across by accident, or even by brute force, while looking for backdoors. Even with that limitation, the state space of possible triggers is huge.

Let's look at a particular instruction, fyl2x1. Under normal operation, it takes two floating point registers as input, giving you 2*80=160 bits to hide a trigger in. If you trigger the backdoor off of a specific pair of values, that's probably safe against random discovery. If you're really worried about someone stumbling across the backdoor by accident, or brute forcing a suspected backdoor, you can check more than the two normal input registers (after all, you've got control of the CPU).
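
To put a rough number on "safe against random discovery": a back-of-the-envelope calculation, assuming an attacker who can somehow test a billion candidate input pairs per second (an arbitrary, generous rate), says brute force isn't a concern.

# Back-of-the-envelope: expected time to hit a 160-bit trigger by brute force,
# assuming a made-up, generous rate of a billion candidate input pairs per second.
trials_per_second = 1e9
seconds_per_year = 60 * 60 * 24 * 365
expected_trials = 2 ** 160 / 2            # on average you search half the keyspace
years = expected_trials / trials_per_second / seconds_per_year
print(f"{years:.1e} years")               # on the order of 1e31 years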

This trigger is nice and simple, but the downside is that hitting the trigger probably requires executing native code since you're unlikely to get chrome or Firefox to emit an fyl2x instruction. You could try to work around that by triggering off an instruction you can easily get a JavaScript engine to emit (like an fadd). The problem with that is that if you patch an add instruction and add some checks to it, it will become noticeably slower (although, if you can edit the hardware, you should be able to do it with no overhead). It might be possible to create something hard to detect that's triggerable through JavaScript by patching a rep string instruction and doing some stuff to set up the appropriate “key” followed by a block copy, or maybe idiv. Alternately, if you've managed to get a copy of the design, you can probably figure out a way to use debug logic triggers2 or performance counters to set off a backdoor when some arbitrary JavaScript gets run.

Alright, now you've got a backdoor. How do you insert the backdoor? In software, you'd either edit the source or the binary. In hardware, if you have access to the source, you can edit it as easily as you can in software. The hardware equivalent of recompiling the source, creating physical chips, has tremendously high fixed costs; if you're trying to get your changes into actual chips, you'll want to either compromise the design3 and insert your edits before everything is sent off to get manufactured, or compromise the manufacturing process and sneak in your edits at the last second4.

If that sounds too hard, you could try compromising the patch mechanism. Most modern CPUs come with a built-in patch mechanism to allow bug fixes after the fact. It's likely that the CPU you're using has been patched, possibly from day one, and possibly as part of a firmware update. The details of the patch mechanism for your CPU are a closely guarded secret. It's likely that the CPU has a public key etched into it, and that it will only accept a patch that's been signed by the right private key.

Is this actually happening? I have no idea. Could it be happening? Absolutely. What are the odds? Well, the primary challenge is non-technical, so I'm not the right person to ask about that. If I had to guess, I'd say no, if for no other reason than the ease of subverting other equipment.

I haven't discussed how to make a backdoor that's hard to detect even if someone has access to software you've used to trigger a backdoor. That's harder, but it should be possible once chips start coming with built-in TPMs.

If you liked this post, you'll probably enjoy this post on CPU bugs and might be interested in this post about new CPU features over the past 35 years.

Updates

See this twitter thread for much more discussion, some of which is summarized below.

I'm not going to provide individual attributions because there are too many comments, but here's a summary of comments from @hackerfantastic, Arrigo Triulzi, David Kanter, @solardiz, @4Dgifts, Alfredo Ortega, Marsh Ray, and Russ Cox. Mistakes are my own, of course.

AMD's K7 and K8 had their microcode patch mechanisms compromised, allowing for the sort of attacks mentioned in this post. Turns out, AMD didn't encrypt updates or validate them with a checksum, which lets you easily modify updates until you get one that does what you want.

Here's an example of a backdoor that was created for demonstration purposes, by Alfredo Ortega.

For folks without a hardware background, this talk on how to implement a CPU in VHDL is nice, and it has a section on how to implement a backdoor.

Is it possible to backdoor RDRAND by providing bad random results? Yes. I mentioned that in my first draft of this post, but I got rid of it since my impression was that people don't trust RDRAND and mix the results with other sources of entropy. That doesn't make a backdoor useless, but it significantly reduces the value.

Would it be possible to store and dump AES-NI keys? It's probably infeasible to sneak flash memory onto a chip without anyone noticing, but modern chips have logic analyzer facilities that let you store and dump data. However, access to those is through some secret mechanism and it's not clear how you'd even get access to binaries that would let you reverse engineer their operation. That's in stark contrast to the K8 reverse engineering, which was possible because microcode patches get included in firmware updates.

It would be possible to check instruction prefixes for the trigger. x86 lets you put redundant (and contradictory) instruction prefixes on instructions. Which prefixes get used are well defined, so you can add as many prefixes as you want without causing problems (up to the prefix length limit). The issues with this are that it's probably hard to do without sacrificing performance with a microcode patch, the limited number of prefixes and the length limit mean that your effective key size is relatively small if you don't track state across multiple instructions, and that you can only generate the trigger with native code.

As far as anyone knows, this is all speculative, and no one has seen an actual CPU backdoor being used in the wild.

Acknowledgments

Thanks to Leah Hanson for extensive comments, to Aleksey Shipilev and Joe Wilder for suggestions/corrections, and to the many participants in the twitter discussion linked to above. Also, thanks to Markus Siemens for noticing that a bug in some RSS readers was causing problems, and for providing the workaround. That's not really specific to this post, but it happened to come up here.


  1. This choice of instruction is somewhat, but not completely, arbitrary. You'll probably want an instruction that's both slow and microcoded, to make it easy to patch with a microcode patch without causing a huge performance hit. The rest of this footnote is about what it means for an instruction to be microcoded. It's quite long and not in the critical path of this post, so you might want to skip it.

    The distinction between a microcoded instruction and one that's implemented in hardware is, itself, somewhat arbitrary. CPUs have an instruction set they implement, which you can think of as a public API. Internally, they can execute a different instruction set, which you can think of as a private API.

    On modern Intel chips, instructions that turn into four (or fewer) uops (private API calls) are translated into uops directly by the decoder. Instructions that result in more uops (anywhere from five to hundreds or possibly thousands) are decoded via a microcode engine that reads uops out of a small ROM or RAM on the CPU. Why four and not five? That's a result of some tradeoffs, not some fundamental truth. The terminology for this isn't standardized, but the folks I know would say that an instruction is “microcoded” if its decode is handled by the microcode engine and that it's “implemented in hardware” if its decode is handled by the standard decoder. The microcode engine is sort of its own CPU, since it has to be able to handle things like reading and writing from temporary registers that aren't architecturally visible, reading and writing from internal RAM for instructions that need more than just a few registers of scratch space, conditional microcode branches that change which microcode the microcode engine fetches and decodes, etc.

    Implementation details vary (and tend to be secret). But whatever the implementation, you can think of the microcode engine as something that loads a RAM with microcode when the CPU starts up, which then fetches and decodes microcoded instructions out of that RAM. It's easy to modify what microcode gets executed by changing what gets loaded on boot via a microcode patch.

    For quicker turnaround while debugging, it's somewhere between plausible and likely that Intel also has a mechanism that lets them force non-microcoded instructions to execute out of the microcode RAM in order to allow them to be patched with a microcode patch. But even if that's not the case, compromising the microcode patch mechanism and modifying a single microcoded instruction should be sufficient to install a backdoor.

    [return]
  2. For the most part, these aren't publicly documented, but you can get a high-level overview of what kind of debug triggers Intel was building into their chips a couple generations ago starting at page 128 of Intel Technology Journal, Volume 4, Issue 3. [return]
  3. For the past couple years, there's been a debate over whether or not major corporations have been compromised and whether such a thing is even possible. During the cold war, government agencies on all sides were compromised at various levels for extended periods of time, despite having access to countermeasures not available to any corporations today (not hiring citizens of foreign countries, "enhanced interrogation techniques", etc.). I'm not sure that we'll ever know if companies are being compromised, but it would certainly be easier to compromise a present-day corporation than it was to compromise government agencies during the cold war, and that was eminently doable. Compromising a company enough to get the key to the microcode patch is trivial compared to what was done during the cold war. [return]
  4. This is another really long footnote about minutia! In particular, it's about the manufacturing process. You might want to skip it! If you don't, don't say I didn't warn you.

    It turns out that editing chips before manufacturing is fully complete is relatively easy, by design. To explain why, we'll have to look at how chips are made.

    Cross section of Intel chip, 22nm process

    When you look at a cross-section of a chip, you see that silicon gates are at the bottom, forming logical primitives like nand gates, with a series of metal layers above (labeled M1 through M8), forming wires that connect different gates. A cartoon model of the manufacturing process is that chips are built from the bottom up, one layer a time, where each layer is created by depositing some material and then etching part of it away using a mask, in a process that's analogous to lithographic printing. The non-cartoon version involves a lot of complexity -- Todd Fernendez estimates that it takes about 500 steps to create the layers below “M1”. Additionally, the level of precision needed is high enough that the light used to etch causes enough wear in the equipment that it wears out. You probably don't normally think about lenses wearing out due to light passing through them, but at the level of precision required for each of the hundreds of steps required to make a transistor, it's a serious problem. If that sounds surprising to you, you're not alone. An ITRS roadmap from the 90s predicted that by 2016, we'd be at almost 30GHz (higher is better) on a 9nm process (smaller is better), with chips consuming almost 300 watts. Instead, 5 GHz is considered pretty fast, and anyone who isn't Intel will be lucky to get high-yield production on a 14nm process by the start of 2016. Making chips is harder than anyone guessed it would be.

    A modern chip has enough layers that it takes about three months to make one, from start to finish. This makes bugs very bad news since a bug fix that requires a change to one of the bottom layers takes three months to manufacture. In order to reduce the turnaround time on bug fixes, it's typical to scatter unused logic gates around the silicon, to allow small bug fixes to be done with an edit to a few layers that are near the top. Since chips are made in a manufacturing line process, at any point in time, there are batches of partially complete chips. If you only need to edit one of the top metal layers, you can apply the edit to a partially finished chip, cutting the turnaround time down from months to weeks.

    Since chips are designed to allow easy edits, someone with access to the design before the chip is manufactured (such as the manufacturer) can make major changes with relatively small edits. I suspect that if you were to make this comment to anyone at a major CPU company, they'd tell you it's impossible to do this without them noticing because it would get caught in characterization or when they were trying to find speed paths or something similar. One would hope, but actual hardware devices have shipped with backdoors, and either no one noticed, or they were complicit.

    [return]

Blog monetization

2015-01-24 08:00:00

Does it make sense for me to run ads on my blog? I've been thinking about this lately, since Carbon Ads contacted me about putting an ad up. What are the pros and cons? This isn't a rhetorical question. I'm genuinely interested in what you think.

Pros

Money

Hey, who couldn't use more money? And it's basically free money. Well, except for all of the downsides.

Data

There are lots of studies on the impact of ads on site usage and behavior. But as with any sort of benchmarking, it's not really clear how or if that generalizes to other sites if you don't have a deep understanding of the domain, and I have almost no understanding of the domain. If I run some ads and do some A/B testing, I'll get to see what the effect is on my site, which would be neat.

Cons

Money

It's not enough money to make a living off of, and it's never going to be. When Carbon contacted me, they asked me how much traffic I got in the past 30 days. At the time, Google Analytics showed 118k sessions, 94k users, 143k page views. Cloudflare tends to show about 20% higher traffic since 20% of people block Google Analytics, but those 20% plus more probably block ads, so the "real" numbers aren't helpful here. I told them that, but I also told them that those numbers were pretty unusual and that I'd expect to average much less traffic.

How much money is that worth? I don't know if the CPM (cost per thousand impressions) numbers they gave me are confidential, so I'll just use a current standard figure of $1 CPM. If my traffic continued at that rate, that would be $143/month, or $1,700/year. Ok, that's not too bad.
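
Spelling out the arithmetic, using the $1 CPM placeholder rather than the rates Carbon actually quoted:

# The revenue math above, with the placeholder $1 CPM rather than real quoted rates.
page_views = 143_000                       # page views in that unusually good month
cpm = 1.00                                 # dollars per thousand impressions
monthly = page_views / 1000 * cpm
print(monthly, monthly * 12)               # $143/month, about $1,716/year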

Distribution of traffic on this blog. About 500k hits total.

But let's look at the traffic since I started this blog. I didn't add analytics until after a post of mine got passed around on HN and reddit, so this isn't all of my traffic, but it's close.

For one thing, the 143k hits over a 30-day period seems like a fluke. I've never had a calendar month with that much traffic. I just happen to have a traffic distribution which turned up a bunch of traffic over a specific 30-day period.

Also, if I stop blogging, as I did from April to October, my traffic level drops to pretty much zero. And even if I keep blogging, it's not really clear what my “natural” traffic level is. Is the level before I paused my blogging the normal level or the level after? Either way, $143/month seems like a good guess for an upper bound. I might exceed that, but I doubt it.

For a hard upper bound, let's look at one of the most widely read programming blogs, Coding Horror. Jeff Atwood is nice enough to make his traffic stats available. Thanks Jeff!

Distribution of traffic on Coding Horror. 1.7M hits in a month at its peak.

He got 1.7M hits in his best month, and 1.25M wouldn't be a bad month for him, even when he was blogging regularly. With today's CPM rates, that's $1.7k/month at his peak and $1.25k/month for a normal month.

But Jeff Atwood blogs about general interest programming topics, like Markdown and Ruby and I blog about obscure stuff, like why Intel might want to add new instructions to speed up non-volatile storage with the occasional literature review for variety. There's no way I can get as much traffic as someone who blogs about more general interest topics; I'd be surprised if I could even get within a factor of 2, so $600/month seems like a hard and probably unreachable upper bound for sustainable income.
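
And the same placeholder arithmetic for the upper bound:

# Upper-bound math, still at the placeholder $1 CPM.
atwood_peak = 1_700_000 / 1000 * 1.00      # ~$1,700/month at Coding Horror's peak
atwood_normal = 1_250_000 / 1000 * 1.00    # ~$1,250/month in a normal month for him
print(atwood_normal / 2)                   # ~$600/month if I got within a factor of 2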

That's not bad. After taxes, that would have approximately covered my rent when I lived in Austin, and could have covered rent + utilities and other expenses if I'd had a roommate. But the wildly optimistic case is that you barely cover rent when the programming job market is hot enough that mid-level positions at big companies pay out total compensation that's 8x-9x the median income in the U.S. That's not good.

Worse yet, this is getting worse over time. CPM is down something like 5x since the 90s, and continues to decline. Meanwhile, the percentage of people using ad blockers continues to increase.

Premium ads can get well over an order of magnitude higher CPM and sponsorships can fetch an even better return, so the picture might not be quite as bleak as I'm making it out to be. But to get premium ads you need to appeal to specific advertisers. What advertisers are interested in an audience that's mostly programmers with an interest in low-level shenanigans? I don't know, and I doubt it's worth the effort to find out unless I can get to Jeff Atwood levels of traffic, which I find unlikely.

A Tangent on Alexa Rankings

What's up with Alexa? Why do so many people use it as a gold standard? In theory, it's supposed to show how popular a site was over the past three months. According to Alexa, Coding Horror is ranked at 22k and I'm at 162k. My understanding is that traffic is more than linear in rank so you'd expect Coding Horror to have substantially more than 7x the traffic that I do. But if you compare Jeff's stats to mine over the past three months (Oct 21 - Jan 21), statcounter claims he's had 78k hits compared to my 298k hits. Even if you assume that traffic is merely linear in Alexa rank, that's a 28x difference in relative traffic between the direct measurement and Alexa's estimate.
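
Here's where the 28x figure comes from, reading "linear in rank" as traffic proportional to 1/rank:

# The 28x discrepancy above: predicted ratio from Alexa ranks vs. measured ratio.
alexa_predicted = 162_000 / 22_000         # Coding Horror should have ~7.4x my traffic
measured = 78_000 / 298_000                # measured: ~0.26x, i.e. ~3.8x the other way
print(alexa_predicted / measured)          # ~28x off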

I'm not claiming that my blog is more popular in any meaningful sense -- if Jeff posted as often as I did in the past three months, I'm sure he'd have at least 10x more traffic than me. But given that Jeff now spends most of his time on non-blogging activities and that his traffic is at the level it's at when he rarely blogs, the Alexa ranks for our sites seem way off.

Moreover, the Alexa sub-metrics are inconsistent and nonsensical. Take this graph on the relative proportion of users who use this site from home, school, or work.

Relatively below average, at everything!

It's below average in every category, which should be impossible for a relative ranking like this. But even mathematical impossibility doesn't stop Alexa!

Traffic

Ads reduce traffic. How much depends both on the site and the ads. I might do a literature review some other time, but for now I'm just going to link to this single result by Daniel G. Goldstein, Siddharth Suri, R. Preston McAfee, Matthew Ekstrand-Abueg, and Fernando Diaz that attempts to quantify the cost.

My point isn't that some specific study applies to adding a single ad to my site, but that it's well known that adding ads reduces traffic and has some effect on long-term user behavior, which has some cost.

It's relatively easy to quantify the cost if you're looking at something like the study above, which compares “annoying” ads to “good” ads to see what the cost of the “annoying” ads are. It's harder to quantify for a personal blog where the baseline benefit is non-monetary.

What do I get out of this blog, anyway? The main benefits I can see are that I've met and regularly correspond with some great people I wouldn't have otherwise met, that I often get good feedback on my ideas, and that every once in a while someone pings me about a job that sounds interesting because they saw a relevant post of mine.

I doubt I can effectively estimate the amount of traffic I'll lose, and even if I could, I doubt I could figure out the relationship between that and the value I get out of blogging. My gut says that the value is “a lot” and that the monetary payoff is probably “not a lot”, but it's not clear what that means at the margin.

Incentives

People are influenced by money, even when they don't notice it. I'm people. I might do something to get more revenue, even though the dollar amount is small and I wouldn't consciously spend a lot of effort optimizing things to get an extra $5/month.

What would that mean here? Maybe I'd write more blog posts? When I experimented with blurting out blog posts more frequently, with less editing, I got uniformly positive feedback, so maybe being incentivized to write more wouldn't be so bad. But I always worry about unconscious bias and I wonder what other effects running ads might have on me.

Privacy

Ad networks can track people through ads. My impression is that people are mostly concerned with really big companies that have enough information that they could deanonymize people if they were so inclined, like Google and Facebook, but some people are probably also concerned about smaller ad networks like Carbon. Just as an aside, I'm curious if companies that attempt to do lots of tracking, like Tapad and MediaMath actually have more data on people than better known companies like Yahoo and eBay. I doubt that kind of data is publicly available, though.

Paypal

This is specific to Carbon, but they pay out through PayPal, which is notorious for freezing funds for six months if you get enough money that you'd actually want the money, and for pseudo-randomly draining your bank account due to clerical errors. I've managed to avoid hooking my PayPal account up to my bank account so far, but I'll have to either do that or get money out through an intermediary if I end up making enough money that I want to withdraw it.

Conclusion

Is running ads worth it? I don't know. If I had to guess, I'd say no. I'm going to try it anyway because I'm curious what the data looks like, and I'm not going to get to see any data if I don't try something, but it's not like that data will tell me whether or not it was worth it.

At best, I'll be able to see a difference in click-through rates on my blog with and without ads. This blog mostly spreads through word of mouth, so what I really want to see is the difference in the rate at which the blog gets shared with other people, but I don't see a good way to do that. I could try globally enabling or disabling ads for months at a time, but the variance between months is so high that I don't know that I'd get good data out of that even if I did it for years.

Thanks to Anja Boskovic for comments/corrections/discussion.

Update

After running an ad for a while, it looks like about 40% of my traffic uses an ad blocker (whereas about 17% of my traffic blocks Google Analytics). I'm not sure if I should be surprised that the number is so high or that it's so low. On the one hand, 40% is a lot! On the other hand, despite complaints that ad blockers slow down browsers, my experience has been that web pages load a lot faster when I'm blocking ads using the right ad blocker and I don't see any reason not to use an ad blocker. I'd expect that most of my traffic comes from programmers, who all know that ad blocking is possible.

There's the argument that ad blocking is piracy and/or stealing, but I've never heard a convincing case made. If anything, I think that some of the people who make that argument step over the line, as when ars technica blocked people who used ad blockers, and then backed off and merely exhorted people to disable ad blocking for their site. I think most people would agree that directly exhorting people to click on ads and commit click fraud is unethical; asking people to disable ad blocking is a difference in degree, not in kind. People who use ad blockers are much less likely to click on ads, so having them disable ad blockers to generate impressions that are unlikely to convert strikes me as pretty similar to having people who aren't interested in the product generate clicks.

Anyway, I ended up removing this ad after they stopped sending payments after the first one. AdSense is rumored to wait until just before payment before cutting people off, to get as many impressions as possible for free, but AdSense at least notifies you about it. Carbon just stopped paying without saying anything, while still running the ad. I could probably ask someone at Carbon or BuySellAds about it, but considering how little the ad is worth, it's not really worth the hassle of doing that.

Update 2

It's been almost two years since I said that I'd never get enough traffic for blogging to be able to cover my living expenses. It turns out that's not true! My reasoning was that I mostly tend to blog about low-level technical topics, which can't possibly generate enough traffic to generate "real" ad revenue. That reason is still as valid as ever, but my blogging is now approximately half low-level technical stuff, and half general-interest topics for programmers.

Traffic for one month on this blog in 2016. Roughly 3.1M hits.

Here's a graph of my traffic for the past 30 days (as of October 25th, 2016). Since this is Cloudflare's graph of requests, this would wildly overestimate traffic for most sites, because each image and CSS file is one request. However, since the vast majority of my traffic goes to pages with no external CSS and no images, this is pretty close to my actual level of traffic. 15% of the requests are images, and 10% are RSS (which I won't count because the rate of RSS hits is hard to correlate to the rate of actual people reading). But that means that 75% of the traffic appears to be "real", which puts the traffic into this site at roughly 2.3M hits per month. At a typical $1 ad CPM, that's $2.3k/month, which could cover my share of household expenses.

Additionally, when I look at blogs that really try to monetize their traffic, they tend to monetize at a much better rate. For example, Slate Star Codex charges $1250 for 6 months of ads and appears to be running 8 ads, for a total of $20k/yr. The author claims to get "10,000 to 20,000 impressions per day", or roughly 450k hits per month. I get about 5x that much traffic. If we scale that linearly, that might be $100k/yr instead of $20k/yr. One thing that I find interesting is that the ads on Slate Star Codex don't get blocked by my ad blocker. It seems like that's because the author isn't part of some giant advertising program, and ad blockers don't go out of their way to block every set of single-site custom ads out there. I'm using Slate Star Codex as an example because I think it's not super ad optimized; I doubt I would optimize my ads much if I ran ads, either.

This is getting to the point where it seems a bit unreasonable not to run ads (I doubt the non-direct value I get out of this blog can consistently exceed $100k/yr). I probably "should" run ads, but I don't think the revenue I get from something like AdSense or Carbon is really worth it, and it seems like a hassle to run my own ad program the way Slate Star Codex does. It seems totally irrational to leave $90k/yr on the table because "it seems like a hassle", but here we are. I went back and added affiliate code to all of my Amazon links, but if I'm estimating Amazon's payouts correctly, that will amount to less than $100/month.

I don't think it's necessarily more irrational than behavior I see from other people -- I regularly talk to people who leave $200k/yr or more on the table by working for startups instead of large companies, and that seems like a reasonable preference to me. They make "enough" money and like things the way they are. What's wrong with that? So why can't not running ads be a reasonable preference? It still feels pretty unreasonable to me, though! A few people have suggested crowdfunding, but the top earning programmers have at least an order of magnitude more exposure than I do and make an order of magnitude less than I could on ads (folks like Casey Muratori, ESR, and eevee are pulling in around $1000/month).

Update 3

I'm now trying donations via Patreon. I suspect this won't work, but I'd be happy to be wrong!

What's new in CPUs since the 80s?

2015-01-11 08:00:00

This is a response to the following question from David Albert:

My mental model of CPUs is stuck in the 1980s: basically boxes that do arithmetic, logic, bit twiddling and shifting, and loading and storing things in memory. I'm vaguely aware of various newer developments like vector instructions (SIMD) and the idea that newer CPUs have support for virtualization (though I have no idea what that means in practice).

What cool developments have I been missing? What can today's CPU do that last year's CPU couldn't? How about a CPU from two years ago, five years ago, or ten years ago? The things I'm most interested in are things that programmers have to manually take advantage of (or programming environments have to be redesigned to take advantage of) in order to use and as a result might not be using yet. I think this excludes things like Hyper-threading/SMT, but I'm not honestly sure. I'm also interested in things that CPUs can't do yet but will be able to do in the near future.

Everything below refers to x86 and linux, unless otherwise indicated. History has a tendency to repeat itself, and a lot of things that were new to x86 were old hat to supercomputing, mainframe, and workstation folks.

The Present

Miscellania

For one thing, chips have wider registers and can address more memory. In the 80s, you might have used an 8-bit CPU, but now you almost certainly have a 64-bit CPU in your machine. I'm not going to talk about this too much, since I assume you're familiar with programming a 64-bit machine. In addition to providing more address space, 64-bit mode provides more registers and more consistent floating point results (via the avoidance of pseudo-randomly getting 80-bit precision for 32 and 64 bit operations via x87 floating point). Other things that you're very likely to be using that were introduced to x86 since the early 80s include paging / virtual memory, pipelining, and floating point.

Esoterica

I'm also going to avoid discussing things that are now irrelevant (like A20M) and things that will only affect your life if you're writing drivers, BIOS code, doing security audits, or other unusually low-level stuff (like APIC/x2APIC, SMM, NX, or SGX).

Memory / Caches

Of the remaining topics, the one that's most likely to have a real effect on day-to-day programming is how memory works. My first computer was a 286. On that machine, a memory access might take a few cycles. A few years back, I used a Pentium 4 system where a memory access took more than 400 cycles. Processors have sped up a lot more than memory. The solution to the problem of having relatively slow memory has been to add caching, which provides fast access to frequently used data, and prefetching, which preloads data into caches if the access pattern is predictable.

A few cycles vs. 400+ cycles sounds really bad; that's well over 100x slower. But if I write a dumb loop that reads and operates on a large block of 64-bit (8-byte) values, the CPU is smart enough to prefetch the correct data before I need it, which lets me process at about 22 GB/s on my 3GHz processor. A calculation that can consume 8 bytes every cycle at 3GHz only works out to 24GB/s, so getting 22GB/s isn't so bad. We're losing something like 8% performance by having to go to main memory, not 100x.

As a first-order approximation, using predictable memory access patterns and operating on chunks of data that are smaller than your CPU cache will get you most of the benefit of modern caches. If you want to squeeze out as much performance as possible, this document is a good starting point. After digesting that 100 page PDF, you'll want to familiarize yourself with the microarchitecture and memory subsystem of the system you're optimizing for, and learn how to profile the performance of your application with something like likwid.

TLBs

There are lots of little caches on the chip for all sorts of things, not just main memory. You don't need to know about the decoded instruction cache and other funny little caches unless you're really going all out on micro-optimizations. The big exception is the TLBs, which are caches for virtual memory lookups (done via a 4-level page table structure on x86). Even if the page tables were in the l1-data cache, that would be 4 cycles per lookup, or 16 cycles to do an entire virtual address lookup each time around. That's totally unacceptable for something that's required for all user-mode memory accesses, so there are small, fast, caches for virtual address lookups.

Because the first level TLB cache has to be fast, it's severely limited in size (perhaps 64 entries on a modern chip). If you use 4k pages, that limits the amount of memory you can address without incurring a TLB miss. x86 also supports 2MB and 1GB pages; some applications will benefit a lot from using larger page sizes. It's something worth looking into if you've got a long-running application that uses a lot of memory.

Also, first-level caches are usually limited by the page size times the associativity of the cache. If the cache is smaller than that, the bits used to index into the cache are the same regardless of whether you're looking at the virtual address or the physical address, so you don't have to do a virtual to physical translation before indexing into the cache. If the cache is larger than that, you have to first do a TLB lookup to index into the cache (which will cost at least one extra cycle), or build a virtually indexed cache (which is possible, but adds complexity and coupling to software). You can see this limit in modern chips. Haswell has an 8-way associative cache and 4kB pages. Its l1 data cache is 8 * 4kB = 32kB.

Out of Order Execution / Serialization

For a couple decades now, x86 chips have been able to speculatively execute and re-order execution (to avoid blocking on a single stalled resource). This sometimes results in odd performance hiccups. But x86 is pretty strict in requiring that, for a single CPU, externally visible state, like registers and memory, must be updated as if everything were executed in order. The implementation of this involves making sure that, for any pair of instructions with a dependency, those instructions execute in the correct order with respect to each other.

That restriction that things look like they executed in order means that, for the most part, you can ignore the existence of OoO execution unless you're trying to eke out the best possible performance. The major exceptions are when you need to make sure something not only looks like it executed in order externally, but actually executed in order internally.

An example of when you might care would be if you're trying to measure the execution time of a sequence of instructions using rdtsc. rdtsc reads a hidden internal counter and puts the result into edx and eax, externally visible registers.

Say we do something like

foo
rdtsc
bar
mov %eax, [%ebx]
baz

where foo, bar, and baz don't touch eax, edx, or [%ebx]. The mov that follows the rdtsc will write the value of eax to some location in memory, and because eax is an externally visible register, the CPU will guarantee that the mov doesn't execute until after rdtsc has executed, so that everything looks like it happened in order.

However, since there isn't an explicit dependency between the rdtsc and either foo or bar, the rdtsc could execute before foo, between foo and bar, or after bar. It could even be the case that baz executes before the rdtsc, as long as baz doesn't affect the move instruction in any way. There are some circumstances where that would be fine, but it's not fine if the rdtsc is there to measure the execution time of foo.

To precisely order the rdtsc with respect to other instructions, we need an instruction that serializes execution. Precise details on how exactly to do that are provided in this document by Intel.
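
As a concrete sketch (not the exact sequence Intel recommends, which also covers the end-of-measurement read with rdtscp), the classic pattern is to issue cpuid, a serializing instruction, immediately before rdtsc. The function name and structure below are mine:

#include <stdint.h>

// cpuid forces all earlier instructions to complete before rdtsc reads the
// counter, so the measurement can't start before the code being measured.
static inline uint64_t serialized_rdtsc(void) {
  uint32_t lo, hi;
  __asm__ volatile("cpuid\n\t"          // serialize
                   "rdtsc\n\t"          // timestamp into edx:eax
                   : "=a"(lo), "=d"(hi)
                   : "0"(0)             // cpuid leaf 0
                   : "ebx", "ecx");
  return ((uint64_t)hi << 32) | lo;
}

// usage sketch: start = serialized_rdtsc(); foo(); end = serialized_rdtsc();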

Memory / Concurrency

In addition to the ordering restrictions above, which imply that loads and stores to the same location can't be reordered with respect to each other, x86 loads and stores have some other restrictions. In particular, for a single CPU, stores are never reordered with other stores, and stores are never reordered with earlier loads, regardless of whether or not they're to the same location.

However, loads can be reordered with earlier stores. For example, if you write

mov 1, [%esp]
mov [%ebx], %eax

it can be executed as if you wrote

mov [%ebx], %eax
mov 1, [%esp]

But the converse isn't true — if you write the latter, it can never be executed as if you wrote the former.

You could force the first example to execute as written by inserting a serializing instruction, but that's slow, since it effectively forces the CPU to wait until all instructions before the serializing instruction are done before executing anything after it. There's also an mfence instruction that only serializes loads and stores, if you only care about load/store ordering.

I'm not going to discuss the other memory fences, lfence and sfence, but you can read more about them here.
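
For reference, a full fence is just the mfence instruction; here's a minimal sketch as an inline-asm helper (the name is mine; C++11's std::atomic_thread_fence(std::memory_order_seq_cst) typically compiles to something equivalent on x86):

static inline void full_fence(void) {
  // order all earlier loads and stores on this core before all later ones;
  // the "memory" clobber also stops the compiler from reordering across it
  __asm__ volatile("mfence" ::: "memory");
}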

We've looked at single core ordering, where loads and stores are mostly ordered, but there's also multi-core ordering. The above restrictions all apply; if core0 is observing core1, it will see that all of the single core rules apply to core1's loads and stores. However, if core0 and core1 interact, there's no guarantee that their interaction is ordered.

For example, say that core0 and core1 start with eax and edx set to 0, and core0 executes

mov 1, [_foo]
mov [_foo], %eax
mov [_bar], %edx

while core1 executes

mov 1, [_bar]
mov [_bar], %eax
mov [_foo], %edx

For both cores, eax has to be 1 because of the within-core dependency between the first instruction and the second instruction. However, it's possible for edx to be 0 in both cores because line 3 of core0 can execute before core0 sees anything from core1, and vice versa.

That covers memory barriers, which serialize memory accesses within a core. Since stores are required to be seen in a consistent order across cores, barriers also have an effect on cross-core concurrency, but it's pretty difficult to reason about that kind of thing correctly. Linus has this to say on using memory barriers instead of locking:

The real cost of not locking also often ends up being the inevitable bugs. Doing clever things with memory barriers is almost always a bug waiting to happen. It's just really hard to wrap your head around all the things that can happen on ten different architectures with different memory ordering, and a single missing barrier. … The fact is, any time anybody makes up a new locking mechanism, THEY ALWAYS GET IT WRONG. Don't do it.

And it turns out that on modern x86 CPUs, using locking to implement concurrency primitives is often cheaper than using memory barriers, so let's look at locks.

If we set _foo to 0 and have two threads that both execute incl (_foo) 10000 times each, incrementing the same location with a single instruction 20000 times in total, the final value of _foo is guaranteed not to exceed 20000, but it could (theoretically) be as low as 2. If it's not obvious why the theoretical minimum is 2 and not 10000, figuring that out is a good exercise. If it is obvious, my bonus exercise for you is, can any reasonable CPU implementation get that result, or is that some silly thing the spec allows that will never happen? There isn't enough information in this post to answer the bonus question, but I believe I've linked to enough information.

We can try this with a simple code snippet

#include <stdio.h>   // for printf
#include <stdlib.h>
#include <thread>

#define NUM_ITERS 10000
#define NUM_THREADS 2

int counter = 0;
int *p_counter = &counter;

void asm_inc() {
  int *p_counter = &counter;
  for (int i = 0; i < NUM_ITERS; ++i) {
    __asm__("incl (%0) \n\t" : : "r" (p_counter) : "memory");  // "memory" clobber: tell the compiler this asm writes memory
  }
}

int main () {
  std::thread t[NUM_THREADS];
  for (int i = 0; i < NUM_THREADS; ++i) {
    t[i] = std::thread(asm_inc);
  }
  for (int i = 0; i < NUM_THREADS; ++i) {
    t[i].join();
  }
  printf("Counter value: %i\n", counter);
  return 0;
}

Compiling the above with clang++ -std=c++11 -pthread, I get the following distribution of results on two of my machines:

Different distributions of non-determinism on Haswell and Sandy Bridge

Not only do the results vary between runs, the distribution of results is different on different machines. We never hit the theoretical minimum of 2, or for that matter, anything below 10000, but there's some chance of getting a final result anywhere between 10000 and 20000.

Even though incl is a single instruction, it's not guaranteed to be atomic. Internally, incl is implemented as a load followed by an add followed by a store. It's possible for an increment on cpu0 to sneak in and execute between the load and the store on cpu1 and vice versa.

The solution Intel has for this is the lock prefix, which can be added to a handful of instructions to make them atomic. If we take the above code and turn incl into lock incl, the resulting output is always 20000.
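
For reference, the asm change really is just the prefix: lock incl (%0) instead of incl (%0). Here's a sketch of the portable C++11 equivalent, reusing NUM_ITERS from the snippet above (std::atomic's fetch_add compiles down to a lock-prefixed read-modify-write on x86):

#include <atomic>

std::atomic<int> atomic_counter{0};

void atomic_inc() {
  for (int i = 0; i < NUM_ITERS; ++i) {
    // atomic read-modify-write; the final value is always NUM_THREADS * NUM_ITERS
    atomic_counter.fetch_add(1, std::memory_order_relaxed);
  }
}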

So, that's how we make a single instruction atomic. To make a sequence of operations atomic, we can use compare-and-swap primitives: xchg (which is implicitly locked when it operates on memory) or lock cmpxchg. I won't go into detail about how that works, but see this article by David Dalrymple if you're curious.
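
As a sketch of what a compare-and-swap loop looks like in practice, here's an atomic add built out of the GCC/Clang __atomic builtins, which compile down to lock cmpxchg on x86; the function name is mine:

// Atomically add delta to *p using a compare-and-swap loop.
int atomic_add_cas(int *p, int delta) {
  int old = __atomic_load_n(p, __ATOMIC_RELAXED);
  // If *p changed between the load and the compare-and-swap, `old` is updated
  // with the current value and we retry.
  while (!__atomic_compare_exchange_n(p, &old, old + delta,
                                      /*weak=*/false,
                                      __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) {
  }
  return old + delta;
}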

In addition to making a memory transaction atomic, locks are globally ordered with respect to each other, and loads and stores aren't re-ordered with respect to locks.

For a rigorous model of memory ordering, see the x86 TSO doc.

All of this discussion has been about how concurrency works in hardware. Although there are limitations on what x86 will re-order, compilers don't necessarily have those same limitations. In C or C++, you'll need to insert the appropriate primitives to make sure the compiler doesn't re-order anything. As Linus points out here, if you have code like

local_cpu_lock = 1;
// .. do something critical ..
local_cpu_lock = 0;

the compiler has no idea that local_cpu_lock = 0 can't be pushed into the middle of the critical section. Compiler barriers are distinct from CPU memory barriers. Since the x86 memory model is relatively strict, some compiler barriers are no-ops at the hardware level; they exist purely to tell the compiler not to re-order things. If you're using a language that's higher level than microcode, assembly, C, or C++, your compiler probably handles this for you without any kind of annotation.
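
As a sketch, on GCC and Clang the classic compiler-only barrier is an empty asm statement with a "memory" clobber (this is essentially the Linux kernel's barrier() macro); it emits no instructions but stops the compiler from moving memory accesses across it:

#define barrier() __asm__ __volatile__("" ::: "memory")  // compiler barrier: no code emitted

local_cpu_lock = 1;
barrier();   // the store above can't be sunk into the critical section by the compiler
// .. do something critical ..
barrier();   // and this store can't be hoisted into it
local_cpu_lock = 0;

In portable C++11 code you'd normally reach for std::atomic (or std::atomic_signal_fence for a pure compiler barrier) rather than rolling this by hand.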

Memory / Porting

If you're porting code to other architectures, it's important to note that x86 has one of the strongest memory models of any architecture you're likely to encounter nowadays. If you write code that happens to work on x86 without thinking it through and port it to architectures that have weaker guarantees (PPC, ARM, or Alpha), you'll almost certainly have bugs.

Consider this example:

Initial
-----
x = 1;
y = 0;
p = &x;

CPU1         CPU2
----         ----
i = *p;      y = 1;
             MB;
             p = &y;

MB is a memory barrier. On an Alpha 21264 system, this can result in i = 0.

Kourosh Gharachorloo explains how:

CPU2 does y=1 which causes an "invalidate y" to be sent to CPU1. This invalidate goes into the incoming "probe queue" of CPU1; as you will see, the problem arises because this invalidate could theoretically sit in the probe queue without doing an MB on CPU1. The invalidate is acknowledged right away at this point (i.e., you don't wait for it to actually invalidate the copy in CPU1's cache before sending the acknowledgment). Therefore, CPU2 can go through its MB. And it proceeds to do the write to p. Now CPU1 proceeds to read p. The reply for read p is allowed to bypass the probe queue on CPU1 on its incoming path (this allows replies/data to get back to the 21264 quickly without needing to wait for previous incoming probes to be serviced). Now, CPU1 can derefence p to read the old value of y that is sitting in its cache (the invalidate y in CPU1's probe queue is still sitting there).

How does an MB on CPU1 fix this? The 21264 flushes its incoming probe queue (i.e., services any pending messages in there) at every MB. Hence, after the read of p, you do an MB which pulls in the invalidate to y for sure. And you can no longer see the old cached value for y.

Even though the above scenario is theoretically possible, the chances of observing a problem due to it are extremely minute. The reason is that even if you setup the caching properly, CPU1 will likely have ample opportunity to service the messages (i.e., invalidate) in its probe queue before it receives the data reply for "read p". Nonetheless, if you get into a situation where you have placed many things in CPU1's probe queue ahead of the invalidate to y, then it is possible that the reply to p comes back and bypasses this invalidate. It would be difficult for you to set up the scenario though and actually observe the anomaly.

This is long enough without my talking about other architectures so I won't go into detail, but if you're wondering why anyone would create a spec that allows this kind of optimization, consider that before rising fab costs crushed DEC, their chips were so fast that they could run industry standard x86 benchmarks of real workloads in emulation faster than x86 chips could run the same benchmarks natively. For more explanation of why the most RISC-y architecture of the time made the decisions it did, see this paper on the motivations behind the Alpha architecture.

BTW, this is a major reason I'm skeptical of the Mill architecture. Putting aside arguments about whether or not they'll live up to their performance claims, being technically excellent isn't, in and of itself, a business model.

Memory / Non-Temporal Stores / Write-Combine Memory

The set of restrictions outlined in the previous section apply to cacheable (i.e., “write-back” or WB) memory. That, itself, was new at one time. Before that, there was only uncacheable (UC) memory.

One of the interesting things about UC memory is that all loads and stores are expected to go out to the bus. That's perfectly reasonable in a processor with no cache and little to no on-board buffering. A result of that is that devices that have access to memory can rely on all accesses to UC memory regions creating separate bus transactions, in order (because some devices will use a memory read or write as a trigger to do something). That worked great in 1982, but it's not so great if you have a video card that just wants to snarf down whatever the latest update is. If multiple writes happen to the same UC location (or different bytes of the same word), the CPU is required to issue a separate bus transaction for each write, even though a video card doesn't really care about seeing each intervening result.

The solution to that was to create a memory type called write combine (WC). WC is a kind of eventually consistent UC. Writes have to eventually make it to memory, but they can be buffered internally. WC memory also has weaker ordering guarantees than UC.

For the most part, you don't have to deal with this unless you're talking directly with devices. The one exception is “non-temporal” load and store operations. These make particular loads and stores act like they're to WC memory, even if the address is in a memory region that's marked WB.

This is useful if you don't want to pollute your caches with something, e.g., if you're doing some kind of streaming calculation where you know you're not going to use a particular piece of data more than once.
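
As a sketch, here's what that looks like with the SSE2 non-temporal store intrinsic; the function is a made-up example and assumes 16-byte-aligned buffers with a size that's a multiple of 16:

#include <emmintrin.h>  // SSE2 intrinsics
#include <stddef.h>

// Copy a large buffer without pulling the destination into the caches.
void stream_copy(char *dst, const char *src, size_t n) {
  for (size_t i = 0; i < n; i += 16) {
    __m128i v = _mm_load_si128((const __m128i *)(src + i));
    _mm_stream_si128((__m128i *)(dst + i), v);  // non-temporal (write-combining) store
  }
  _mm_sfence();  // make the buffered stores globally visible before returning
}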

Memory / NUMA

Non-uniform memory access, where memory latencies and bandwidth are different for different processors, is now so common that we mostly don't even talk about NUMA or ccNUMA anymore; it's assumed to be the default.

The takeaway here is that threads that share memory should be on the same socket, and a memory-mapped I/O heavy thread should make sure it's on the socket that's closest to the I/O device it's talking to.
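
Here's a minimal Linux sketch of the mechanism (choosing which CPU numbers live on which socket is system-specific and left out; /sys/devices/system/node/ or libnuma will tell you):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one CPU so that it, and the memory it first
// touches, stays on a single socket.
int pin_to_cpu(int cpu) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}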

I've mostly avoided explaining the why behind things because that would make this post at least an order of magnitude longer than it's going to be. But I'll give a vastly oversimplified explanation of why we have NUMA systems, partially because it's a self-contained thing that's relatively easy to explain and partially to demonstrate how long the why is compared to the what.

Once upon a time, there was just memory. Then CPUs got fast enough relative to memory that people wanted to add a cache. It's bad news if the cache is inconsistent with the backing store (memory), so the cache has to keep some information about what it's holding on to so it knows if/when it needs to write things to the backing store.

That's not too bad, but once you get 2 cores with their own caches, it gets a little more complicated. To maintain the same programming model as the no-cache case, the caches have to be consistent with each other and with the backing store. Because existing load/store instructions have nothing in their API that allows them to say sorry! this load failed because some other CPU is holding onto the address you want, the simplest thing was to have every CPU send a message out onto the bus every time it wanted to load or store something. We've already got this memory bus that both CPUs are connected to, so we just require that other CPUs respond with the data (and invalidate the appropriate cache line) if they have a modified version of the data in their cache.

That works ok. Most of the time, each CPU only touches data the other CPU doesn't care about, so there's some wasted bus traffic. But it's not too bad because once a CPU puts out a message saying Hi! I'm going to take this address and modify the data, it can assume it completely owns that address until some other CPU asks for it, which probably won't happen. And instead of doing things on a single memory address, we can operate on cache lines that have, say, 64 bytes. So, the overall overhead is pretty low.

It still works ok for 4 CPUs, although the overhead is a bit worse. But this scheme, where each CPU has to respond to every other CPU's memory traffic, fails to scale much beyond 4 CPUs, both because the bus gets saturated and because the caches will get saturated (the physical size/cost of a cache is O(n^2) in the number of simultaneous reads and writes supported, and the speed is inversely correlated to the size).

A “simple” solution to this problem is to have a single centralized directory that keeps track of all the information, instead of doing N-way peer-to-peer broadcast. Since we're packing 2-16 cores on a chip now anyway, it's pretty natural to have a single directory per chip (socket) that tracks the state of the caches for every core on a chip.

This only solves the problem for each chip, and we need some way for the chips to talk to each other. Unfortunately, while we were scaling these systems up, bus speeds got fast enough that it's really difficult to drive a signal far enough to connect up a bunch of chips and memory all on one bus, even for small systems. The simplest solution to that is to have each socket own a region of memory, so every socket doesn't need to be connected to every part of memory. This also avoids the complexity of needing a higher level directory of directories, since it's clear which directory owns any particular piece of memory.

The disadvantage of this is that if you're sitting in one socket and want some memory owned by another socket, you have a significant performance penalty. For simplicity, most “small” (< 128 core) systems use ring-like busses, so the performance penalty isn't just the direct latency/bandwidth penalty you pay for walking through a bunch of extra hops to get to memory, it also uses up a finite resource (the ring-like bus) and slows down other cross-socket accesses.

In theory, the OS handles this transparently, but it's often inefficient.

Context Switches / Syscalls

Here, syscall refers to a linux system call, not the SYSCALL or SYSENTER x86 instructions.

A side effect of all the caching that modern cores have is that context switches are expensive, which causes syscalls to be expensive. Livio Soares and Michael Stumm discuss the cost in great detail in their paper. I'm going to use a few of their figures, below. Here's a graph of how many instructions per clock (IPC) a Core i7 achieves on Xalan, a sub-benchmark from SPEC CPU.

Long tail of overhead from a syscall. 14,000 cycles.

14,000 cycles after a syscall, code is still not quite running at full speed.

Here's a table of the footprint of a few different syscalls, both the direct cost (in instructions and cycles), and the indirect cost (from the number of cache and TLB evictions).

Cost of stat, pread, pwrite, open+close, mmap+munmap, and open+write+close

Some of these syscalls cause 40+ TLB evictions! For a chip with a 64-entry d-TLB, that nearly wipes out the TLB. The cache evictions aren't free, either.

The high cost of syscalls is the reason people have switched to using batched versions of syscalls for high-performance code (e.g., epoll, or recvmmsg) and the reason that people who need very high performance I/O often use user space I/O stacks. More generally, the cost of context switches is why high-performance code is often thread-per-core (or even single threaded on a pinned thread) and not thread-per-logical-task.

This high cost was also the driver behind vDSO, which turns some simple syscalls that don't require any kind of privilege escalation into simple user space library calls.
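
For example, on typical Linux systems clock_gettime is resolved through the vDSO, so a call like the one below is just a library call; running the program under strace usually shows no corresponding syscall:

#include <stdio.h>
#include <time.h>

int main() {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);  // usually served from the vDSO, no kernel entry
  printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
  return 0;
}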

SIMD

Basically all modern x86 CPUs support SSE, 128-bit wide vector registers and instructions. Since it's common to want to do the same operation multiple times, Intel added instructions that will let you operate on a 128-bit chunk of data as 2 64-bit chunks, 4 32-bit chunks, 8 16-bit chunks, etc. ARM supports the same thing with a different name (NEON), and the instructions supported are pretty similar.

It's pretty common to get a 2x-4x speedup from using SIMD instructions; it's definitely worth looking into if you've got a computationally heavy workload.

Modern compilers are good enough at recognizing common patterns that can be vectorized that simple code, like the following, will automatically use vector instructions:

for (int i = 0; i < n; ++i) {
  sum += a[i];
}

But compilers will often produce non-optimal code if you don't write the assembly by hand, especially for SIMD code, so you'll want to look at the disassembly and check for compiler optimization bugs if you really care about getting the best possible performance.
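
For comparison, here's a hand-written SSE2 version of the sum above using intrinsics. This is just a sketch (it assumes 32-bit ints), and whether it actually beats the autovectorized version depends on the compiler and the data:

#include <emmintrin.h>  // SSE2

int simd_sum(const int *a, int n) {
  __m128i acc = _mm_setzero_si128();
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    // add four 32-bit ints per iteration
    acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)(a + i)));
  }
  int lanes[4];
  _mm_storeu_si128((__m128i *)lanes, acc);
  int sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
  for (; i < n; ++i) {  // scalar tail for leftover elements
    sum += a[i];
  }
  return sum;
}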

Power Management

There are a lot of fancy power management features on modern CPUs that optimize power usage in different scenarios. The result of these is that “race to idle” (completing work as fast as possible and then letting the CPU go back to sleep) is the most power efficient way to work.

There's been a lot of work that's shown that specific microoptimizations can benefit power consumption, but applying those microoptimizations on real workloads often results in smaller than expected benefits.

GPU / GPGPU

I'm even less qualified to talk about this than I am about the rest of this stuff. Luckily, Cliff Burdick volunteered to write a section on GPUs, so here it is.

Prior to the mid-2000s, Graphics Processing Units (GPUs) were restricted to an API that allowed only a very limited amount of control of the hardware. As the libraries became more flexible, programmers began using the processors for more general-purpose tasks, such as linear algebra routines. The parallel architecture of the GPU could work on large chunks of a matrix by launching hundreds of simultaneous threads. However, the code had to use traditional graphics APIs and was still limited in how much of the hardware it could control. Nvidia and ATI took notice and released frameworks that allowed the user to access more of the hardware with an API familiar to people outside of the graphics industry. The libraries gained popularity, and today GPUs are widely used for high-performance computing (HPC) alongside CPUs.

Compared to CPUs, the hardware on GPUs has a few major differences, outlined below:

Processors

At the top level, a GPU processor contains one or many streaming multiprocessors (SMs). Each streaming multiprocessor on a modern GPU typically contains over 100 floating point units, or what are typically referred to as cores in the GPU world. Each core is typically clocked around 800MHz, although, like CPUs, processors with higher clock rates but fewer cores are also available. GPU processors lack many features of their CPU counterparts, including large caches and branch prediction. Between the layers of cores, SMs, and the overall processor, communication becomes increasingly slow. For this reason, problems that perform well on GPUs are typically highly-parallel, but have some amount of data that can be shared between a small number of threads. We'll get into why this is in the memory section below.

Memory

Memory on a modern GPU is broken up into 3 main categories: global memory, shared memory, and registers. Global memory is the GDDR memory that's advertised on the box of the GPU and is typically around 2-12GB in size, and has a throughput of 300-400GB/s. Global memory can be accessed by all threads across all SMs on the processor, and is also the slowest type of memory on the card. Shared memory is, as the name says, memory that's shared between all threads within the same SM. It is usually at least twice as fast as global memory, but is not accessible between threads on different SMs. Registers are much like registers on a CPU in that they are the fastest way to access data on a GPU, but they are local per thread and the data is not visible to any other running thread. Both shared memory and global memory have very strict rules on how they can be accessed, with severe performance penalties for not following them. To reach the throughputs mentioned above, memory accesses must be completely coalesced between threads within the same thread group. Similar to a CPU reading into a single cache line, GPUs have cache lines sized so that a single access can serve all threads in a group if aligned properly. However, in the worst case where all threads in a group access memory in a different cache line, a separate memory read will be required for each thread. This usually means that most of the data in the cache line is not used by the thread, and the usable throughput of the memory goes down. A similar rule applies to shared memory as well, with a couple exceptions that we won't cover here.

Threading Model

GPU threads run in a SIMT (Single Instruction Multiple Thread) fashion, and each thread runs in a group with a pre-defined size in the hardware (typically 32). That last part has many implications; every thread in that group must be working on the same instruction at the same time. If any of the threads in a group need to take a divergent path (an if statement, for example) of code from the others, all threads not part of the branch suspend execution until the branch is complete. As a trivial example:

if (threadId < 5) {
   // Do something
}
// Do More

In the code above, this branch would cause 27 of our 32 threads in the group to suspend execution until the branch is complete. You can imagine if many groups of threads all run this code, the overall performance will take a large hit while most of the cores sit idle. Only when an entire group of threads is stalled is the hardware allowed to swap in another group to run on those cores.

Interfaces

Modern GPUs must have a CPU to copy data to and from CPU and GPU memory, and to launch code on the GPU. At the highest throughput, a PCIe 3.0 bus with 16 lanes can achieve rates of about 13-14GB/s. This may sound high, but when compared to the memory speeds residing on the GPU itself, they're over an order of magnitude slower. In fact, as GPUs get more powerful, the PCIe bus is increasingly becoming a bottleneck. To see any of the performance benefits the GPU has over a CPU, the GPU must be loaded with a large amount of work so that the time the GPU takes to run the job is significantly higher than the time it takes to copy the data to and from.

Newer GPUs have features to launch work dynamically in GPU code without returning to the CPU, but it's fairly limited in its use at this point.

GPU Conclusion

Because of the major architectural differences between CPUs and GPUs, it's hard to imagine either one replacing the other completely. In fact, a GPU complements a CPU well for parallel work and allows the CPU to work independently on other tasks as the GPU is running. AMD is attempting to merge the two technologies with their "Heterogeneous System Architecture" (HSA), but taking existing CPU code and determining how to split it between the CPU and GPU portion of the processor will be a big challenge not only for the processor, but for compilers as well.

Virtualization

Since you mentioned virtualization, I'll talk about it a bit, but Intel's implementation of virtualization instructions generally isn't something you need to think about unless you're writing very low-level code that directly deals with virtualization.

Dealing with that stuff is pretty messy, as you can see from this code. Setting stuff up to use Intel's VT instructions to launch a VM guest is about 1000 lines of low-level code, even for the very simple case shown there.

Virtual Memory

If you look at Vish's VT code, you'll notice that there's a decent chunk of code dedicated to page tables / virtual memory. That's another “new” feature that you don't have to worry about unless you're writing an OS or other low-level systems code. Using virtual memory is much simpler than using segmented memory, but that's not relevant nowadays so I'll just leave it at that.

SMT / Hyper-threading

Since you brought it up, I'll also mention SMT. As you said, this is mostly transparent for programmers. A typical speedup for enabling SMT on a single core is around 25%. That's good for overall throughput, but it means that each thread might only get 60% of its original performance. For applications where you care a lot about single-threaded performance, you might be better off disabling SMT. It depends a lot on the workload, though, and as with any other changes, you should run some benchmarks on your exact workload to see what works best.

One side effect of all this complexity that's been added to chips (and software) is that performance is a lot less predictable than it used to be; the relative importance of benchmarking your exact workload on the specific hardware it's going to run on has gone up.

Just for example, people often point to benchmarks from the Computer Languages Benchmarks Game as evidence that one language is faster than another. I've tried reproducing the results myself, and on my mobile Haswell (as opposed to the server Kentsfield that's used in the results), I get results that are different by as much as 2x (in relative speed). Running the same benchmark on the same machine, Nathan Kurz recently pointed me to an example where gcc -O3 is 25% slower than gcc -O2. Changing the linking order on C++ programs can cause a 15% performance change. Benchmarking is a hard problem.

Branches

Old school conventional wisdom is that branches are expensive, and should be avoided at all (or most) costs. On a Haswell, the branch misprediction penalty is 14 cycles. Branch mispredict rates depend on the workload. Using perf stat on a few different things (bzip2, top, mysqld, regenerating my blog), I get branch mispredict rates of between 0.5% and 4%. If we say that a correctly predicted branch costs 1 cycle, that's an average cost of between .995 * 1 + .005 * 14 = 1.065 cycles and .96 * 1 + .04 * 14 = 1.52 cycles. That's not so bad.

This actually overstates the penalty since about 1995, when Intel added conditional move instructions that allow you to conditionally move data without a branch. That instruction was memorably panned by Linus, which has given it a bad reputation, but it's fairly common to get significant speedups using cmov compared to branches.

A real-world example of the cost of extra branches is enabling integer overflow checks. When using bzip2 to compress a particular file, enabling those checks increases the number of instructions by about 30% (with all of the increase coming from extra branch instructions), which results in a 1% performance hit.

Unpredictable branches are bad, but most branches are predictable. Ignoring the cost of branches until your profiler tells you that you have a hot spot is pretty reasonable nowadays. CPUs have gotten a lot better at executing poorly optimized code over the past decade, and compilers are getting better at optimizing code, which makes optimizing branches a poor use of time unless you're trying to squeeze out the absolute best possible performance out of some code.

If it turns out that's what you need to do, you're likely to be better off using profile-guided optimization than trying to screw with this stuff by hand.

If you really must do this by hand, there are compiler directives you can use to say whether a particular branch is likely to be taken or not. Modern CPUs ignore branch hint instructions, but they can help the compiler lay out code better.
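
With GCC or Clang, the usual mechanism is __builtin_expect, often wrapped in likely/unlikely macros as in the Linux kernel; the functions below are hypothetical stand-ins:

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int handle_error(int err);  // hypothetical
int do_work(void);          // hypothetical

int process(int err) {
  if (unlikely(err != 0)) {  // hint: this branch is rarely taken, keep the hot path fall-through
    return handle_error(err);
  }
  return do_work();
}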

Alignment

Old school conventional wisdom is that you should pad out structs and make sure things are aligned. But on a Haswell chip, the misalignment penalty for almost any single-threaded access you can think of that doesn't cross a page boundary is zero. There are some cases where it can make a difference, but in general, this is another type of optimization that's mostly irrelevant because CPUs have gotten so much better at executing bad code. It's also mildly harmful in cases where it increases the memory footprint for no benefit.

Also, don't make things page aligned or otherwise aligned to large boundaries or you'll destroy the performance of your caches.

Self-modifying code

Here's another optimization that doesn't really make sense anymore. Using self-modifying code to decrease code size or increase performance used to make sense, but because modern CPUs split their l1 caches into separate instruction and data caches, modifying running code requires expensive communication between a chip's l1 caches.

The Future

Here are some possible changes, from least speculative to most speculative.

Partitioning

It's now obvious that more and more compute is moving into large datacenters. Sometimes this involves running on VMs, sometimes it involves running in some kind of container, and sometimes it involves running bare metal, but in any case, individual machines are often multiplexed to run a wide variety of workloads. Ideally, you'd be able to schedule best effort workloads to soak up stranded resources without affecting latency sensitive workloads with an SLA. It turns out that you can actually do this with some relatively straightforward hardware changes.

90% overall machine utilization

David Lo et al. were able to show that you can get about 90% machine utilization without impacting latency SLAs if caches can be partitioned such that best effort workloads don't impact latency sensitive workloads. The solid red line is the load on a normal Google web search cluster, and the dashed green line is what you get with the appropriate optimizations. From bar-room conversations, my impression is that the solid red line is actually already better (higher) than most of Google's competitors are able to do. If you compare the 90% optimized utilization to typical server utilization of 10% to 90%, that results in a massive difference in cost per unit of work compared to running a naive, unoptimized setup. With substantial hardware effort, Google was able to avoid interference, but additional isolation features could allow this to be done at higher efficiency with less effort.

Transactional Memory and Hardware Lock Elision

IBM already has these features in their POWER chips. Intel made an attempt to add these to Haswell, but they're disabled because of a bug. In general, modern CPUs are quite complex and we should expect to see many more bugs than we used to.

Transactional memory support is what it sounds like: hardware support for transactions. This is through three new instructions, xbegin, xend, and xabort.

xbegin starts a new transaction. A conflict (or an xabort) causes the architectural state of the processor (including memory) to get rolled back to the state it was in just prior to the xbegin. If you're using transactional memory via library or language support, this should be transparent to you. If you're implementing the library support, you'll have to figure out how to convert this hardware support, with its limited hardware buffer sizes, to something that will handle arbitrary transactions.
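
As a sketch of the raw interface, GCC, Clang, and Intel's compiler expose these instructions as the _xbegin/_xend/_xabort intrinsics (compile with -mrtm); the fallback function here is a hypothetical stand-in for whatever your library does when a transaction aborts:

#include <immintrin.h>

void fallback_locked_increment(int *p);  // hypothetical: take a real lock and increment

void increment_with_rtm(int *p) {
  unsigned status = _xbegin();        // xbegin: start a transaction
  if (status == _XBEGIN_STARTED) {
    ++*p;                             // speculative update
    _xend();                          // xend: commit
  } else {
    // The transaction aborted (conflict, capacity, interrupt, ...); transactions
    // can always abort, so a non-transactional fallback path is mandatory.
    fallback_locked_increment(p);
  }
}

A real implementation also has to make the transactional path observe the fallback lock so the two paths don't race; this is just the shape of the API.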

I'm not going to discuss Hardware Lock Elision except to say that, under the hood, it's implemented with mechanisms that are really similar to the mechanisms used to implement transactional memory and that it's designed to speed up lock-based code. If you want to take advantage of HLE, see this doc.

Fast I/O

I/O bandwidth is going up and I/O latencies are going down, both for storage and for networking. The problem is that I/O is normally done via syscalls. As we've seen, the relative overhead of syscalls has been going up. For both storage and networking, the answer is to move to user mode I/O stacks (putting everything in kernel mode would work, too, but that's a harder sell). On the storage side, that's mostly still a weirdo research thing, but HPC and HFT folks have been doing that in networking for a while. And by a while, I don't mean a few months. Here's a paper from 2005 that talks about the networking stuff I'm going to discuss, as well as some stuff I'm not going to discuss (DCA).

This is finally trickling into the non-supercomputing world. MS has been advertising Azure with infiniband networking with virtualized RDMA for over a year, Cloudflare has talked about using Solarflare NICs to get the same capability, etc. Eventually, we're going to see SoCs with fast Ethernet onboard, and unless that's limited to Xeon-type devices, it's going to trickle down into all devices. The competition between ARM devices will probably cause at least one ARM device maker to put fast Ethernet on their commodity SoCs, which may force Intel's hand.

That RDMA bit is significant; it lets you bypass the CPU completely and have the NIC respond to remote requests. A couple months ago, I worked through the Stanford/Coursera Mining Massive Data Sets class. During one of the first lectures, they provide an example of a “typical” datacenter setup with 1Gb top-of-rack switches. That's not unreasonable for processing “massive” data if you're doing kernel TCP through non-RDMA NICs, since you can floor an entire core trying to push 1Gb/s through linux's TCP stack. But with Azure, MS talks about getting 40Gb out of a single machine; that's one machine getting 40x the bandwidth of what you might expect out of an entire rack. They also mention sub 2 us latencies, which is multiple orders of magnitude lower than you can get out of kernel TCP. This isn't exactly a new idea. This paper from 2011 predicts everything that's happened on the network side so far, along with some things that are still a ways off.

This MS talk discusses how you can take advantage of this kind of bandwidth and latency for network storage. A concrete example that doesn't require clicking through to a link is Amazon's EBS. It lets you use an “elastic” disk of arbitrary size on any of your AWS nodes. Since a spinning metal disk seek has higher latency than an RPC over kernel TCP, you can get infinite storage pretty much transparently. For example, say you can get 100us (.1ms) latency out of your network, and your disk seek time is 8ms. That makes a remote disk access 8.1ms instead of 8ms, which isn't that much overhead. That doesn't work so well with SSDs, though, since you can get 20 us (.02ms) out of an SSD. But RDMA latency is low enough that a transparent EBS-like layer is possible for SSDs.

So that's networked I/O. The performance benefit might be even bigger on the disk side, if/when next generation storage technologies that are faster than flash start getting deployed. The performance delta is so large that Intel is adding new instructions to keep up with next generation low-latency storage technology. Depending on who you ask, that stuff has been a few years away for a decade or two; this is more iffy than the networking stuff. But even with flash, people are showing off devices that can get down into the single microsecond range for latency, which is a substantial improvement.

Hardware Acceleration

Like fast networked I/O, this is already here in some niches. DESRES has been doing ASICs to get 100x-1000x speedup in computational chemistry for years. Microsoft has talked about speeding up search with FPGAs. People have been looking into accelerating memcached and similar systems for a while, researchers from Toshiba and Stanford demonstrated a real implementation a while back, and I recently saw a pre-print out of Berkeley on the same thing. There are multiple companies making Bitcoin mining ASICs. That's also true for other application areas.

It seems like we should see more of this as it gets harder to get power/performance gains out of CPUs. You might consider this a dodge of your question, if you think of programming as being a software oriented endeavor, but another way to look at it is that what it means to program something will change. In the future, it might mean designing hardware like an FPGA or ASIC in combination with writing software.

Update

Now that it's 2016, one year after this post was originally published, we can see that companies are investing in hardware accelerators. In addition to its previous work on FPGA accelerated search, Microsoft has announced that it's using FPGAs to accelerate networking. Google has been closemouthed about infrastructure, as is typical for them, but if you look at the initial release of Tensorflow, you can see snippets of code that clearly reference FPGAs, such as:

enum class PlatformKind {
  kInvalid,
  kCuda,
  kOpenCL,
  kOpenCLAltera,  // Altera FPGA OpenCL platform.
                  // See documentation: go/fpgaopencl
                  // (StreamExecutor integration)
  kHost,
  kMock,
  kSize,
};

and

string PlatformKindString(PlatformKind kind) {
  switch (kind) {
    case PlatformKind::kCuda:
      return "CUDA";
    case PlatformKind::kOpenCL:
      return "OpenCL";
    case PlatformKind::kOpenCLAltera:
      return "OpenCL+Altera";
    case PlatformKind::kHost:
      return "Host";
    case PlatformKind::kMock:
      return "Mock";
    default:
      return port::StrCat("InvalidPlatformKind(", static_cast<int>(kind), ")");
  }
}

As of this writing, Google doesn't return any results for +google +kOpenClAltera, so it doesn't appear that this has been widely observed. If you're not familiar with Altera OpenCL and you work at google, you can try the internal go link suggested in the comment, go/fpgaopencl. If, like me, you don't work at Google, well, there's Altera's docs here. The basic idea is that you can take OpenCL code, the same kind of thing you might run on a GPU, and run it on an FPGA instead, and from the comment, it seems like Google has some kind of setup that lets you stream data in and out of nodes with FPGAs.

That FPGA-specific code was removed in ddd4aaf5286de24ba70402ee0ec8b836d3aed8c7, which has a commit message that starts with “TensorFlow: upstream changes to git.” and then has a list of internal google commits that are being upstreamed, along with a description of each internal commit. Curiously, there's nothing about removing FPGA support even though that seems like it's a major enough thing that you'd expect it to be described, unless it was purposely redacted. Amazon has also been quite secretive about their infrastructure plans, but you can make reasonable guesses there by looking at the hardware people they've been vacuuming up. A couple other companies are also betting pretty heavily on hardware accelerators, but since I learned about that through private conversations (as opposed to accidentally published public source code or other public information), I'll leave you to guess which companies.

Dark Silicon / SoCs

One funny side effect of the way transistor scaling has turned out is that we can pack a ton of transistors on a chip, but they generate so much heat that the average transistor can't switch most of the time if you don't want your chip to melt.

A result of this is that it makes more sense to include dedicated hardware that isn't used a lot of the time. For one thing, this means we get all sorts of specialized instructions like the PCMP and ADX instructions. But it also means that we're getting chips with entire devices integrated that would have previously lived off-chip. That includes things like GPUs and (for mobile devices) radios.
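
As a concrete sketch of what one of these specialized instruction families is for, here's multi-word addition written with the add-with-carry intrinsics, which compilers can lower to ADC or to the newer ADX instructions (ADCX/ADOX). This is purely illustrative and x86-specific:

#include <immintrin.h>
#include <stdio.h>

// 128-bit addition out of two 64-bit limbs via add-with-carry intrinsics.
int main(void) {
  unsigned long long a_lo = 0xffffffffffffffffULL, a_hi = 1; // a = 2^65 - 1
  unsigned long long b_lo = 1, b_hi = 2;                     // b = 2^65 + 1
  unsigned long long r_lo, r_hi;

  unsigned char carry = _addcarry_u64(0, a_lo, b_lo, &r_lo);
  _addcarry_u64(carry, a_hi, b_hi, &r_hi);

  printf("sum = %llu * 2^64 + %llu\n", r_hi, r_lo); // 4 * 2^64 + 0
  return 0;
}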

In combination with the hardware acceleration trend, it also means that it makes more sense for companies to design their own chips, or at least parts of their own chips. Apple has gotten a lot of mileage out of acquiring PA Semi. First, by adding little custom accelerators to bog standard ARM architectures, and then by adding custom accelerators to their own custom architecture. Due to a combination of the right custom hardware plus well thought out benchmarking and system design, the iPhone 4 is slightly more responsive than my flagship Android phone, which is multiple years newer and has a much faster processor as well as more RAM.

Amazon has picked up a decent chunk of the old Calxeda team and is hiring enough to create a good-sized hardware design team. Facebook has picked up a small handful of ARM SoC folks and is partnering with Qualcomm on something-or-other. Linus is on record as saying we're going to see more dedicated hardware all over the place. And so on and so forth.

Conclusion

x86 chips have picked up a lot of new features and whiz-bang gadgets. For the most part, you don't have to know what they are to take advantage of them. As a first-order approximation, making your code predictable and keeping memory locality in mind works pretty well. The really low-level stuff is usually hidden by libraries or drivers, and compilers will try to take care of the rest of it. The exceptions are if you're writing really low-level code, in which case the world has gotten a lot messier, or if you're trying to get the absolute best possible performance out of your code, in which case the world has gotten a lot weirder.

Also, things will happen in the future. But most predictions are wrong, so who knows?

Resources

This is a talk by Matt Godbolt that covers a lot of the implementation details that I don't get into. To go down one more level of detail, see Modern Processor Design, by Shen and Lipasti. Despite the date listed on Amazon (2013), the book is pretty old, but it's still the best book I've found on processor internals. It describes, in good detail, what you need to implement to make a P6-era high-performance CPU. It also derives theoretical performance limits given different sets of assumptions and talks about a lot of different engineering tradeoffs, with explanations of why for a lot of them.

For one level deeper of "why", you'll probably need to look at a VLSI text, which will explain how devices and interconnect scale and how that affects circuit design, which in turn affects architecture. I really like Weste & Harris because they have clear explanations and good exercises with solutions that you can find online, but if you're not going to work the problems pretty much any VLSI text will do. For one more level deeper of the "why" of things, you'll want a solid state devices text and something that explains how transmission lines and interconnect can work. For devices, I really like Pierret's books. I got introduced to the E-mag stuff through Ramo, Whinnery & Van Duzer, but Ida is a better intro text.

For specifics about current generation CPUs and specific optimization techniques, see Agner Fog's site. For something on optimization tools from the future, see this post. What Every Programmer Should Know About Memory is also good background knowledge. Those docs cover a lot of important material, but if you're writing in a higher level language there are a lot of other things you need to keep in mind. For more on Intel CPU history, Xao-Feng Li has a nice overview.

For something a bit off the wall, see this post on the possibility of CPU backdoors. For something less off the wall, see this post on how the complexity we have in modern CPUs enables all sorts of exciting bugs.

For more benchmarks on locking, see this post by Aleksey Shipilev, this post by Paul Khuong, as well as their archives.

For general benchmarking, last year's Strange Loop benchmarking talk by Aysylu Greenberg is a nice intro to common gotchas. For something more advanced but more specific, Gil Tene's talk on latency is great.

For historical computing that predates everything I've mentioned by quite some time, see IBM's Early Computers and Design of a Computer, which describes the design of the CDC 6600. Readings in Computer Architecture is also good for seeing where a lot of these ideas originally came from.

Sorry, this list is pretty incomplete. Suggestions welcome!

Tiny Disclaimer

I have no doubt that I'm leaving stuff out. Let me know if I'm leaving out anything you think is important and I'll update this. I've tried to keep things as simple as possible while still capturing the flavor of what's going on, but I'm sure that there are some cases where I'm oversimplifying, and some things that I just completely forgot to mention. And of course basically every generalization I've made is wrong if you're being really precise. Even just picking at my first couple sentences, A20M isn't always and everywhere irrelevant (I've probably spent about 0.2% of my career dealing with it), x86-64 isn't strictly superior to x86 (on one workload I had to deal with, the performance benefit from the extra registers was more than canceled out by the cost of the longer instructions; it's pretty rare that the instruction stream and icache misses are the long pole for a workload, but it happens), etc. The biggest offender is probably in my NUMA explanation, since it is actually possible for P6 busses to respond with a defer or retry to a request. It's reasonable to avoid using a similar mechanism to enforce coherency but I couldn't think of a reasonable explanation of why that didn't involve multiple levels of explanations. I'm really not kidding when I say that pretty much every generalization falls apart if you dig deep enough. Every abstraction I'm talking about is leaky. I've tried to include links to docs that go at least one level deeper, but I'm probably missing some areas.

Acknowledgments

Thanks to Leah Hanson and Nathan Kurz for comments that resulted in major edits, and to Nicholas Radcliffe, Stefan Kanthak, Garret Reid, Matt Godbolt, Nikos Patsiouras, Aleksey Shipilev, and Oscar R Moll for comments that resulted in minor edits, and to David Albert for allowing me to quote him and also for some interesting follow-up questions when we talked about this a while back. Also, thanks to Cliff Burdick for writing the section on GPUs and to Hari Angepat for spotting the Google kOpenCLAltera code in TensorFlow.


A review of the Julia language

2014-12-28 08:00:00

Here's a language that gives near-C performance that feels like Python or Ruby with optional type annotations (that you can feed to one of two static analysis tools) that has good support for macros plus decent-ish support for FP, plus a lot more. What's not to like? I'm mostly not going to talk about how great Julia is, though, because you can find plenty of blog posts that do that all over the internet.

The last time I used Julia (around Oct. 2014), I ran into two new (to me) bugs involving bogus exceptions when processing Unicode strings. To work around those, I used a try/catch, but of course that runs into a non-deterministic bug I've found with try/catch. I also hit a bug where a function returned a completely wrong result if you passed it an argument of the wrong type instead of throwing a "no method" error. I spent half an hour writing a throwaway script and ran into four bugs in the core language.

The second to last time I used Julia, I ran into too many bugs to list; the worst of them caused generating plots to take 30 seconds per plot, which caused me to switch to R/ggplot2 for plotting. First there was this bug where plotting dates didn't work. When I worked around that I ran into a regression that caused plotting to break large parts of the core language, so that data manipulation had to be done before plotting. That would have been fine if I knew exactly what I wanted, but for exploratory data analysis I want to plot some data, do something with the data, and then plot it again. Doing that required restarting the REPL for each new plot. That would have been fine, except that it takes 22 seconds to load Gadfly on my 1.7GHz Haswell (timed by using time on a file that loads Gadfly and does no work), plus another 10-ish seconds to load the other packages I was using, turning my plotting workflow into: restart REPL, wait 30 seconds, make a change, make a plot, look at a plot, repeat.

It's not unusual to run into bugs when using a young language, but Julia has more than its share of bugs for something at its level of maturity. If you look at the test process, that's basically inevitable.

As far as I can tell, FactCheck is the most commonly used thing resembling a modern test framework, and it's barely used. Until quite recently, it was unmaintained and broken, but even now the vast majority of tests are written using @test, which is basically an assert. It's theoretically possible to write good tests by having a file full of test code and asserts. But in practice, anyone who's doing that isn't serious about testing and isn't going to write good tests.

Not only are existing tests not very good, most things aren't tested at all. You might point out that the coverage stats for a lot of packages aren't so bad, but last time I looked, there was a bug in the coverage tool that caused it to only aggregate coverage statistics for functions with non-zero coverage. That is to say, code in untested functions doesn't count towards the coverage stats! That, plus the weak notion of test coverage that's used (line coverage1), makes the coverage stats unhelpful for determining if packages are well tested.

The lack of testing doesn't just mean that you run into regression bugs. Features just disappear at random, too. When the REPL got rewritten a lot of existing shortcut keys and other features stopped working. As far as I can tell, that wasn't because anyone wanted it to work differently. It was because there's no way to re-write something that isn't tested without losing functionality.

Something that goes hand-in-hand with the level of testing on most Julia packages (and the language itself) is the lack of a good story for error handling. Although you can easily use Nullable (the Julia equivalent of Some/None) or error codes in Julia, the most common idiom is to use exceptions. And if you use things in Base, like arrays or /, you're stuck with exceptions. I'm not a fan, but that's fine -- plenty of reliable software uses exceptions for error handling.

The problem is that because the niche Julia occupies doesn't care2 about error handling, it's extremely difficult to write a robust Julia program. When you're writing smaller scripts, you often want to “fail-fast” to make debugging easier, but for some programs, you want the program to do something reasonable, keep running, and maybe log the error. It's hard to write a robust program, even for this weak definition of robust. There are problems at multiple levels. For the sake of space, I'll just list two.

If I'm writing something I'd like to be robust, I really want function documentation to include all exceptions the function might throw. Not only do the Julia docs not have that, it's common to call some function and get a random exception that has to do with an implementation detail and nothing to do with the API. Everything I've written that actually has to be reliable has been exception free, so maybe that's normal when people use exceptions? Seems pretty weird to me, though.

Another problem is that catching exceptions doesn't work (sometimes, at random). I ran into one bug where using exceptions caused code to be incorrectly optimized out. You might say that's not fair because it was caught using a fuzzer, and fuzzers are supposed to find bugs, but the fuzzer wasn't fuzzing exceptions or even expressions. The implementation of the fuzzer just happens to involve eval'ing function calls, in a loop, with a try/catch to handle exceptions. Turns out, if you do that, the function might not get called. This isn't a case of using a fuzzer to generate billions of tests, one of which failed. This was a case of trying one thing, one of which failed. That bug is now fixed, but there's still a nasty bug that causes exceptions to sometimes fail to be caught by catch, which is pretty bad news if you're putting something in a try/catch block because you don't want an exception to trickle up to the top level and kill your program.

When I grepped through Base to find instances of actually catching an exception and doing something based on the particular exception, I could only find a single one. Now, it's me scanning grep output in less, so I might have missed some instances, but it isn't common, and grepping through common packages finds a similar ratio of error handling code to other code. Julia folks don't care about error handling, so it's buggy and incomplete. I once asked about this and was told that it didn't matter that exceptions didn't work because you shouldn't use exceptions anyway -- you should use Erlang style error handling where you kill the entire process on an error and build transactionally robust systems that can survive having random processes killed. Putting aside the difficulty of that in a language that doesn't have Erlang's support for that kind of thing, you can easily spin up a million processes in Erlang. In Julia, if you load just one or two commonly used packages, firing up a single new instance of Julia can easily take half a minute or a minute. To spin up a million independent instances at 30 to 60 seconds apiece would take one to two years.

Since we're broadly on the topic of APIs, error conditions aren't the only place where the Base API leaves something to be desired. Conventions are inconsistent in many ways, from function naming to the order of arguments. Some methods on collections take the collection as the first argument and some don't (e.g., replace takes the string first and the regex second, whereas match takes the regex first and the string second).

More generally, Base APIs outside of the niche Julia targets often don't make sense. There are too many examples to list them all, but consider this one: the UDP interface throws an exception on a partial packet. This is really strange and also unhelpful. Multiple people stated that on this issue but the devs decided to throw the exception anyway. The Julia implementers have great intuition when it comes to linear algebra and other areas they're familiar with. But they're only human and their intuition isn't so great in areas they're not familiar with. The problem is that they go with their intuition anyway, even in the face of comments about how that might not be the best idea.

Another thing that's an issue for me is that I'm not in the audience the package manager was designed for. It's backed by git in a clever way that lets people do all sorts of things I never do. The result of all that is that it needs to do git status on each package when I run Pkg.status(), which makes it horribly slow; most other Pkg operations I care about are also slow for a similar reason.

That might be ok if it had the feature I most wanted, which is the ability to specify exact versions of packages and have multiple, conflicting, versions of packages installed3. Because of all the regressions in the core language libraries and in packages, I often need to use an old version of some package to make some function actually work, which can require old versions of its dependencies. There's no non-hacky way to do this.

Since I'm talking about issues where I care a lot more than the core devs, there's also benchmarking. The website shows off some impressive-sounding speedup numbers over other languages. But they're all benchmarks that are pretty far from real workloads. Even if you have a strong background in workload characterization and systems architecture (computer architecture, not software architecture), it's difficult to generalize performance results on anything resembling a real workload from microbenchmark numbers. From what I've heard, performance optimization of Julia is done from a larger set of similar benchmarks, which has problems for all of the same reasons. Julia is actually pretty fast, but this sort of ad hoc benchmarking basically guarantees that performance is being left on the table. Moreover, the benchmarks are written in a way that stacks the deck against other languages. People from other language communities often get rebuffed when they submit PRs to rewrite the benchmarks in their languages idiomatically. The Julia website claims that "all of the benchmarks are written to test the performance of specific algorithms, expressed in a reasonable idiom", and that making adjustments that are idiomatic for specific languages would be unfair. However, if you look at the Julia code, you'll notice that it's written to avoid a number of things that would crater performance. If you follow the mailing list, you'll see that there are quite a few intuitive ways to write Julia code that has very bad performance. The Julia benchmarks avoid those pitfalls, but the code for other languages isn't written with anywhere near that care; in fact, it's just the opposite.

I've just listed a bunch of issues with Julia. I believe the canonical response for complaints about an open source project is, why don't you fix the bugs yourself, you entitled brat? Well, I tried that. For one thing, there are so many bugs that I often don't file bugs, let alone fix them, because it's too much of an interruption. But the bigger issue is the barriers to new contributors. I spent a few person-days fixing bugs (mostly debugging, not writing code) and that was almost enough to get me into the top 40 on GitHub's list of contributors. My point isn't that I contributed a lot. It's that I didn't, and that still put me right below the top 40.

There's lots of friction that keeps people from contributing to Julia. The build is often broken or has failing tests. When I polled Travis CI stats for languages on GitHub, Julia was basically tied for last in uptime. This isn't just a statistical curiosity: the first time I tried to fix something, the build was non-deterministically broken for the better part of a week because someone checked bad code directly into master without review. I spent maybe a week fixing a few things and then took a break. The next time I came back to fix something, tests were failing for a day because of another bad check-in and I gave up on the idea of fixing bugs. That tests fail so often is even worse than it sounds when you take into account the poor test coverage. And even when the build is "working", it uses recursive makefiles, and often fails with a message telling you that you need to run make clean and build again, which takes half an hour. When you do so, it often fails with a message telling you that you need to make clean all and build again, which takes an hour. And then there's some chance that will fail and you'll have to manually clean out deps and build again, which takes even longer. And that's the good case! The bad case is when the build fails non-deterministically. These are well-known problems that occur when using recursive make, described in Recursive Make Considered Harmful circa 1997.

And that's not even the biggest barrier to contributing to core Julia. The biggest barrier is that the vast majority of the core code is written with no markers of intent (comments, meaningful variable names, asserts, meaningful function names, explanations of short variable or function names, design docs, etc.). There's a tax on debugging and fixing bugs deep in core Julia because of all this. I happen to know one of the Julia core contributors (presently listed as the #2 contributor by GitHub's ranking), and when I asked him about some of the more obtuse functions I was digging around in, he couldn't figure it out either. His suggestion was to ask the mailing list, but for the really obscure code in the core codebase, there's perhaps one to three people who actually understand the code, and if they're too busy to respond, you're out of luck.

I don't mind spending my spare time working for free to fix other people's bugs. In fact, I do quite a bit of that and it turns out I often enjoy it. But I'm too old and crotchety to spend my leisure time deciphering code that even the core developers can't figure out because it's too obscure.

None of this is to say that Julia is bad, but the concerns of the core team are pretty different from my concerns. This is the point in a complain-y blog post where you're supposed to suggest an alternative or make a call to action, but I don't know that either makes sense here. The purely technical problems, like slow load times or the package manager, are being fixed or will be fixed, so there's not much to say there. As for process problems, like not writing tests, not writing internal documentation, and checking unreviewed and sometimes breaking changes directly into master, well, that's “easy”4 to fix by adding a code review process that forces people to write tests and documentation for code, but that's not free.

A small team of highly talented developers who can basically hold all of the code in their collective heads can make great progress while eschewing anything that isn't just straight coding at the cost of making it more difficult for other people to contribute. Is that worth it? It's hard to say. If you have to slow down Jeff, Keno, and the other super productive core contributors and all you get out of it is a couple of bums like me, that's probably not worth it. If you get a thousand people like me, that's probably worth it. The reality is in the ambiguous region in the middle, where it might or might not be worth it. The calculation is complicated by the fact that most of the benefit comes in the long run, whereas the costs are disproportionately paid in the short run. I once had an engineering professor who claimed that the answer to every engineering question is "it depends". What should Julia do? It depends.

2022 Update

This post originally mentioned how friendly the Julia community is, but I removed that since it didn't seem accurate in light of the responses. Many people were highly supportive, such as this Julia core developer:

However, a number of people had some pretty nasty responses and I don't think it's accurate to say that a community is friendly when the response is mostly positive, but with a significant fraction of nasty responses, since it doesn't really take a lot of nastiness to make a group seem unfriendly. Also, sentiment about this post has gotten more negative over time as communities tend to take their direction from the top and a couple of the Julia co-creators have consistently been quite negative about this post.

Now, onto the extent to which these issues have been fixed. The initial response from the co-founders was that the issues aren't really real and the post is badly mistaken. Over time, as some of the issues had some work done on them, the response changed to being that this post is out of date and the issues were all fixed, e.g., here's a response from one of the co-creators of Julia in 2016:

The main valid complaints in Dan's post were:

  1. Insufficient testing & coverage. Code coverage is now at 84% of base Julia, from somewhere around 50% at the time he wrote this post. While you can always have more tests (and that is happening), I certainly don't think that this is a major complaint at this point.

  2. Package issues. Julia now has package precompilation so package loading is pretty fast. The package manager itself was rewritten to use libgit2, which has made it much faster, especially on Windows where shelling out is painfully slow.

  3. Travis uptime. This is much better. There was a specific mystery issue going on when Dan wrote that post. That issue has been fixed. We also do Windows CI on AppVeyor these days.

  4. Documentation of Julia internals. Given the quite comprehensive developer docs that now exist, it's hard to consider this unaddressed: http://julia.readthedocs.org/en/latest/devdocs/julia/

So the legitimate issues raised in that blog post are fixed.

The top response to that is:

The main valid complaints [...] the legitimate issues raised [...]

This is a really passive-aggressive weaselly phrasing. I’d recommend reconsidering this type of tone in public discussion responses.

Instead of suggesting that the other complaints were invalid or illegitimate, you could just not mention them at all, or at least use nicer language in brushing them aside. E.g. “... the main actionable complaints...” or “the main technical complaints ...”

Putting aside issues of tone, I would say that the main issue from the post, the core team's attitude towards correctness, is both a legitimate issue and one that's unfixed, as we'll see when we look at how the specific issues mentioned as fixed are also unfixed.

On correctness, if the correctness issues were fixed, we wouldn't continue to see showstopping bugs in Julia, but I have a couple of friends who continued to use Julia for years until they got fed up with correctness issues and sent me quite a few bugs that they personally ran into that were serious well after the 2016 comment about correctness being fixed, such as getting an incorrect result when sampling from a distribution, sampling from an array produces incorrect results, the product function, i.e., multiplication, produces incorrect results, quantile produces incorrect results, mean produces incorrect results, incorrect array indexing, divide produces incorrect results, converting from float to int produces incorrect results, quantile produces incorrect results (again), mean produces incorrect results (again), etc.

There has been a continued flow of very serious bugs from Julia and numerous other people noting that they've run into serious bugs, such as here:

I remember all too un-fondly a time in which one of my Julia models was failing to train. I spent multiple months on-and-off trying to get it working, trying every trick I could think of.

Eventually – eventually! – I found the error: Julia/Flux/Zygote was returning incorrect gradients. After having spent so much energy wrestling with points 1 and 2 above, this was the point where I simply gave up. Two hours of development work later, I had the model successfully training… in PyTorch.

And here

I have been bit by incorrect gradient bugs in Zygote/ReverseDiff.jl. This cost me weeks of my life and has thoroughly shaken my confidence in the entire Julia AD landscape. [...] In all my years of working with PyTorch/TF/JAX I have not once encountered an incorrect gradient bug.

And here

Since I started working with Julia, I’ve had two bugs with Zygote which have slowed my work by several months. On a positive note, this has forced me to plunge into the code and learn a lot about the libraries I’m using. But I’m finding myself in a situation where this is becoming too much, and I need to spend a lot of time debugging code instead of doing climate research.

Despite this continued flow of bugs, public responses from the co-creators of Julia as well as a number of core community members generally claim, as they did for this post, that the issues will be fixed very soon (e.g., see the comments here by some core devs on a recent post, saying that all of the issues are being addressed and will be fixed soon, or this 2020 comment about how there were serious correctness issues in 2016 but things are now good, etc.).

Instead of taking the correctness issues or other issues seriously, the developers make statements like the following comments from a co-creator of Julia, passed to me by a friend of mine as my friend ran into yet another showstopping bug:

takes that Julia doesn't take testing seriously... I don't get it. the amount of time and energy we spend on testing the bejeezus out of everything. I literally don't know any other open source project as thoroughly end-to-end tested.

The general claim is that, not only has Julia fixed its correctness issues, it's as good as it gets for correctness.

On the package issues, the claim was that package load times were fixed by 2016. But this continues to be a major complaint of the people I know who use Julia, e.g., Jamie Brandon switched away from using Julia in 2022 because it took two minutes for his CSV parsing pipeline to run, where most of the time was package loading. Another example is that, in 2020, on a benchmark where the Julia developers bragged that Julia is very fast at the curious workload of repeatedly loading the same CSV over and over again (in a loop, not by running a script repeatedly) compared to R, some people noted that this was unrealistic due to Julia's very long package load times, saying that it takes 2 seconds to open the CSV package and then 104 seconds to load a plotting library. In 2022, in response to comments that package loading is painfully slow, a Julia developer responds to each issue saying each one will be fixed; on package loading, they say

We're getting close to native code caching, and more: https://discourse.julialang.org/t/precompile-why/78770/8. As you'll also read, the difficulty is due to important tradeoffs Julia made with composability and aggressive specialization...but it's not fundamental and can be surmounted. Yes there's been some pain, but in the end hopefully we'll have something approximating the best of both worlds.

It's curious that these problems could exist in 2020 and 2022 after a co-creator of Julia claimed, in 2016, that the package load time problems were fixed. But this is the general pattern of Julia PR that we see. On any particular criticism, the response is one of: illegitimate; will be fixed soon; or, when the criticism is more than a year old, already fixed. But we can see by looking at responses over time that the issues that are "already fixed" or "will be fixed soon" are, in fact, not fixed many years after claims that they were fixed. It's true that there is progress on the issues, but it wasn't really fair to say that package load time issues were fixed and "package loading is pretty fast" when it takes nearly two minutes to load a CSV and use a standard plotting library (an equivalent to ggplot2) to generate a plot in Julia. And likewise for correctness issues when there's still a steady stream of issues in core libraries, Julia itself, and libraries that are named as part of the magic that makes Julia great. For example, autodiff is frequently named as a huge advantage of Julia when it comes to features, but when it comes to bugs, those bugs don't count because they're not in Julia itself (that last comment, of course, has a comment from a Julia developer noting that all of the issues will be addressed soon).

There's a sleight of hand here where the reflexive response from a number of the co-creators as well as core developers of Julia is to brush off any particular issue with a comment that sounds plausible if read on HN or Twitter by someone who doesn't know people who've used Julia. This makes for good PR since, with an emerging language like Julia, most potential users won't have real connections who've used it seriously and the reflexive comments sound plausible if you don't look into them.

I use the word reflexive here because it seems that some co-creators of Julia respond to any criticism with a rebuttal, such as here, where a core developer responds to a post about showstopping bugs by saying that having bugs is actually good, and here, where in response to my noting that some people had commented that they were tired of misleading benchmarking practices by Julia developers, a co-creator of Julia drops in to say "I would like to let it be known for the record that I do not agree with your statements about Julia in this thread." But my statements in the thread were merely that there existed comments like https://news.ycombinator.com/item?id=24748582. It's quite nonsensical to state, for the record, a disagreement that those kinds of comments exist because they clearly do exist.

Another example of a reflexive response is this 2022 thread, where someone who tried Julia but stopped using it for serious work after running into one too many bugs that took weeks to debug suggests that the Julia ecosystem needs a rewrite because the attitude and culture in the community results in a large number of correctness issues. A core Julia developer "rebuts" the comment by saying that things are re-written all the time and gives examples of things that were re-written for performance reasons. Performance re-writes are, famously, a great way to introduce bugs, making the "rebuttal" actually a kind of anti-rebuttal. But, as is typical for many core Julia developers, the person saw that there was an issue (not enough re-writes) and reflexively responded with a denial, that there are enough re-writes.

These reflexive responses are pretty obviously bogus if you spend a bit of time reading them and looking at the historical context, but this kind of "deny deny deny" response is generally highly effective PR and has been effective for Julia, so it's understandable that it's done. For example, on the 2020 comment mentioned above, which belies the 2016 claim by saying that there were serious issues in 2016 but things are "now" good in 2020, someone responds "Thank you, this is very heartening.", since it relieves them of their concern that there are still issues. Of course, you can see basically the same discussion play out in 2022, but people reading the discussion in 2022 generally won't go back to see that this same discussion happened in 2020, 2016, 2013, etc.

On the build uptime, the claim is that the issue causing uptime issues was fixed, but my comment there was on the attitude of brushing off the issue for an extended period of time with "works on my machine". As we can see from the examples above, the meta-issue of brushing off issues continued.

On the last issue that was claimed to be legitimate, which was also claimed to be fixed, documentation, this is still a common complaint from the community, e.g., here in 2018, two years after it was claimed that documentation was fixed in 2016, here in 2019, here in 2022, etc. In a much lengthier complaint, one person notes

The biggest issue, and one they seem unwilling to really address, is that actually using the type system to do anything cool requires you to rely entirely on documentation which may or may not exist (or be up-to-date).

And another echoes this sentiment with

This is truly an important issue.

Of course, there's a response saying this will be fixed soon, as is generally the case. And yet, you can still find people complaining about the documentation.

If you go back and read discussions on Julia correctness issues, three more common defenses are that everything has bugs, bugs are quickly fixed, and testing is actually great because X is well tested. You can see examples of "everything has bugs" here in 2014 as well as here in 2022 (and in between as well, of course), as if all non-zero bug rates are the same, even though a number of developers have noted that they stopped using Julia for work and switched to other ecosystems because, while everything has bugs, all non-zero numbers are, of course, not the same.

Bugs getting fixed quickly is sometimes not true (e.g., many of the bugs linked in this post have been open for quite a while and are still open) and is also a classic defense that's used to distract from the issue of practices that directly lead to the creation of an unusually large number of new bugs. As noted in a number of links above, it can take weeks or months to debug correctness issues since many of the correctness issues are of the form "silently return incorrect results" and, as noted above, I ran into a bug where exceptions were non-deterministically incorrectly not caught. It may be true that, in some cases, these sorts of bugs are quickly fixed when found, but those issues still cost users a lot of time to track down.

We saw an example of "testing is actually great because X is well tested" above. If you'd like a more recent example, here's one from 2022 where, in response to someone saying that they ran into more correctness bugs in Julia than in any other ecosystem they've used in their decades of programming, a core Julia dev responds by saying that a number of things are very well tested in Julia, such as libuv, as if testing some components well is a talisman that can be wielded against bugs in other components. This is obviously absurd, in that it's like saying that a building with an open door can't be insecure because it also has very sturdy walls, but it's a common defense used by core Julia developers.

And, of course, there's also just straight-up FUD about writing about Julia. For example, in 2022, on Yuri Vishnevsky's post on Julia bugs, a co-creator of Julia said "Yuri's criticism was not that Julia has correctness bugs as a language, but that certain libraries when composed with common operations had bugs (many of which are now addressed).". This is, of course, completely untrue. In conversations with Yuri, he noted to me that he specifically included examples of core language and core library bugs because those happened so frequently, and it was frustrating that core Julia people pretended those didn't exist and that their FUD seemed to work since people would often respond as if their comments weren't untrue. As mentioned above, this kind of flat denial of simple matters of fact is highly effective, so it's understandable that people employ it but, personally, it's not to my taste.

To be clear, I don't inherently have a problem with software being buggy. As I've mentioned, I think move fast and break things can be a good value because it clearly states that velocity is more valued than correctness. Comments from the creators of Julia as well as core developers broadcast that Julia is not just highly reliable and correct, but actually world class ("the amount of time and energy we spend on testing the bejeezus out of everything. I literally don't know any other open source project as thoroughly end-to-end tested.", etc.). But, by revealed preference, we can see that Julia's values are "move fast and break things".

Appendix: blog posts on Julia

  • 2014: this post
  • 2016: Victor Zverovich
    • Julia brags about high performance in unrepresentative microbenchmarks but often has poor performance in practice
    • Complex codebase leading to many bugs
  • 2022: Volker Weissman
    • Poor documentation
    • Unclear / confusing error messages
    • Benchmarks claim good performance but benchmarks are of unrealistic workloads and performance is often poor in practice
  • 2022: Patrick Kidger comparison of Julia to JAX and PyTorch
    • Poor documentation
    • Correctness issues in widely relied on, important, libraries
    • Inscrutable error messages
    • Poor code quality, leading to bugs and other issues
  • 2022: Yuri Vishnevsky
    • Many very serious correctness bugs in both the language runtime and core libraries that are heavily relied on
    • Culture / attitude has persistently caused a large number of bugs, "Julia and its packages have the highest rate of serious correctness bugs of any programming system I’ve used, and I started programming with Visual Basic 6 in the mid-2000s"
      • Stream of serious bugs is in stark contrast to comments from core Julia developers and Julia co-creators saying that Julia is very solid and has great correctness properties

Thanks (or anti-thanks) to Leah Hanson for pestering me to write this for the past few months. It's not the kind of thing I'd normally write, but the concerns here got repeatedly brushed off when I brought them up in private. For example, when I brought up testing, I was told that Julia is better tested than most projects. While that's true in some technical sense (the median project on GitHub probably has zero tests, so any non-zero number of tests is above average), I didn't find that to be a meaningful rebuttal (as opposed to a reply that Julia is still expected to be mostly untested because it's in an alpha state). After getting a similar response on a wide array of topics I stopped using Julia. Normally that would be that, but Leah really wanted these concerns to stop getting ignored, so I wrote this up.

Also, thanks to Leah Hanson, Julia Evans, Joe Wilder, Eddie V, David Andrzejewski, and Yuri Vishnevsky for comments/corrections/discussion.


  1. What I mean here is that you can have lots of bugs pop up despite having 100% line coverage. It's not that line coverage is bad, but that it's not sufficient, not even close. And because it's not sufficient, it's a pretty bad sign when you not only don't have 100% line coverage, you don't even have 100% function coverage. [return]
  2. I'm going to use the word care a few times, and when I do I mean something specific. When I say care, I mean that in the colloquial revealed preference sense of the word. There's another sense of the word, in which everyone cares about testing and error handling, the same way every politician cares about family values. But that kind of caring isn't linked to what I care about, which involves concrete actions. [return]
  3. It's technically possible to have multiple versions installed, but the process is a total hack. [return]
  4. By "easy", I mean extremely hard. Technical fixes can be easy, but process and cultural fixes are almost always hard. [return]

Integer overflow checking cost

2014-12-17 08:00:00

How much overhead should we expect from enabling integer overflow checks? Using a compiler flag or built-in intrinsics, we should be able to do the check with a conditional branch that branches based on the overflow flag that add and sub set. Code that looks like

add     %esi, %edi

should turn into something like

add     %esi, %edi
jo      <handle_overflow>

Assuming that branch is always correctly predicted (which should be the case for most code), the costs of the branch are the cost of executing that correctly predicted not-taken branch, the pollution the branch causes in the branch history table, and the cost of decoding the branch (on x86, jo and jno don't fuse with add or sub, which means that on the fast path, the branch will take up one of the 4 opcodes that can come from the decoded instruction cache per cycle). That's probably less than a 2x penalty per add or sub on front-end limited code in the worst case (which might happen in a tightly optimized loop, but should be rare in general), plus some nebulous penalty from branch history pollution which is really difficult to measure in microbenchmarks. Overall, we can use 2x as a pessimistic guess for the total penalty.
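
As an aside, newer compilers (roughly the clang 3.8 / gcc 5 era mentioned in the update below) expose this check directly at the source level via overflow-checking builtins, without needing a global compiler flag. A minimal sketch:

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

// Checked add via the gcc/clang builtin; with optimization on, this typically
// compiles down to an add followed by a jump on the overflow flag, as above.
int checked_add(int a, int b) {
  int result;
  if (__builtin_add_overflow(a, b, &result)) {
    fprintf(stderr, "overflow adding %d and %d\n", a, b);
    abort();
  }
  return result;
}

int main(void) {
  printf("%d\n", checked_add(1, 2));       // prints 3
  printf("%d\n", checked_add(INT_MAX, 1)); // aborts with a message
  return 0;
}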

2x sounds like a lot, but how much time do applications spend adding and subtracting? If we look at the most commonly used benchmark of “workstation” integer workloads, SPECint, the composition is maybe 40% load/store ops, 10% branches, and 50% other operations. Of the 50% “other” operations, maybe 30% of those are integer add/sub ops. If we guesstimate that load/store ops are 10x as expensive as add/sub ops, and other ops are as expensive as add/sub, a 2x penalty on add/sub should result in a slowdown of roughly (40*10 + 10 + 50 + 15) / (40*10 + 10 + 50) ≈ 1.03, i.e., about a 3% penalty. That the penalty for a branch is 2x, that add/sub ops are only 10x faster than load/store ops, and that add/sub ops aren't faster than other "other" ops are all pessimistic assumptions, so this estimate should be on the high end for most workloads.
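
To make the arithmetic explicit, here's the same estimate as a tiny program (all of the weights are the guesstimates above, not measured numbers):

#include <stdio.h>

int main(void) {
  // Guesstimated SPECint-ish instruction mix and relative costs from above.
  double load_store = 40 * 10.0; // 40% of ops, each ~10x the cost of an add/sub
  double branches = 10 * 1.0;    // 10% of ops
  double other = 50 * 1.0;       // 50% of ops
  double add_sub = 0.30 * 50;    // ~30% of "other" are add/sub

  double baseline = load_store + branches + other;
  double checked = baseline + add_sub; // a 2x penalty doubles the add/sub cost
  printf("estimated penalty: %.1f%%\n", 100.0 * (checked / baseline - 1.0));
  return 0;
}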

John Regehr, who's done serious analysis on integer overflow checks, estimates that the penalty should be about 5%, which is in the same ballpark as our napkin sketch estimate.

A SPEC license costs $800, so let's benchmark bzip2 (which is a component of SPECint) instead of paying $800 for SPECint. Compiling bzip2 with clang -O3 vs. clang -O3 -fsanitize=signed-integer-overflow,unsigned-integer-overflow (which prints out a warning on overflow) vs. the same sanitizers plus -fsanitize-undefined-trap-on-error (which causes a crash on overflow instead of printing a diagnostic), we get the following results on compressing and decompressing 1GB of code and binaries that happened to be lying around on my machine.

options    zip (s)   unzip (s)   zip (ratio)   unzip (ratio)
normal     93        45          1.0           1.0
fsan       119       49          1.28          1.09
fsan ud    94        45          1.01          1.00

In the table, the ratio columns are run time relative to the normal build, not the compression ratio. The difference between fsan ud, unzip and normal, unzip isn't actually 0, but it rounds to 0 if we measure in whole seconds. If we enable good error messages, decompression doesn't slow down all that much (45s v. 49s), but compression is a lot slower (93s v. 119s). The penalty for integer overflow checking is 28% for compression and 9% for decompression if we print out nice diagnostics, but almost nothing if we don't. How is that possible? Bzip2 normally has a couple of unsigned integer overflows. Even if I patch the code to remove those, so that the diagnostic printing code path is never executed, the checks still cause a large performance hit.

Let's check out the penalty when we just do some adds with something like

for (int i = 0; i < n; ++i) {
  sum += a[i];
}

On my machine (a 3.4 GHz Sandy Bridge), this turns out to be about 6x slower with -fsanitize=signed-integer-overflow,unsigned-integer-overflow. Looking at the disassembly, the normal version uses SSE adds, whereas the fsanitize version uses normal adds. Ok, 6x sounds plausible for unchecked SSE adds v. checked adds.

But if I try different permutations of the same loop that don't allow the compiler to emit SSE instructions for the unchecked version, I still get a 4x-6x performance penalty for versions compiled with fsanitize. Since there are a lot of different optimizations in play, including loop unrolling, let's take a look at a simple function that does a single add to get a better idea of what's going on.

Here's the disassembly for a function that adds two ints, first compiled with -O3 and then compiled with -O3 -fsanitize=signed-integer-overflow,unsigned-integer-overflow.
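
For reference, the function being compiled is just one add wrapped in a function, something like this:

// Source for the disassembly below: a single add.
int single_add(int a, int b) {
  return a + b;
}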

0000000000400530 <single_add>:
  400530:       01 f7                   add    %esi,%edi
  400532:       89 f8                   mov    %edi,%eax
  400534:       c3                      retq

The compiler does a reasonable job on the -O3 version. Per the standard AMD64 calling convention, the arguments are passed in via the esi and edi registers, and passed out via the eax register. There's some overhead over an inlined add instruction because we have to move the result to eax and then return from the function call, but considering that it's a function call, it's a totally reasonable implementation.

000000000041df90 <single_add>:
  41df90:       53                      push   %rbx
  41df91:       89 fb                   mov    %edi,%ebx
  41df93:       01 f3                   add    %esi,%ebx
  41df95:       70 04                   jo     41df9b <single_add+0xb>
  41df97:       89 d8                   mov    %ebx,%eax
  41df99:       5b                      pop    %rbx
  41df9a:       c3                      retq
  41df9b:       89 f8                   mov    %edi,%eax
  41df9d:       89 f1                   mov    %esi,%ecx
  41df9f:       bf a0 89 62 00          mov    $0x6289a0,%edi
  41dfa4:       48 89 c6                mov    %rax,%rsi
  41dfa7:       48 89 ca                mov    %rcx,%rdx
  41dfaa:       e8 91 13 00 00          callq  41f340
<__ubsan_handle_add_overflow>
  41dfaf:       eb e6                   jmp    41df97 <single_add+0x7>

The compiler does not do a reasonable job on the -O3 -fsanitize=signed-integer-overflow,unsigned-integer-overflow version. Optimization wizard Nathan Kurz had this to say about clang's output:

That's awful (although not atypical) compiler generated code. For some reason the compiler decided that it wanted to use %ebx as the destination of the add. Once it did this, it has to do the rest. The question would be why it didn't use a scratch register, why it felt it needed to do the move at all, and what can be done to prevent it from doing so in the future. As you probably know, %ebx is a 'callee save' register, meaning that it must have the same value when the function returns --- thus the push and pop. Had the compiler just done the add without the additional mov, leaving the input in %edi/%esi as it was passed (and as done in the non-checked version), this wouldn't be necessary. I'd guess that it's a residue of some earlier optimization pass, but somehow the ghost of %ebx remained.

However, adding -fsanitize-undefined-trap-on-error changes this to

0000000000400530 <single_add>:
  400530:       01 f7                   add    %esi,%edi
  400532:       70 03                   jo     400537 <single_add+0x7>
  400534:       89 f8                   mov    %edi,%eax
  400536:       c3                      retq
  400537:       0f 0b                   ud2

Although this is a tiny, contrived, example, we can see a variety of mis-optimizations in other code compiled with options that allow fsanitize to print out diagnostics.

While a better C compiler could do better, in theory, gcc 4.8.2 doesn't do better than clang 3.4 here. For one thing, gcc's -ftrapv only checks signed overflow. Worse yet, it doesn't work, and this bug on ftrapv has been open since 2008. Despite doing fewer checks and not doing them correctly, gcc's -ftrapv slows things down about as much as clang's -fsanitize=signed-integer-overflow,unsigned-integer-overflow on bzip2, and substantially more than -fsanitize=signed-integer-overflow.

Summing up, integer overflow checks ought to cost a few percent on typical integer-heavy workloads, and they do, as long as you don't want nice error messages. The current mechanism that produces nice error messages somehow causes optimizations to get screwed up in a lot of cases1.

Update

On clang 3.8.0 and after, and gcc 5 and after, register allocation seems to work as expected (although you may need to pass -fno-sanitize-recover). I haven't gone back and re-run my benchmarks across different versions of clang and gcc, but I'd like to do that when I get some time.

CPU internals series

Thanks to Nathan Kurz for comments on this topic, including, but not limited to, the quote that's attributed to him, and to Stan Schwertly, Nick Bergson-Shilcock, Scott Feeney, Marek Majkowski, Adrian and Juan Carlos Borras for typo corrections and suggestions for clarification. Also, huge thanks to Richard Smith, who pointed out the -fsanitize-undefined-trap-on-error option to me. This post was updated with results for that option after Richard's comment. Also, thanks to Filipe Cabecinhas for noticing that clang fixed this behavior in clang 3.8 (released approximately 1.5 years after this post).

John Regehr has some more comments here on why clang's implementation of integer overflow checking isn't fast (yet).


  1. People often call for hardware support for integer overflow checking above and beyond the existing overflow flag. That would add expense and complexity to every chip made to get, at most, a few percent extra performance in the best case, on optimized code. That might be worth it -- there are lots of features Intel adds that only speed up a subset of applications by a few percent.

    This is often described as a chicken and egg problem; people would use overflow checks if checks weren't so slow, and hardware support is necessary to make the checks fast. But there's already hardware support to get good-enough performance for the vast majority of applications. It's just not taken advantage of because people don't actually care about this problem.

    [return]

Malloc tutorial

2014-12-04 08:00:00

Let's write a malloc and see how it works with existing programs!

This is basically an expanded explanation of what I did after reading this tutorial by Marwan Burelle and then sitting down and trying to write my own implementation, so the steps are going to be fairly similar. The main implementation differences are that my version is simpler and more vulnerable to memory fragmentation. In terms of exposition, my style is a lot more casual.

This tutorial is going to assume that you know what pointers are, and that you know enough C to know that *ptr dereferences a pointer, ptr->foo means (*ptr).foo, that malloc is used to dynamically allocate space, and that you're familiar with the concept of a linked list. For a basic intro to C, Pointers on C is one of my favorite books. If you want to look at all of this code at once, it's available here.

Preliminaries aside, malloc's function signature is

void *malloc(size_t size);

It takes as input a number of bytes and returns a pointer to a block of memory of that size.

There are a number of ways we can implement this. We're going to arbitrarily choose to use sbrk. The OS reserves stack and heap space for processes and sbrk lets us manipulate the heap. sbrk(0) returns a pointer to the current top of the heap. sbrk(foo) increments the heap size by foo and returns a pointer to the previous top of the heap.
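
If you want to see this in action before we use it, here's a tiny standalone program that pokes at sbrk directly (not part of the allocator we're building):

#include <stdio.h>
#include <unistd.h>

int main(void) {
  void *before = sbrk(0);    // current top of the heap
  void *old = sbrk(0x1000);  // grow the heap by 0x1000 bytes; returns the old top
  void *after = sbrk(0);     // new top of the heap

  printf("top before: %p\n", before);
  printf("sbrk(0x1000) returned: %p\n", old); // same as "top before"
  printf("top after: %p\n", after);           // 0x1000 past "top before"
  return 0;
}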

Diagram of linux memory layout, courtesy of Gustavo Duarte.

If we want to implement a really simple malloc, we can do something like

#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>


void *malloc(size_t size) {
  void *p = sbrk(0);
  void *request = sbrk(size);
  if (request == (void*) -1) {
    return NULL; // sbrk failed.
  } else {
    assert(p == request); // Not thread safe.
    return p;
  }
}

When a program asks malloc for space, malloc asks sbrk to increment the heap size and returns a pointer to the start of the new region on the heap. This is missing a technicality, that malloc(0) should either return NULL or another pointer that can be passed to free without causing havoc, but it basically works.

But speaking of free, how does free work? Free's prototype is

void free(void *ptr);

When free is passed a pointer that was previously returned from malloc, it's supposed to free the space. But given a pointer to something allocated by our malloc, we have no idea what size block is associated with it. Where do we store that? If we had a working malloc, we could malloc some space and store it there, but we're going to run into trouble if we need to call malloc to reserve space each time we call malloc to reserve space.

A common trick to work around this is to store meta-information about a memory region in some space that we squirrel away just below the pointer that we return. Say the top of the heap is currently at 0x1000 and we ask for 0x400 bytes. Our current malloc will request 0x400 bytes from sbrk and return a pointer to 0x1000. If we instead save, say, 0x10 bytes to store information about the block, our malloc would request 0x410 bytes from sbrk and return a pointer to 0x1010, hiding our 0x10 byte block of meta-information from the code that's calling malloc.
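
Here's a sketch of that header arithmetic in code. The names here are made up for illustration and aren't the ones we'll use below:

#include <stddef.h>

// If block points at the 0x10-byte header, the pointer we hand back is just
// past it, and we can get back to the header by stepping over it again.
struct header {
  size_t size;
  size_t pad; // pad the header to 0x10 bytes on a 64-bit machine
};

void *user_ptr_from_header(struct header *block) {
  return (void *)(block + 1); // e.g., 0x1000 -> 0x1010
}

struct header *header_from_user_ptr(void *ptr) {
  return (struct header *)ptr - 1; // e.g., 0x1010 -> 0x1000
}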

That lets us free a block, but then what? The heap region we get from the OS has to be contiguous, so we can't return a block of memory in the middle to the OS. Even if we were willing to copy everything above the newly freed region down to fill the hole, so we could return space at the end, there's no way to notify all of the code with pointers to the heap that those pointers need to be adjusted.

Instead, we can mark that the block has been freed without returning it to the OS, so that future calls to malloc can re-use the block. But to do that, we'll need to be able to access the meta information for each block. There are a lot of possible solutions to that. We'll arbitrarily choose to use a singly linked list for simplicity.

So, for each block, we'll want to have something like

struct block_meta {
  size_t size;
  struct block_meta *next;
  int free;
  int magic; // For debugging only. TODO: remove this in non-debug mode.
};

#define META_SIZE sizeof(struct block_meta)

We need to know the size of the block, whether or not it's free, and what the next block is. There's a magic number here for debugging purposes, but it's not really necessary; we'll set it to arbitrary values, which will let us easily see which code modified the struct last.

We'll also need a head for our linked list:

void *global_base = NULL;

For our malloc, we'll want to re-use free space if possible, allocating space when we can't re-use existing space. Given that we have this linked list structure, checking if we have a free block and returning it is straightforward. When we get a request of some size, we iterate through our linked list to see if there's a free block that's large enough.

struct block_meta *find_free_block(struct block_meta **last, size_t size) {
  struct block_meta *current = global_base;
  while (current && !(current->free && current->size >= size)) {
    *last = current;
    current = current->next;
  }
  return current;
}

If we don't find a free block, we'll have to request space from the OS using sbrk and add our new block to the end of the linked list.

struct block_meta *request_space(struct block_meta* last, size_t size) {
  struct block_meta *block;
  block = sbrk(0);
  void *request = sbrk(size + META_SIZE);
  if (request == (void*) -1) {
    return NULL; // sbrk failed.
  }
  assert((void*)block == request); // Not thread safe.

  if (last) { // NULL on first request.
    last->next = block;
  }
  block->size = size;
  block->next = NULL;
  block->free = 0;
  block->magic = 0x12345678;
  return block;
}

As with our original implementation, we request space using sbrk. But we add a bit of extra space to store our struct, and then set the fields of the struct appropriately.

Now that we have helper functions to check if we have existing free space and to request space, our malloc is simple. If our global base pointer is NULL, we need to request space and set the base pointer to our new block. If it's not NULL, we check to see if we can re-use any existing space. If we can, then we do; if we can't, then we request space and use the new space.

void *malloc(size_t size) {
  struct block_meta *block;
  // TODO: align size?

  if (size <= 0) {
    return NULL;
  }

  if (!global_base) { // First call.
    block = request_space(NULL, size);
    if (!block) {
      return NULL;
    }
    global_base = block;
  } else {
    struct block_meta *last = global_base;
    block = find_free_block(&last, size);
    if (!block) { // Failed to find free block.
      block = request_space(last, size);
      if (!block) {
        return NULL;
      }
    } else {      // Found free block
      // TODO: consider splitting block here.
      block->free = 0;
      block->magic = 0x77777777;
    }
  }

  return(block+1);
}

For anyone who isn't familiar with C, we return block+1 because we want to return a pointer to the region after block_meta. Since block is a pointer of type struct block_meta, +1 increments the address by one sizeof(struct block_meta).

If we just wanted a malloc without a free, we could have used our original, much simpler malloc. So let's write free! The main thing free needs to do is set ->free.

Because we'll need to get the address of our struct in multiple places in our code, let's define this function.

struct block_meta *get_block_ptr(void *ptr) {
  return (struct block_meta*)ptr - 1;
}

Now that we have that, here's free:

void free(void *ptr) {
  if (!ptr) {
    return;
  }

  // TODO: consider merging blocks once splitting blocks is implemented.
  struct block_meta* block_ptr = get_block_ptr(ptr);
  assert(block_ptr->free == 0);
  assert(block_ptr->magic == 0x77777777 || block_ptr->magic == 0x12345678);
  block_ptr->free = 1;
  block_ptr->magic = 0x55555555;
}

In addition to setting ->free, we need to handle the case where free is called with a NULL ptr, which is valid, so we check for NULL first. Since free shouldn't be called on arbitrary addresses or on blocks that are already freed, we can assert that those things never happen.

You never really need to assert anything, but it often makes debugging a lot easier. In fact, when I wrote this code, I had a bug that would have resulted in silent data corruption if these asserts weren't there. Instead, the code failed at the assert, which made it trivial to debug.

Now that we've got malloc and free, we can write programs using our custom memory allocator! But before we can drop our allocator into existing code, we'll need to implement a couple more common functions, realloc and calloc. Calloc is just malloc that initializes the memory to 0, so let's look at realloc first. Realloc is supposed to adjust the size of a block of memory that we've gotten from malloc, calloc, or realloc.

Realloc's function prototype is

void *realloc(void *ptr, size_t size)

If we pass realloc a NULL pointer, it's supposed to act just like malloc. If we pass it a previously malloced pointer, it should free up space if the size is smaller than the previous size, and allocate more space and copy the existing data over if the size is larger than the previous size.

Everything will still work if we don't resize when the size is decreased and we don't free anything, but we absolutely have to allocate more space if the size is increased, so let's start with that.

void *realloc(void *ptr, size_t size) {
  if (!ptr) {
    // NULL ptr. realloc should act like malloc.
    return malloc(size);
  }

  struct block_meta* block_ptr = get_block_ptr(ptr);
  if (block_ptr->size >= size) {
    // We have enough space. Could free some once we implement split.
    return ptr;
  }

  // Need to really realloc. Malloc new space and free old space.
  // Then copy old data to new space.
  void *new_ptr;
  new_ptr = malloc(size);
  if (!new_ptr) {
    return NULL; // TODO: set errno on failure.
  }
  memcpy(new_ptr, ptr, block_ptr->size);
  free(ptr);
  return new_ptr;
}

And now for calloc, which just clears the memory before returning a pointer.

void *calloc(size_t nelem, size_t elsize) {
  size_t size = nelem * elsize; // TODO: check for overflow.
  void *ptr = malloc(size);
  memset(ptr, 0, size);
  return ptr;
}

Note that this doesn't check for overflow in nelem * elsize, which is actually required by the spec. All of the code here is just enough to get something that kinda sorta works.
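
For reference, here's a sketch (mine, and not wired into the rest of this code) of what an overflow-checked calloc might look like; it returns NULL if nelem * elsize would overflow size_t, and also skips the memset if malloc fails:

void *calloc(size_t nelem, size_t elsize) {
  if (elsize != 0 && nelem > (size_t)-1 / elsize) {
    return NULL; // nelem * elsize would overflow size_t.
  }
  size_t size = nelem * elsize;
  void *ptr = malloc(size);
  if (ptr) {
    memset(ptr, 0, size);
  }
  return ptr;
}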

Now that we have something that kinda works, we can use our allocator with existing programs (and we don't even need to recompile the programs)!

First, we need to compile our code. On linux, something like

clang -O0 -g -W -Wall -Wextra -shared -fPIC malloc.c -o malloc.so

should work.

-g adds debug symbols, so we can look at our code with gdb or lldb. -O0 will help with debugging, by preventing individual variables from getting optimized out. -W -Wall -Wextra adds extra warnings. -shared -fPIC will let us dynamically link our code, which is what lets us use our code with existing binaries!

On macs, we'll want something like

clang -O0 -g -W -Wall -Wextra -dynamiclib malloc.c -o malloc.dylib

Note that sbrk is deprecated on recent versions of OS X. Apple uses an unorthodox definition of deprecated -- some deprecated syscalls are badly broken. I didn't really test this on a Mac, so it's possible that this will cause weird failures or just not work on a Mac.
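
If sbrk turns out to be too broken to use, one alternative is to back allocations with mmap instead. Here's a rough sketch (mine, untested on a Mac) of a helper that could replace the sbrk calls in request_space; each request gets its own anonymous mapping, so blocks are no longer contiguous and the sbrk(0)/assert check would have to go, but the linked-list bookkeeping doesn't care:

#include <sys/mman.h>

// Hypothetical replacement for the sbrk calls in request_space.
static void *request_bytes(size_t size) {
  void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANON, -1, 0);
  return (p == MAP_FAILED) ? NULL : p;
}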

Now, to get a binary to use our malloc on linux, we'll need to set the LD_PRELOAD environment variable. If you're using bash, you can do that with

export LD_PRELOAD=/absolute/path/here/malloc.so

If you've got a mac, you'll want

export DYLD_INSERT_LIBRARIES=/absolute/path/here/malloc.dylib

If everything works, you can run some arbitrary binary and it will run as normal (except that it will be a bit slower).

$ ls
Makefile  malloc.c  malloc.so  README.md  test  test-0  test-1  test-2  test-3  test-4

If there's a bug, you might get something like

$ ls
Segmentation fault (core dumped)

Debugging

Let's talk about debugging! If you're familiar with using a debugger to set breakpoints, inspect memory, and step through code, you can skip this section and go straight to the exercises.

This section assumes you can figure out how to install gdb on your system. If you're on a mac, you may want to just use lldb and translate the commands appropriately. Since I don't know what bugs you might run into, I'm going to introduce a couple of bugs and show how I'd track them down.

First, we need to figure out how to run gdb without having it segfault. If ls segfaults, and we try to run gdb ls, gdb is almost certainly going to segfault, too. We could write a wrapper script that only sets LD_PRELOAD for the program we're debugging, but gdb can also set environment variables for us. If we start gdb and then run set environment LD_PRELOAD=./malloc.so before running the program, LD_PRELOAD will work as normal.

$ gdb /bin/ls
(gdb) set environment LD_PRELOAD=./malloc.so
(gdb) run
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7bd7dbd in free (ptr=0x0) at malloc.c:113
113       assert(block_ptr->free == 0);

As expected, we get a segfault. We can look around with list to see the code near the segfault.

(gdb) list
108     }
109
110     void free(void *ptr) {
111       // TODO: consider merging blocks once splitting blocks is implemented.
112       struct block_meta* block_ptr = get_block_ptr(ptr);
113       assert(block_ptr->free == 0);
114       assert(block_ptr->magic == 0x77777777 || block_ptr->magic == 0x12345678);
115       block_ptr->free = 1;
116       block_ptr->magic = 0x55555555;
117     }

And then we can use p (for print) to see what's going on with the variables here:

(gdb) p ptr
$6 = (void *) 0x0
(gdb) p block_ptr
$7 = (struct block_meta *) 0xffffffffffffffe8

ptr is 0, i.e., NULL, which is the cause of the problem: we forgot to check for NULL.

Now that we've figured that out, let's try a slightly harder bug. Let's say that we decided to replace our struct with

struct block_meta {
  size_t size;
  struct block_meta *next;
  int free;
  int magic;    // For debugging only. TODO: remove this in non-debug mode.
  char data[1];
};

and then return block->data instead of block+1 from malloc, with no other changes. This seems pretty similar to what we're already doing -- we just define a member at the end of the struct and return a pointer to it.

But here's what happens if we try to use our new malloc:

$ /bin/ls
Segmentation fault (core dumped)
gdb /bin/ls
(gdb) set environment LD_PRELOAD=./malloc.so
(gdb) run

Program received signal SIGSEGV, Segmentation fault.
_IO_vfprintf_internal (s=s@entry=0x7fffff7ff5f0, format=format@entry=0x7ffff7567370 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", ap=ap@entry=0x7fffff7ff718) at vfprintf.c:1332
1332    vfprintf.c: No such file or directory.
1327    in vfprintf.c

This isn't as nice as our last error -- we can see that one of our asserts failed, but gdb drops us into some print function that's being called when the assert fails. But that print function uses our buggy malloc and blows up!

One thing we could do from here would be to inspect ap to see what assert was trying to print:

(gdb) p *ap
$4 = {gp_offset = 16, fp_offset = 48, overflow_arg_area = 0x7fffff7ff7f0, reg_save_area = 0x7fffff7ff730}

That would work fine; we could poke around until we figure out what's supposed to get printed and work out the failure that way. Some other solutions would be to write our own custom assert or to use the right hooks to prevent assert from using our malloc.

But in this case, we know there are only a few asserts in our code: the one in malloc checking that we don't try to use this in a multithreaded program, and the two in free checking that we're not freeing something we shouldn't. Let's look at free first, by setting a breakpoint.

$ gdb /bin/ls
(gdb) set environment LD_PRELOAD=./malloc.so
(gdb) break free
Breakpoint 1 at 0x400530
(gdb) run /bin/ls

Breakpoint 1, free (ptr=0x61c270) at malloc.c:112
112       if (!ptr) {

block_ptr isn't set yet, but if we use s a few times to step forward to after it's set, we can see what the value is:

(gdb) s
(gdb) s
(gdb) s
free (ptr=0x61c270) at malloc.c:118
118       assert(block_ptr->free == 0);
(gdb) p/x *block_ptr
$11 = {size = 0, next = 0x78, free = 0, magic = 0, data = ""}

I'm using p/x instead of p so we can see it in hex. The magic field is 0, which should be impossible for a valid struct that we're trying to free. Maybe get_block_ptr is returning a bad offset? We have ptr available to us, so we can just inspect different offsets. Since it's a void *, we'll have to cast it so that gdb knows how to interpret the results.

(gdb) p sizeof(struct block_meta)
$12 = 32
(gdb) p/x *(struct block_meta*)(ptr-32)
$13 = {size = 0x0, next = 0x78, free = 0x0, magic = 0x0, data = {0x0}}
(gdb) p/x *(struct block_meta*)(ptr-28)
$14 = {size = 0x7800000000, next = 0x0, free = 0x0, magic = 0x0, data = {0x78}}
(gdb) p/x *(struct block_meta*)(ptr-24)
$15 = {size = 0x78, next = 0x0, free = 0x0, magic = 0x12345678, data = {0x6e}}

If we back off a bit from the address we're using, we can see that the correct offset is 24 and not 32. What's happening here is that structs get padded, so that sizeof(struct block_meta) is 32, even though the last valid member is at 24. If we want to cut out that extra space, we need to fix get_block_ptr.
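
One way to fix it (a sketch, assuming we keep the data member) is to compute the offset with offsetof instead of assuming it's sizeof(struct block_meta):

#include <stddef.h>

struct block_meta *get_block_ptr(void *ptr) {
  // data sits at offset 24 here, even though sizeof(struct block_meta) is 32.
  return (struct block_meta *)((char *)ptr - offsetof(struct block_meta, data));
}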

That's it for debugging!

Exercises

Personally, this sort of thing never sticks with me unless I work through some exercises, so I'll leave a couple exercises here for anyone who's interested.

  1. malloc is supposed to return a pointer “which is suitably aligned for any built-in type”. Does our malloc do that? If so, why? If not, fix the alignment. Note that “any built-in type” is basically up to 8 bytes for C because SSE/AVX types aren't built-in types.

  2. Our malloc is really wasteful if we try to re-use an existing block and we don't need all of the space. Implement a function that will split up blocks so that they use the minimum amount of space necessary.

  3. After doing 2, if we call malloc and free lots of times with random sizes, we'll end up with a bunch of small blocks that can only be re-used when we ask for small amounts of space. Implement a mechanism to merge adjacent free blocks together so that any consecutive free blocks will get merged into a single block.

  4. Find bugs in the existing code! I haven't tested this much, so I'm sure there are bugs, even if this basically kinda sorta works.

Resources

As noted above, there's Marwan Burelle's tutorial.

For more on how Linux deals with memory management, see this post by Gustavo Duarte.

For more on how real-world malloc implementations work, dlmalloc and tcmalloc are both great reading. I haven't read the code for jemalloc, and I've heard that it's a bit more difficult to understand, but it's also the most widely used high-performance malloc implementation around.

For help debugging, Address Sanitizer is amazing. If you want to write a thread-safe version, Thread Sanitizer is also a great tool.

There's a Spanish translation of this post here thanks to Matias Garcia Isaia.

Acknowledgements

Thanks to Gustavo Duarte for letting me use one of his images to illustrate sbrk, and to Ian Whitlock, Danielle Sucher, Nathan Kurz, "tedu", @[email protected], and David Farrel for comments/corrections/discussion. Please let me know if you find other bugs in this post (whether they're in the writing or the code).

Markets, discrimination, and "lowering the bar"

2014-12-01 08:00:00

Public discussions of discrimination in tech often result in someone claiming that discrimination is impossible because of market forces. Here's a quote from Marc Andreessen that sums up a common view1.

Let's launch right into it. I think the critique that Silicon Valley companies are deliberately, systematically discriminatory is incorrect, and there are two reasons to believe that that's the case. ... No. 2, our companies are desperate for talent. Desperate. Our companies are dying for talent. They're like lying on the beach gasping because they can't get enough talented people in for these jobs. The motivation to go find talent wherever it is unbelievably high.

Marc Andreessen's point is that the market is too competitive for discrimination to exist. But VC funded startups aren't the first companies in the world to face a competitive hiring market. Consider the market for PhD economists from, say, 1958 to 1987. Alan Greenspan had this to say about how that market looked to his firm, Townsend-Greenspan.

Townsend-Greenspan was unusual for an economics firm in that the men worked for the women (we had about twenty-five employees in all). My hiring of women economists was not motivated by women's liberation. It just made great business sense. I valued men and women equally, and found that because other employers did not, good women economists were less expensive than men. Hiring women . . . gave Townsend-Greenspan higher-quality work for the same money . . .

Not only did competition not end discrimination, there was enough discrimination that the act of not discriminating provided a significant competitive advantage for Townsend-Greenspan. And this is in finance, which is known for being cutthroat. And not just any part of finance, but one where it's PhD economists hiring other PhD economists. This is one of the industries where the people doing the hiring are the most likely to be familiar with both the theoretical models and the empirical research showing that discrimination opens up market opportunities by suppressing wages of some groups. But even that wasn't enough to equalize wages between men and women when Greenspan took over Townsend-Greenspan in 1958 and it still wasn't enough when Greenspan left to become chairman of the Fed in 1987. That's the thing about discrimination. When it's part of a deep-seated belief, it's hard for people to tell that they're discriminating.

And yet, in discussions on tech hiring, people often claim that, since markets and hiring are perfectly competitive or efficient, companies must already be hiring the best people presented to them. A corollary of this is that anti-discrimination or diversity oriented policies necessarily mean "lowering the bar", since these would mean diverging from existing optimal hiring practices. And conversely, even when "market forces" aren't involved in the discussion, claiming that increasing hiring diversity necessarily means "lowering the bar" relies on an assumption of a kind of optimality in hiring. I think that an examination of tech hiring practices makes it pretty clear that practices are far from optimal, but rather than address this claim based on practices (which has been done in the linked posts), I'd like to look at the meta-claim that market forces make discrimination impossible. People make vague claims about market efficiency and economics, like this influential serial founder who concludes his remarks on hiring with "Capitalism is real and markets are efficient."2. People seem to love handwave-y citations of "the market" or "economists".

But if we actually read what economists have to say on how hiring markets work, they do not, in general, claim that markets are perfectly efficient or that discrimination does not occur in markets that might colloquially be called highly competitive. Since we're talking about discrimination, a good place to start might be Becker's seminal work on discrimination. What Becker says is that markets impose a cost on discrimination, and that under certain market conditions, what Becker calls "taste-based"3 discrimination occurring on average doesn't mean there's discrimination at the margin. This is quite a specific statement and, if you read other papers in the literature on discrimination, they also make similarly specific statements. What you don't see is anything like the handwave-y claims in tech discussions, that "market forces" or "competition" is incompatible with discrimination or non-optimal hiring. Quite frankly, I've never had a discussion with someone who says things like "Capitalism is real and markets are efficient" where it appears that they have even a passing familiarity with Becker's seminal work in the field of the economics of discrimination or, for that matter, any other major work on the topic.

In discussions among the broader tech community, I have never seen anyone make a case that the tech industry (or any industry) meets the conditions under which taste-based discrimination on average doesn't imply marginal taste-based discrimination. Nor have I ever seen people make the case that we only have taste-based discrimination or that we also meet the conditions for not having other forms of discrimination. When people cite "efficient markets" with respect to hiring or other parts of tech, it's generally vague handwaving that sounds like an appeal to authority, but the authority is what someone might call a teenage libertarian's idea of how markets might behave.

Since people often don't find abstract reasoning of the kind you see in Becker's work convincing, let's look at a few concrete examples. You can see discrimination in a lot of fields. A problem is that it's hard to separate out the effect of discrimination from confounding variables because it's hard to get good data on employee performance v. compensation over time. Luckily, there's one set of fields where that data is available: sports. And before we go into the examples, it's worth noting that we should, directionally, expect much less discrimination in sports than in tech. Not only is there much better data available on employee performance, it's easier to predict future employee performance from past performance, the impact of employee performance on "company" performance is greater and easier to quantify, and the market is more competitive. Relative to tech, these forces both increase the cost of discrimination and make that cost more visible.

In baseball, Gwartney and Haworth (1974) found that teams that discriminated less against non-white players in the decade following de-segregation performed better. Studies of later decades using “classical” productivity metrics mostly found that salaries equalize. However, Swartz (2014), using newer and more accurate metrics for productivity, found that Latino players are significantly underpaid for their productivity level. Compensation isn't the only way to discriminate -- Jibou (1988) found that black players had higher exit rates from baseball after controlling for age and performance. This should sound familiar to anyone who's wondered about exit rates in tech fields.

This slow effect of the market isn't limited to baseball; it actually seems to be worse in other sports. A review article by Kahn (1991) notes that in basketball, the most recent studies (up to the date of the review) found an 11%-25% salary penalty for black players as well as a higher exit rate. Kahn also noted multiple studies showing discrimination against French-Canadians in hockey, which is believed to be due to stereotypes about how French-Canadian men are less masculine than other men4.

In tech, some people are concerned that increasing diversity will "lower the bar", but in sports, which has a more competitive hiring market than tech, we saw the opposite: increasing diversity raised the level instead of lowering it, because it meant hiring people on their qualifications instead of on what they look like. I don't disagree with people who say that it would be absurd for tech companies to leave money on the table by not hiring qualified minorities. But this is exactly what we saw in the sports we looked at, where that's even more absurd due to the relative ease of quantifying performance. And yet, for decades, teams left huge amounts of money on the table by favoring white players (and, in the case of hockey, non-French Canadian players) who were, quite simply, less qualified than their peers. The world is an absurd place.

In fields where there's enough data to see if there might be discrimination, we often find discrimination. Even in fields that are among the most competitive fields in existence, like major professional sports. Studies on discrimination aren't limited to empirical studies and data mining. There have been experiments showing discrimination at every level, from initial resume screening to phone screening to job offers to salary negotiation to workplace promotions. And those studies are mostly in fields where there's something resembling gender parity. In fields where discrimination is weak enough that there's gender parity or racial parity in entrance rates, we can see steadily decreasing levels of discrimination over the last two generations. Discrimination hasn't been eliminated, but it's much reduced.

Graph of enrollment by gender in med school, law school, the sciences, and CS. Graph courtesy of NPR.

And then we have computer science. The disparity in entrance rates is about what it was for medicine, law, and the physical sciences in the 70s. As it happens, the excuses for the gender disparity are the exact same excuses that were trotted out in the 70s to justify why women didn't want to go into or couldn't handle technical fields like medicine, economics, finance, and biology.

One argument that's commonly made is that women are inherently less interested in the "harder" sciences, so you'd expect more women to go into biology or medicine than programming. There are two major reasons I don't find that line of reasoning to be convincing. First, proportionally more women go into fields like math and chemical engineering than go into programming. I think it's pointless to rank math and the sciences by how "hard science" they are, but if you ask people to rank these things, most people will put math above programming, and if they know what's involved in a chemical engineering degree, I think they'll also put chemical engineering above programming -- and yet those fields have proportionally more women than programming. Second, if you look at other countries, they have wildly different proportions of people who study computer science for reasons that seem to mostly be cultural. Given that we do see all of this variation, I don't see any reason to think that the U.S. reflects the "true" rate that women want to study programming and that countries where (proportionally) many more women want to study programming have rates that are distorted from the "true" rate by cultural biases.

Putting aside theoretical arguments, I wonder how it is that I've had such a different lived experience than Andreessen. His reasoning must sound reasonable in his head and stories of discrimination from women and minorities must not ring true. But to me, it's just the opposite.

Just the other day, I was talking to John (this and all other names were chosen randomly in order to maintain anonymity), a friend of mine who's a solid programmer. It took him two years to find a job, which is shocking in today's job market for someone my age, but sadly normal for someone like him, who's twice my age.

You might wonder if it's something about John besides his age, but when a Google coworker and I mock interviewed him he did fine. I did the standard interview training at Google and I interviewed for Google, and when I compare him to that bar, I'd say that his getting hired at Google would pretty much be a coin flip. Yes on a good day; no on a bad day. And when he interviewed at Google, he didn't get an offer, but he passed the phone screen and after the on-site they strongly suggested that he apply again in a year, which is a good sign. But most places wouldn't even talk to John.

And even at Google, which makes a lot of hay about removing bias from their processes, the processes often fail to do so. When I referred Mary to Google, she got rejected in the recruiter phone screen as not being technical enough and I saw William face increasing levels of ire from a manager because of a medical problem, which eventually caused him to quit.

Of course, in online discussions, people will call into question the technical competency of people like Mary. Well, Mary is one of the most impressive engineers I've ever met in any field. People mean different things when they say that, so let me provide a frame of reference: the other folks who fall into that category for me include an IBM Fellow, the person that IBM Fellow called the best engineer at IBM, a Math Olympiad medalist who's now a professor at CMU, a distinguished engineer at Sun, and a few other similar folks.

So anyway, Mary gets on the phone with a Google recruiter. The recruiter makes some comments about how Mary has a degree in math and not CS, and might not be technical enough, and questions Mary's programming experience: was it “algorithms” or “just coding”? It goes downhill from there.

Google has plenty of engineers without a CS degree, people with degrees in history, music, and the arts, and lots of engineers without any degree at all, not even a high school diploma. But somehow a math degree plus my internal referral mentioning that this was one of the best engineers I've ever seen resulted in the decision that Mary wasn't technical enough.

You might say that, like the example with John, this is some kind of a fluke. Maybe. But from what I've seen, if Mary were a man and not a woman, the odds of a fluke would have been lower.

This dynamic isn't just limited to hiring. I notice it every time I read the comments on one of Anna's blog posts. As often as not, someone will question Anna's technical chops. It's not even that they find a "well, actually" in the current post (although that sometimes happens); it's usually that they dig up some post from six months ago which, according to them, wasn't technical enough.

I'm no more technical than Anna, but I have literally never had that happen to me. I've seen it happen to men, but only those who are extremely high profile (among the top N most well-known tech bloggers, like Steve Yegge or Jeff Atwood), or who are pushing an agenda that's often condescended to (like dynamic languages). But it regularly happens to moderately well-known female bloggers like Anna.

Differential treatment of women and minorities isn't limited to hiring and blogging. I've lost track of the number of times a woman has offhandedly mentioned to me that some guy assumed she was a recruiter, a front-end dev, a wife, a girlfriend, or a UX consultant. It happens everywhere. At conferences. At parties full of devs. At work. Everywhere. Not only has that never happened to me, the opposite regularly happens to me -- if I'm hanging out with physics or math grad students, people assume I'm a fellow grad student.

When people bring up the market in discussions like these, they make it sound like it's a force of nature. It's not. It's just a word that describes the collective actions of people under some circumstances. Mary's situation didn't automatically get fixed because it's a free market. Mary's rejection by the recruiter got undone when I complained to my engineering director, who put me in touch with an HR director who patiently listened to the story and overturned the decision5. The market is just humans. It's humans all the way down.

We can fix this, if we stop assuming the market will fix it for us.

Also, note that although this post was originally published in 2014, it was updated in 2020 with links to some more recent comments and a bit of re-organization.

Thanks to Leah Hanson, Kelley Eskridge, Lindsey Kuper, Nathan Kurz, Scott Feeney, Katerina Barone-Adesi, Yuri Vishnevsky, @teles_dev, "Negative12DollarBill", and Patrick Roberts for feedback on this post, and to Julia Evans for encouraging me to post this when I was on the fence about writing this up publicly.

Note that all names in this post are aliases, taken from a list of common names in the U.S. as of 1880.


  1. If you're curious what his “No. 1” was, it was that there can't be discrimination because just look at all the diversity we have. Chinese. Indians. Vietnamese. And so on. The argument is that it's not possible that we're discriminating against some groups because we're not discriminating against other groups. In particular, it's not possible that we're discriminating against groups that don't fit the stereotypical engineer mold because we're not discriminating against groups that do fit the stereotypical engineer mold. [return]
  2. See also, this comment by Benedict Evans "refuting" a comment that SV companies may have sub-optimal hiring practices for employees by saying "I don’t have to tell you that there is a ferocious war for talent in the valley." That particular comment isn't one about diversity or discrimination, but the general idea that the SV job market somehow enforces a kind of optimality is pervasive among SV thought leaders. [return]
  3. "taste-based" discrimination is discrimination based on preferences that are unrelated to any actual productivity differences between groups that might exist. Of course, it's common for people to claim that they've never seen racism or sexism in some context, often with the implication and sometimes with an explicit claim that any differences we see are due to population level differences. If that were the case, we'd want to look at the literature on "statistical" discrimination. However, statistical discrimination doesn't seem like it should be relevant to this discussion. A contrived example of a case where statistical discrimination would be relevant is if we had to hire basketball players solely off of their height and weight with no ability to observe their play, either directly or statistically.

    In that case, teams would want to exclusively hire tall basketball players, since, if all you have to go on is height, height is a better proxy for basketball productivity than nothing. However, if we consider the non-contrived example of actual basketball productivity and compare the actual productivity of NBA basketball players vs. their height, there is (with the exception of outliers who are very unusually short for basketball players), no correlation between height and performance. The reason is that, if we can measure performance directly, we can simply hire based on performance, which takes height out of the performance equation. The exception to this is for very short players, who have to overcome biases (taste-based discrimination) that cause people to overlook them.

    While measures of programming productivity are quite poor, the actual statistical correlation between race and gender and productivity among the entire population is zero as best as anyone can tell, making statistical discrimination irrelevant.

    [return]
  4. The evidence here isn't totally unequivocal. In the review, Kahn notes that for some areas, there are early studies finding no pay gap, but those were done with small samples of players. Also, Kahn notes that (at the time), there wasn't enough evidence in football to say much either way. [return]
  5. In the interest of full disclosure, this didn't change the end result, since Mary didn't want to have anything to do with Google after the first interview. Given that the first interview went how it did, making that Mary's decision and not Google's was probably the best likely result, though, and from the comments I heard from the HR director, it sounded like there might be a lower probability of the same thing happening again in the future. [return]

TF-IDF linux commits

2014-11-24 08:00:00

I was curious what different people worked on in Linux, so I tried grabbing data from the current git repository to see if I could pull that out of commit message data. This doesn't include history from before they switched to git, so it only goes back to 2005, but that's still a decent chunk of history.

Here's a list of the most commonly used words (in commit messages), by the top four most frequent committers, with users ordered by number of commits.

User 1 2 3 4 5
viro to in of and the
tiwai alsa the - for to
broonie the to asoc for a
davem the to in and sparc64


Alright, so their most frequently used words are to, alsa, the, and the. Turns out, Takashi Iwai (tiwai) often works on audio (alsa), and by going down the list we can see that David Miller's (davem) fifth most frequently used term is sparc64, which is a pretty good indicator that he does a lot of sparc work. But the table is mostly noise. Of course people use to, in, and other common words all the time! Putting that into a table provides zero information.

There are a number of standard techniques for dealing with this. One is to explicitly filter out "stop words", common words that we don't care about. Unfortunately, that doesn't work well with this dataset without manual intervention. Standard stop-word lists are going to miss things like Signed-off-by and cc, which are pretty uninteresting. We can generate a custom list of stop words using some threshold for common words in commit messages, but any threshold high enough to catch all of the noise is also going to catch commonly used but interesting terms like null and driver.

Luckily, it only takes about a minute to do by hand. After doing that, the result is that many of the top words are the same for different committers. I won't reproduce the table of top words by committer because it's just many of the same words repeated many times. Instead, here's the table of the top words (ranked by number of commit messages that use the word, not raw count), with stop words removed, which has the same data without the extra noise of being broken up by committer.

Word Count
driver 49442
support 43540
function 43116
device 32915
arm 28548
error 28297
kernel 23132
struct 18667
warning 17053
memory 16753
update 16088
bit 15793
usb 14906
bug 14873
register 14547
avoid 14302
pointer 13440
problem 13201
x86 12717
address 12095
null 11555
cpu 11545
core 11038
user 11038
media 10857
build 10830
missing 10508
path 10334
hardware 10316


Ok, so there's been a lot of work on arm, lots of stuff related to memory, null, pointer, etc. But if we want to see what individuals work on, we'll need something else.

That something else could be penalizing more common words without eliminating them entirely. A standard metric to normalize by is the inverse document frequency (IDF), log(# of messages / # of messages with word). So instead of ordering by term count or term frequency, let's try ordering by (term frequency) * log(# of messages / # of messages with word), which is commonly called TF-IDF1. This gives us words that one person used that aren't commonly used by other people.
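
Concretely, the score for one word and one committer is just the following (a C sketch of the formula; the actual analysis was done in Julia):

#include <math.h>

// term_frequency: how often this committer uses the word.
// The log term down-weights words that appear in lots of messages overall.
double tf_idf(double term_frequency, double total_messages, double messages_with_word) {
  return term_frequency * log(total_messages / messages_with_word);
}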

Here's a list of the top 40 linux committers and their most commonly used words, according to TF-IDF.

User 1 2 3 4 5
viro switch annotations patch of endianness
tiwai alsa hda codec codecs hda-codec
broonie asoc regmap mfd regulator wm8994
davem sparc64 sparc we kill fix
gregkh cc staging usb remove hank
mchehab v4l/dvb media at were em28xx
tglx x86 genirq irq prepare shared
hsweeten comedi staging tidy remove subdevice
mingo x86 sched zijlstra melo peter
joe unnecessary checkpatch convert pr_ use
tj cgroup doesnt which it workqueue
lethal sh up off sh64 kill
axel.lin regulator asoc convert thus use
hch xfs sgi-pv sgi-modid remove we
sachin.kamat redundant remove simpler null of_match_ptr
bzolnier ide shtylyov sergei acked-by caused
alan tty gma500 we up et131x
ralf mips fix build ip27 of
johannes.berg mac80211 iwlwifi it cfg80211 iwlagn
trond.myklebust nfs nfsv4 sunrpc nfsv41 ensure
shemminger sky2 net_device_ops skge convert bridge
bunk static needlessly global patch make
hartleys comedi staging remove subdevice driver
jg1.han simpler device_release unnecessary clears thus
akpm cc warning fix function patch
rmk+kernel arm acked-by rather tested-by we
daniel.vetter drm/i915 reviewed-by v2 wilson vetter
bskeggs drm/nouveau drm/nv50 drm/nvd0/disp on chipsets
acme galbraith perf weisbecker eranian stephane
khali hwmon i2c driver drivers so
torvalds linux commit just revert cc
chris drm/i915 we gpu bugzilla whilst
neilb md array so that we
lars asoc driver iio dapm of
kaber netfilter conntrack net_sched nf_conntrack fix
dhowells keys rather key that uapi
heiko.carstens s390 since call of fix
ebiederm namespace userns hallyn serge sysctl
hverkuil v4l/dvb ivtv media v4l2 convert


That's more like it. Some common words still appear -- this would really be improved with manual stop words to remove things like cc and of. But for the most part, we can see who works on what. Takashi Iwai (tiwai) spends a lot of time in hda land and working on codecs, David S. Miller (davem) has spent a lot of time on sparc64, Ralf Baechle (ralf) does a lot of work with mips, etc. And then again, maybe it's interesting that some, but not all, people cc other folks so much that it shows up in their top 5 list even after getting penalized by IDF.

We can also use this to see the distribution of what people talk about in their commit messages vs. how often they commit.

Who cares about null? These people!

This graph has people on the x-axis and relative word usage (ranked by TF-IDF) on the y-axis. The most frequent committers are on the left and the least frequent on the right. Points are higher up if that committer used the word null more frequently, and lower if they used it less frequently.

Who cares about POSIX? Almost nobody!

Relatively, almost no one works on POSIX compliance. You can actually count the individual people who mentioned POSIX in commit messages.

This is the point of the blog post where you might expect some kind of summary, or at least a vague point. Sorry. No such luck. I just did this because TF-IDF is one of a zillion concepts presented in the Mining Massive Data Sets course running now, and I knew it wouldn't really stick unless I wrote some code.

If you really must have a conclusion, TF-IDF is sometimes useful and incredibly easy to apply. You should use it when you should use it (when you want to see what words distinguish different documents/people from each other) and you shouldn't use it when you shouldn't use it (when you want to see what's common to documents/people). The end.

I'm experimenting with blogging more by spending less time per post and just spewing stuff out in a 30-90 minute sitting. Please let me know if something is unclear or just plain wrong. Seriously. I went way over time on this one, but that's mostly because argh data and tables and bugs in Julia, not because of proofreading. I'm sure there are bugs!

Thanks to Leah Hanson for finding a bunch of writing bugs in this post and to Zack Maril for a conversation on how to maybe display change over time in the future.


  1. I actually don't understand why it's standard to take the log here. Sometimes you want to take the log so you can work with smaller numbers, or so that you can convert a bunch of multiplies into a bunch of adds, but neither of those is true here. Please let me know if this is obvious to you. [return]

One week of bugs

2014-11-18 08:00:00

If I had to guess, I'd say I probably work around hundreds of bugs in an average week, and thousands in a bad week. It's not unusual for me to run into a hundred new bugs in a single week. But I often get skepticism when I mention that I run into multiple new (to me) bugs per day, and that this is inevitable if we don't change how we write tests. Well, here's a log of one week of bugs, limited to bugs that were new to me that week. After a brief description of the bugs, I'll talk about what we can do to improve the situation. The obvious answer is to spend more effort on testing, but everyone already knows we should do that and no one does it. That doesn't mean it's hopeless, though.

One week of bugs

Ubuntu

When logging into my machine, I got a screen saying that I entered my password incorrectly. After a five second delay, it logged me in anyway. This is probably at least two bugs, perhaps more.

GitHub

GitHub switched from Pygments to whatever they use for Atom, breaking syntax highlighting for most languages. The HN comments on this indicate that it's not just something that affects obscure languages; Java, PHP, C, and C++ all have noticeable breakage.

In a GitHub issue, a GitHub developer says

You're of course free to fork the Racket bundle and improve it as you see fit. I'm afraid nobody at GitHub works with Racket so we can't judge what proper highlighting looks like. But we'll of course pull your changes thanks to the magic of O P E N S O U R C E.

A bit ironic after the recent keynote talk by another GitHub employee titled “move fast and break nothing”. Not to mention that it's unlikely to work. The last time I submitted a PR to linguist, it only got merged after I wrote a blog post pointing out that they had 100s of open PRs, some of which were a year old, which got them to merge a bunch of PRs after the post hit reddit. As far as I can tell, "the magic of O P E N S O U R C E" is code for the magic of hitting the front page of reddit/HN or having lots of twitter followers.

Also, icons were broken for a while. Was that this past week?

LinkedIn

After replying to someone's “InMail”, I checked on it a couple days later, and their original message was still listed as unread (with no reply). Did it actually send my reply? I had no idea, until the other person responded.

Inbox

The Inbox app (not to be confused with Inbox App) notifies you that you have a new message before it actually downloads the message. It takes an arbitrary amount of time before the app itself gets the message, and refreshing in the app doesn't cause the message to download.

The other problem with notifications is that they sometimes don't show up when you get a message. About half the time I get a notification from the gmail app, I also get a notification from the Inbox app. The other half of the time, the notification is dropped.

Overall, I get a notification for a message that I can read maybe 1/3 of the time.

Google Analytics

Some locations near the U.S. (like Mexico City and Toronto) aren't considered worthy of getting their own country. The location map shows these cities sitting in the blue ocean that's outside of the U.S.

Octopress

Footnotes don't work correctly on the main page if you allow posts on the main page (instead of the index) and use the syntax to put something below the fold. Instead of linking to the footnote, you get a reference to anchor text that goes nowhere. This is in addition to the other footnote bug I already knew about.

Tags are only downcased in some contexts but not others, which means that any tags with capitalized letters (sometimes) don't work correctly. I don't even use tags, but I noticed this on someone else's blog.

My Atom feed doesn't work correctly.

If you consider performance bugs to be problems, I noticed so many of those this past week that they have their own blog post.

Running with Rifles (Game)

Weapons that are supposed to stun injure you instead. I didn't even realize that was a bug until someone mentioned that would be fixed in the next version.

It's possible to stab people through walls.

If you're holding a key when the level changes, your character keeps doing that action continuously during the next level, even after you've released the key.

Your character's position will randomly get out of sync from the server. When that happens, the only reliable fix I've found is to randomly shoot for a while. Apparently shooting causes the client to do something like send a snapshot of your position to the server? Not sure why that doesn't just happen regularly.

Vehicles can randomly spawn on top of you, killing you.

You can randomly spawn under a vehicle, killing you.

AI teammates don't consider walls or buildings when throwing grenades, which often causes them to kill themselves.

Grenades will sometimes damage the last vehicle you were in even when you're nowhere near the vehicle.

AI vehicles can get permanently stuck on pretty much any obstacle.

This is the first video game I've played in about 15 years. I tend to think of games as being pretty reliable, but that's probably because games were much simpler 15 years ago. MS Paint doesn't have many bugs, either.

Update: The sync issue above is caused by memory leaks. I originally thought that the game just had very poor online play code, but it turns out it's actually ok for the first 6 hours or so after a server restart. There are scripts around to restart the servers periodically, but they sometimes have bugs which cause them to stop running. When that happens on the official servers, the game basically becomes unplayable online.

Julia

Unicode sequence causes match/ismatch to blow up with a bounds error.

Unicode sequence causes using a string as a hash index to blow up with a bounds error.

Exception randomly not caught by catch. This sucks because putting things in a try/catch was the workaround for the two bugs above. I've seen other variants of this before; it's possible this shouldn't count as a new bug because it might be the same root cause as some bug I've already seen.

Function (I forget which) returns completely wrong results when given bad length arguments. You can even give it length arguments of the wrong type, and it will still “work” instead of throwing an exception or returning an error.

If API design bugs count, methods that operate on iterables sometimes take the iterable as the first argument and sometimes don't. There are way too many of these to list. To take one example, match takes a regex first and a string second, whereas search takes a string first and a regex second. This week, I got bit by something similar on a numerical function.

And of course I'm still running into the 1+ month old bug that breaks convert, which is pervasive enough that anything that causes it to happen renders Julia unusable.

Here's one which might be an OS X bug? I had some bad code that caused an infinite loop in some Julia code. Nothing actually happened in the while loop, so it would just run forever. Oops. The bug is that this somehow caused my system to run out of memory and become unresponsive. Activity monitor showed that the kernel was taking an ever increasing amount of memory, which went away when I killed the Julia process.

I won't list bugs in packages because there are too many. Even in core Julia, I've run into so many Julia bugs that I don't file bugs any more. It's just too much of an interruption. When I have some time, I should spend a day filing all the bugs I can remember, but I think it would literally take a whole day to write up a decent, reproducible, bug report for each bug.

See this post for more on why I run into so many Julia bugs.

Google Hangouts

On starting a hangout: "This video call isn't available right now. Try again in a few minutes.".

Same person appears twice in contacts list. Both copies have the same email listed, and double clicking on either brings me to the same chat window.

UW Health

The latch mechanism isn't quite flush to the door on about 10% of lockers, so your locker won't actually be latched unless you push hard against the door while moving the latch to the closed position.

There's no visual (or other) indication that the latch failed to latch. As far as I can tell, the only way to check is to tug on the handle to see if the door opens after you've tried to latch it.

Coursera, Mining Massive Data Sets

Selecting the correct quiz answer gives you 0 points. The workaround (independently discovered by multiple people on the forums) is to keep submitting until the correct answer gives you 1 point. This is a week after a quiz had incorrect answer options which resulted in there being no correct answers.

Facebook

If you do something “wrong” with the mouse while scrolling down on someone's wall, the blue bar at the top can somehow transform into a giant block the size of your cover photo that doesn't go away as you scroll down.

Clicking on the activity sidebar on the right pops up something that renders under other UI elements, making it impossible to read or interact with.

Pandora

A particular station keeps playing electronic music, even though I hit thumbs down every time an electronic song comes on. The seed song was a song from a Disney musical.

Dropbox/Zulip

An old issue is that you can't disable notifications from @all mentions. Since literally none of them have been relevant to me for as long as I can remember, and @all notifications outnumber other notifications, it means that the majority of notifications I get are spam.

The new thing is that I tried muting the streams that regularly spam me, but the notification blows through the mute. My fix for that is that I've disabled all notifications, but now I don't get a notification if someone DMs me or uses @danluu.

Chrome

The Rust guide is unreadable with my version of chrome (no plug-ins).

Unreadable quoted blocks

Google Docs

I tried co-writing a doc with Rose Ames. Worked fine for me, but everything displayed as gibberish for her, so we switched to hackpad.

I didn't notice this until after I tried hackpad, but Docs is really slow. Hackpad feels amazingly responsive, but it's really just that Docs is laggy. It's the same feeling I had after I tried fastmail. Gmail doesn't seem slow until you use something that isn't slow.

Hackpad

Hours after the doc was created, it says “ROSE AMES CREATED THIS 1 MINUTE AGO.”

The right hand side list, which shows who's in the room, has a stack of N people even though there are only 2 people.

Rust

After all that, Rose and I worked through the Rust guide. I won't list the issues here because they're so long that our hackpad doc that's full of bugs is at least twice as long as this blog post. And this isn't a knock against the Rust docs, the docs are actually much better than for almost any other language.

WAT

What's going on here? If you include the bugs I'm not listing because the software is so buggy that listing all of the bugs would triple the length of this post, that's about 80 bugs in one week. And that's only counting bugs I hadn't seen before. How come there are so many bugs in everything?

A common response to this sort of comment is that it's open source, you ungrateful sod, why don't you fix the bugs yourself? I do fix some bugs, but there literally aren't enough hours in a week for me to debug and fix every bug I run into. There's a tragedy of the commons effect here. If there are only a few bugs, developers are likely to fix the bugs they run across. But if there are so many bugs that making a dent is hopeless, a lot of people won't bother.

I'm going to take a look at Julia because I'm already familiar with it, but I expect that it's no better or worse tested than most of these other projects (except for Chrome, which is relatively well tested). As a rough proxy for how much test effort has gone into it, it has 18k lines of test code. But that's compared to about 108k lines of code in src plus Base.

At every place I've worked, a 2k LOC prototype that exists just so you can get preliminary performance numbers and maybe play with the API is expected to have at least that much in tests because otherwise how do you know that it's not so broken that your performance estimates aren't off by an order of magnitude? Since complexity doesn't scale linearly in LOC, folks expect a lot more test code as the prototype gets bigger.

At 18k LOC in tests for 108k LOC of code, users are going to find bugs. A lot of bugs.

Here's where I'm supposed to write an appeal to take testing more seriously and put real effort into it. But we all know that's not going to work. It would take 90k LOC of tests to get Julia to be as well tested as a poorly tested prototype (falsely assuming linear complexity in size). That's two person-years of work, not even including time to debug and fix bugs (which probably brings it closer to four or five years). Who's going to do that? No one. Writing tests is like writing documentation. Everyone already knows you should do it. Telling people they should do it adds zero information1.

Given that people aren't going to put any effort into testing, what's the best way to do it?

Property-based testing. Generative testing. Random testing. Concolic Testing (which was done long before the term was coined). Static analysis. Fuzzing. Statistical bug finding. There are lots of options. Some of them are actually the same thing because the terminology we use is inconsistent and buggy. I'm going to arbitrarily pick one to talk about, but they're all worth looking into.
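
To give a flavor of the first thing on that list, here's roughly what a minimal property-based test looks like in Python with the hypothesis library (Python and the sorted-list property are just my choice of illustration here, not anything from the projects above): you state a property, and the library generates a pile of random inputs and checks that the property holds for every one of them.

from collections import Counter
from hypothesis import given, strategies as st

# Property: sorting any list of integers yields an ordered permutation of the input.
@given(st.lists(st.integers()))
def test_sorted_is_ordered_permutation(xs):
    ys = sorted(xs)
    assert all(a <= b for a, b in zip(ys, ys[1:]))  # output is in order
    assert Counter(ys) == Counter(xs)               # output has exactly the input's elements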

People are often intimidated by these, though. I've seen a lot of talks on these and they often make it sound like this stuff is really hard. Csmith is 40k LOC. American Fuzzy Lop's compile-time instrumentation is smart enough to generate valid JPEGs. Sixth Sense has the same kind of intelligence as American Fuzzy Lop in terms of exploration, and in addition, uses symbolic execution to exhaustively explore large portions of the state space; it will formally verify that your asserts hold if it's able to collapse the state space enough to exhaustively search it, otherwise it merely tries to get the best possible test coverage by covering different paths and states. In addition, it will use symbolic equivalence checking to check different versions of your code against each other.

That's all really impressive, but you don't need a formal methods PhD to do this stuff. You can write a fuzzer that will shake out a lot of bugs in an hour2. Seriously. I'm a bit embarrassed to link to this, but this fuzzer was written in about an hour and found 20-30 bugs3, including incorrect code generation, and crashes on basic operations like multiplication and exponentiation. My guess is that it would take another 2-3 hours to shake out another 20-30 bugs (with support for more types), and maybe another day of work to get another 20-30 (with very basic support for random expressions). I don't mention this because it's good. It's not. It's totally heinous. But that's the point. You can throw together an absurd hack in an hour and it will turn out to be pretty useful.

Compared to writing unit tests by hand: even if I knew what the bugs were in advance, I'd be hard pressed to code fast enough to generate 30 bugs in an hour. 30 bugs in a day? Sure, but not if I don't already know what the bugs are in advance. This isn't to say that unit testing isn't valuable, but if you're going to spend a few hours writing tests, a few hours spent writing a fuzzer will go a lot further than a few hours spent writing unit tests. You might be able to hit 100 words a minute by typing, but your CPU can easily execute 200 billion instructions a minute. It's no contest.

What does it really take to write a fuzzer? Well, you need to generate random inputs for a program. In this case, we're generating random function calls in some namespace. Simple. The only reason it took an hour was that I don't understand Julia's reflection capabilities well enough to easily generate random types, so I ended up writing the type generation stuff by hand.
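
Here's a sketch of that idea in Python rather than Julia (the argument types, the call arity, and the decision to treat TypeError/ValueError as "expected" are all assumptions I'm making for illustration, not how the linked fuzzer works): pick a random function from a namespace, throw random arguments at it, and record anything that blows up.

import random
import traceback

def fuzz_module(module, n_calls=10_000, seed=0):
    # Call random public functions from `module` with random arguments and
    # record anything that raises an unexpected exception.
    rng = random.Random(seed)
    funcs = [getattr(module, name) for name in dir(module)
             if callable(getattr(module, name)) and not name.startswith("_")]

    def rand_arg():
        kind = rng.randrange(3)
        if kind == 0:
            return rng.randint(-2**31, 2**31)
        if kind == 1:
            return rng.random()
        return "".join(rng.choice("abc123") for _ in range(rng.randint(0, 8)))

    failures = []
    for _ in range(n_calls):
        f = rng.choice(funcs)
        args = [rand_arg() for _ in range(rng.randint(0, 3))]
        try:
            f(*args)
        except (TypeError, ValueError):
            pass  # rejecting nonsense arguments is fine
        except Exception:
            failures.append((getattr(f, "__name__", repr(f)), args, traceback.format_exc()))
    return failures

Crash-only fuzzing like this misses wrong answers (catching something like the incorrect code generation mentioned above needs an oracle, such as a second implementation or an algebraic identity to check against), but it's still a surprisingly cheap way to find real bugs.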

This applies to a lot of different types of programs. Have a GUI? It's pretty easy to prod random UI elements. Read files or things off the network? Generating (or mutating) random data is straightforward. This is something anyone can do.
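
The file/network version is just as small. A sketch (parse(), ParseError, and the seed file are hypothetical stand-ins for whatever you're actually testing):

import random

rng = random.Random(0)

def mutate(data: bytes, n_flips: int = 8) -> bytes:
    # Return a copy of `data` with a few bytes replaced at random positions.
    buf = bytearray(data)
    for _ in range(n_flips):
        if buf:
            buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)

# Hypothetical usage: hammer a parser with slightly corrupted copies of a
# known-good input. A clean "invalid input" error is fine; a crash is a finding.
# seed = open("good_input.bin", "rb").read()
# for _ in range(100_000):
#     try:
#         parse(mutate(seed))
#     except ParseError:
#         pass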

But this isn't a silver bullet. Lackadaisical testing means that your users will find bugs. However, even given that developers aren't going to spend nearly enough time on testing, we can do a lot better than we're doing right now.

Resources

There are a lot of great resources out there, but if you're just getting started, I found this description of types of fuzzers to be one of the most helpful (and simplest) things I've read.

John Regehr has a udacity course on software testing. I haven't worked through it yet (Pablo Torres just pointed me to it), but given the quality of Dr. Regehr's writing, I expect the course to be good.

For more on my perspective on testing, there's this.

Acknowledgments

Thanks to Leah Hanson and Mindy Preston for catching writing bugs, to Steve Klabnik for explaining the cause/fix of the Chrome bug (bad/corrupt web fonts), and to Phillip Joseph for finding a markdown bug.

I'm experimenting with blogging more by spending less time per post and just spewing stuff out in 30-90 minute sittings. Please let me know if something is unclear or just plain wrong. Seriously.


  1. If I were really trying to convince you of this, I'd devote a post to the business case, diving into the data and trying to figure out the cost of bugs. The short version of that unwritten post is that response times are well studied and it's known that 100ms of extra latency will cost you a noticeable amount of revenue. A 1s latency hit is a disaster. How do you think that compares to having your product not work at all?

    Compared to 100ms of latency, how bad is it when your page loads and then bugs out in a way that makes it totally unusable? What if it destroys user state and makes the user re-enter everything they wanted to buy into their cart? Removing one extra click is worth a huge amount of revenue, and now we're talking about adding 10 extra clicks or infinite latency to a random subset of users. And not a small subset, either. Want to stop lighting piles of money on fire? Write tests. If that's too much work, at least use the data you already have to find bugs.

    Of course it's sometimes worth it to light piles of money on fire. Maybe your rocket ship is powered by flaming piles of money. If you're a very rapidly growing startup, a 20% increase in revenue might not be worth that much. It could be better to focus on adding features that drive growth. The point isn't that you should definitely write more tests, it's that you should definitely do the math to see if you should write more tests.

    [return]
  2. Plus debugging time. [return]
  3. I really need to update the readme with more bugs. [return]

Speeding up this site by 50x

2014-11-17 08:00:00

I've seen all these studies that show how a 100ms improvement in page load time has a significant effect on page views, conversion rate, etc., but I'd never actually tried to optimize my site. This blog is a static Octopress site, hosted on GitHub Pages. Static sites are supposed to be fast, and GitHub Pages uses Fastly, which is supposed to be fast, so everything should be fast, right?

Not having done this before, I didn't know what to do. But in a great talk on how the internet works, Dan Espeset suggested trying webpagetest; let's give it a shot.

Here's what it shows with my nearly stock Octopress setup1. The only changes I'd made were enabling Google Analytics, the social media buttons at the bottom of posts, and adding CSS styling for tables (which are, by default, unstyled and unreadable).

time to start rendering: 9.7s; time to visual completion: 10.9s

12 seconds to the first page view! What happened? I thought static sites were supposed to be fast. The first byte gets there in less than half a second, but the page doesn't start rendering until 9 seconds later.

Lots of js, CSS, and fonts

Looks like the first thing that happens is that we load a bunch of js and CSS. Looking at the source, we have all this js in source/_includes/head.html.

<script src="{{ root_url }}/javascripts/modernizr-2.0.js"></script>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<script>!window.jQuery && document.write(unescape('%3Cscript src="./javascripts/lib/jquery.min.js"%3E%3C/script%3E'))</script>
<script src="{{ root_url }}/javascripts/octopress.js" type="text/javascript"></script>
{% include google_analytics.html %}

I don't know anything about web page optimization, but Espeset mentioned that js will stall page loading and rendering. What if we move the scripts to source/_includes/custom/after_footer.html?

time to start rendering: 4.9s

That's a lot better! We've just saved about 4 seconds on load time and on time to start rendering.

Those script tags load modernizr, jquery, octopress.js, and some google analytics stuff. What is in this octopress.js anyway? It's mostly code to support stuff like embedding flash videos, delicious integration, and github repo integration. There are a few things that do get used for my site, but most of that code is dead weight.

Also, why are there multiple js files? Espeset also mentioned that connections are finite resources, and that we'll run out of simultaneous open connections if we have a bunch of different files. Let's strip out all of that unused js and combine the remaining js into a single file.

time to start rendering: 1.4s

Much better! But wait a sec. What do I need js for? As far as I can tell, the only thing my site is still using octopress's js for is so that you can push the right sidebar back and forth by clicking on it, and jquery and modernizr are only necessary for the js used in octopress. I never use that, and according to in-page analytics no one else does either. Let's get rid of it.

time to start rendering: .7s; time to visual completion: 1.2s

That didn't change total load time much, but the browser started rendering sooner. We're down to having the site visually complete after 1.2s, compared to 9.6s initially -- an 8x improvement.

What's left? There's still some js for the twitter and fb widgets at the bottom of each post, but those all get loaded after things are rendered, so they don't really affect the user's experience, even though they make the “Load Time” number look bad.

That's a lot of fonts!

This is a pie chart of how many bytes of my page are devoted to each type of file. Apparently, the plurality of the payload is spent on fonts. Despite my reference post being an unusually image heavy blog post, fonts are 43.8% and images are such a small percentage that webpagetest doesn't even list the number. Doesn't my browser already have some default fonts? Can we just use those?

time to start rendering: .6s; time to visual completion: .9s

Turns out, we can. The webpage is now visually complete in 0.9s -- a 12x improvement. The improvement isn't quite as dramatic for “Repeat View”2 -- it's only an 8.6x improvement there -- but that's still pretty good.

The one remaining “obvious” issue is that the header loads two css files, one of which isn't minified. This uses up two connections and sends more data than necessary. Minifying the other css file and combining them speeds this up even further.

time to start rendering: .6s; time to visual completion: .7s

Time to visually complete is now 0.7s -- a 15.6x improvement3. And that's on a page that's unusually image heavy for my site.

Mostly image load time

At this point the only things that happen before the page starts displaying are: loading the HTML, loading the one css file, and loading the giant image (reliability.png).

We've already minified the css, so the main thing left to do is to make the giant image smaller. I already ran optipng -o7 -zm1-9 on all my images, but ImageOptim was able to shave off another 4% of the image, giving a slight improvement. Across all the images in all my posts, ImageOptim was able to reduce images by an additional 20% over optipng, but it didn't help much in this case.

I also tried specifying the size of the image to see if that would let the page render before the image was finished downloading, but it didn't result in much of a difference.

After that, I couldn't think of anything else to try, but webpagetest had some helpful suggestions.

Blargh github pages

Apparently, the server I'm on is slow (it gets a D in sending the first byte after the initial request). It also recommends caching static content, but when I look at the individual suggestions, they're mostly for widgets I don't host/control. I should use a CDN, but Github Pages doesn't put content on a CDN for bare domains unless you use a DNS alias record, and my DNS provider doesn't support alias records. That's two reasons to stop serving from Github Pages (or perhaps one reason to move off Github Pages and one reason to get another DNS provider), so I switched to Cloudflare, which shaved over 100ms off the time to first byte.

Note that if you use Cloudflare for a static site, you'll want to create a "Page Rule" and enable "Cache Everything". By default, Cloudflare doesn't cache HTML, which is sort of pointless on a static blog that's mostly HTML. If you've done the optimizations here, you'll also want to avoid their "Rocket Loader" thing, which attempts to make js load asynchronously but does so by adding its own blocking javascript. "Rocket Loader" is like AMP, in that it can speed up large, bloated websites, but is big enough that it slows down moderately optimized websites.

Here's what happened after I initially enabled Cloudflare without realizing that I needed to create a "Page Rule".

Cloudflare saves 80MB out of 1GB

That's about a day's worth of traffic in 2013. Initially, Cloudflare was serving my CSS and redirecting to Github Pages for the HTML. Then I inlined my CSS and Cloudflare literally did nothing. Overall, Cloudflare served 80MB out of 1GB of traffic because it was only caching images and this blog is relatively light on images.

I haven't talked about inlining CSS, but it's easy and gives a huge speedup on the first visit since it means only one connection is required to display the page, instead of two sequential connections. It's a disadvantage on future visits since it means that the CSS has to be re-downloaded for each page, but since most of my traffic is from people running across a single blog post, who don't click through to anything else, it's a net win. In _includes/head.html,

<link href="{{ root_url }}/stylesheets/all.css" media="screen, projection" rel="stylesheet" type="text/css">

should change to

{% include all.css %}

In addition, there's a lot of pointless cruft in the css. Removing the stuff that someone who doesn't know CSS can spot as pointless (like support for delicious, support for Firefox 3.5 and below, and lines that firefox flags as having syntax errors, such as no-wrap instead of nowrap) cuts down the remaining CSS by about half. There's a lot of duplication remaining and I expect that the CSS could be reduced by another factor of 4, but that would require actually knowing CSS. Just doing those things, we get down to .4s before the webpage is visually complete.

Inlining css

That's a 10.9/.4 ≈ 27x speedup. The effect on mobile is a lot more dramatic; there, it's closer to 50x.

I'm not sure what to think about all this. On the one hand, I'm happy that I was able to get a 25x-50x speedup on my site. On the other hand, I associate speedups of that magnitude with porting plain Ruby code to optimized C++, optimized C++ to a GPU, or GPU to quick-and-dirty exploratory ASIC. How is it possible that someone with zero knowledge of web development can get that kind of speedup by watching one presentation and then futzing around for 25 minutes? I was hoping to maybe find 100ms of slack, but it turns out there's not just 100ms, or even 1000ms, but 10000ms of slack in an Octopress setup. According to a study I've seen, going from 1000ms to 3000ms costs you 20% of your readers and 50% of your click-throughs. I haven't seen a study that looks at going from 400ms to 10900ms because the idea that a website would be that slow is so absurd that people don't even look into the possibility. But many websites are that slow!4

Update

I found it too hard to futz around with trimming down the massive CSS file that comes with Octopress, so I removed all of the CSS and then added a few lines to allow for a nav bar. This makes almost no difference on the desktop benchmark above, but it's a noticeable improvement for slow connections. The difference is quite dramatic for 56k connections as well as connections with high packet loss.

Starting the day I made this change, my analytics data shows a noticeable improvement in engagement and traffic. There are too many things confounded here to say what caused this change (performance increase, total lack of styling, etc.), but there are a couple of things I find interesting about this. First, it seems likely to show that the advice that it's very important to keep line lengths short is incorrect since, if that had a very large impact, it would've overwhelmed the other changes and resulted in reduced engagement and not increased engagement. Second, despite the Octopress design being widely used and lauded (it appears to have been the most widely used blog theme for programmers when I started my blog), it appears to cause a blog (or at least this blog) to get less readership than literally having no styling at all. Having no styling is surely not optimal, but there's something a bit funny about no styling beating the at-the-time most widely used programmer blog styling, which means it likely also beat wordpress, svbtle, blogspot, medium, etc., since those have most of the same ingredients as Octopress.

Resources

Unfortunately, the video of the presentation I'm referring to is restricted to RC alums. If you're an RC alum, check this out. Otherwise, high-performance browser networking is great, but much longer.

Acknowledgements

Thanks to Leah Hanson, Daniel Espeset, and Hugo Jobling for comments/corrections/discussion.

I'm not a front-end person, so I might be totally off in how I'm looking at these benchmarks. If so, please let me know.


  1. From whatever version was current in September 2013. It's possible some of these issues have been fixed, but based on the extremely painful experience of other people who've tried to update their Octopress installs, it didn't seem worth making the attempt to get a newer version of Octopress. [return]
  2. Why is “Repeat View” slower than “First View”? [return]
  3. If you look at a video of loading the original vs. this version, the difference is pretty dramatic. [return]
  4. For example, slashdot takes 15s to load over FIOS. The tests shown above were done on Cable, which is substantially slower. [return]

Build uptime

2014-11-10 08:00:00

I've noticed that builds are broken and tests fail a lot more often on open source projects than on “work” projects. I wasn't sure how much of that was my perception vs. reality, so I grabbed the Travis CI data for a few popular categories on GitHub1.

Graph of build reliability. Props to nu, oryx, caffe, catalyst, and Scala.

For reference, at every place I've worked, two 9s of reliability (99% uptime) on the build would be considered bad. That would mean that the build is failing for over three and a half days a year, or seven hours per month. Even three 9s (99.9% uptime) is about forty-five minutes of downtime a month. That's kinda ok if there isn't a hard system in place to prevent people from checking in bad code, but it's quite bad for a place that's serious about having working builds.

By contrast, 2 9s of reliability is way above average for the projects I pulled data for2 -- only 8 of 40 projects are that reliable. Almost twice as many projects -- 15 of 40 -- don't even achieve one 9 of uptime. And my sample is heavily biased towards reliable projects. These are projects that were well-known enough to be “featured” in a hand curated list by GitHub. That already biases the data right there. And then I only grabbed data from the projects that care enough about testing to set up TravisCI3, which introduces an even stronger bias.

To make sure I wasn't grabbing bad samples, I removed any initial set of failing tests (there are often a lot of fails as people try to set up Travis and have it misconfigured) and projects that use another system for tracking builds and only have Travis as an afterthought (like Rust)4.

Why doesn't the build fail all the time at work? Engineers don't like waiting for someone else to unbreak the build and managers can do the back of the envelope calculation which says that N idle engineers * X hours of build breakage = $Y of wasted money.

But that same logic applies to open source projects! Instead of wasting dollars, contributors' time is wasted.

Web programmers are hyper-aware of how 100ms of extra latency on a web page load has a noticeable effect on conversion rate. Well, what's the effect on conversion rate when a potential contributor to your project spends 20 minutes installing dependencies and an hour building your project only to find the build is broken?

I used to dig through these kinds of failures to find the bug, usually assuming that it must be some configuration issue specific to my machine. But having spent years debugging failures I run into with make check on a clean build, I've found that it's often just that someone checked in bad code. Nowadays, if I'm thinking about contributing to a project or trying to fix a bug and the build doesn't work, I move on to another project.

The worst thing about regular build failures is that they're easy5 to prevent. Graydon Hoare literally calls keeping a clean build the “not rocket science rule”, and wrote an open source tool (bors) anyone can use to do not-rocket-science. And yet, most open source projects still suffer through broken and failed builds, along with the associated cost of lost developer time and lost developer “conversions”.

Please don't read too much into the individual data in the graph. I find it interesting that DevOps projects tend to be more reliable than languages, which tend to be more reliable than web frameworks, and that ML projects are all over the place (but are mostly reliable). But when it comes to individual projects, all sorts of stuff can cause a project to have bad numbers.

Thanks to Kevin Lynagh, Leah Hanson, Michael Smith, Katerina Barone-Adesi, and Alexey Romanov for comments.

Also, props to Michael Smith of Puppetlabs for a friendly ping and working through the build data for puppet to make sure there wasn't a bug in my scripts. This is one of my most maligned blog posts because no one wants to believe the build for their project is broken more often than the build for other projects. But even though it only takes about a minute to pull down the data for a project and sanity check it using the links in this post, only one person actually looked through the data with me, while a bunch of people told me how it must quite obviously be incorrect without ever checking the data.

This isn't to say that I don't have any bugs. This is a quick hack that probably has bugs and I'm always happy to get bug reports! But some non-bugs that have been repeatedly reported are getting data from all branches instead of the main branch, getting data for all PRs and not just code that's actually checked in to the main branch, and using the number of failed builds instead of the amount of time that the build is down. I'm pretty sure that you can check that any of those claims are false in about the same amount of time that it takes to make the claim, but that doesn't stop people from making the claim.
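
For what it's worth, the downtime measure itself is nothing fancy. Here's a rough sketch of the calculation in Python (the real pipeline is the modified Travis script plus a Julia aggregation script mentioned in the footnotes; this just illustrates the idea, and assumes you've already pulled (start time, passed?) pairs for builds on the main branch):

from datetime import datetime, timedelta

def downtime_fraction(builds):
    # `builds` is a list of (start_time, passed) tuples for consecutive builds
    # on the main branch. A red period runs from a failing build's start until
    # the start of the next build.
    builds = sorted(builds)
    if len(builds) < 2:
        return 0.0
    down = timedelta(0)
    for (start, passed), (next_start, _) in zip(builds, builds[1:]):
        if not passed:
            down += next_start - start
    total = builds[-1][0] - builds[0][0]
    return down / total if total else 0.0

# Red from Jan 2 until the next build on Jan 3, over a two-day window: 50% downtime.
print(downtime_fraction([(datetime(2014, 1, 1), True),
                         (datetime(2014, 1, 2), False),
                         (datetime(2014, 1, 3), True)]))  # 0.5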


  1. Categories determined from GitHub's featured projects lists, which seem to be hand curated. [return]
  2. Wouldn't it be nice if I had test coverage data, too? But I didn't try to grab it since this was a quick 30-minute project and coming up with cross-language test coverage comparisons isn't trivial. However, I spot checked some projects and the ones that do poorly conform to an engineering version of what Tyler Cowen calls "The Law of Below Averages" -- projects that often have broken/failed builds also tend to have very spotty test coverage. [return]
  3. I used the official Travis API script, modified to return build start time instead of build finish time. Even so, build start time isn't exactly the same as check-in time, which introduces some noise. Only data against the main branch (usually master) was used. Some data was incomplete because their script either got a 500 error from the Travis API server, or ran into a runtime syntax error. All errors happened with and without my modifications, which is pretty appropriate for this blog post.

    If you want to reproduce the results, apply this patch to the official script, run it with the appropriate options (usually with --branch master, but not always), and then aggregate the results. You can use this script, but if you don't have Julia it may be easier to just do it yourself.

    [return]
  4. I think I filtered all the projects that were actually using a different testing service out. Please let me know if there are any still in my list. This removed one project with one-tenth of a 9 and two projects with about half a 9. BTW, removing the initial Travis fails for these projects bumped some of them up between half a 9 and a full 9 and completely eliminated a project that's had failing Travis tests for over a year. The graph shown looks much better than the raw data, and it's still not good. [return]
  5. Easy technically. Hard culturally. Michael Smith brought up the issue of intermittent failures. When you get those, whether that's because the project itself is broken or because the CI build is broken, people will start checking in bad code. There are environments where people don't do that -- for the better part of a decade, I worked at a company where people would track down basically any test failure ever, even (or especially) if the failure was something that disappeared with no explanation. How do you convince people to care that much? That's hard.

    How do you convince people to use a system like bors, where you don't have to care to avoid breaking the build? That's much easier, though still harder than the technical problems involved in building bors.

    [return]

Literature review on the benefits of static types

2014-11-07 08:00:00

There are some pretty strong statements about types floating around out there. The claims range from the oft-repeated phrase that when you get the types to line up, everything just works, to “not relying on type safety is unethical (if you have an SLA)”1, "It boils down to cost vs benefit, actual studies, and mathematical axioms, not aesthetics or feelings", and I think programmers who doubt that type systems help are basically the tech equivalent of an anti-vaxxer. The first and last of these statements are from "types" thought leaders who are widely quoted. There are probably plenty of strong claims about dynamic languages that I'd be skeptical of if I heard them, but I'm not in the right communities to hear the stronger claims about dynamically typed languages. Either way, it's rare to see people cite actual evidence.

Let's take a look at the empirical evidence that backs up these claims.

Click here if you just want to see the summary without having to wade through all the studies. The summary of the summary is that most studies find very small effects, if any. However, the studies probably don't cover contexts you're actually interested in. If you want the gory details, here's each study, with its abstract, and a short blurb about the study.

A Large Scale Study of Programming Languages and Code Quality in Github; Ray, B; Posnett, D; Filkov, V; Devanbu, P

Abstract

What is the effect of programming languages on software quality? This question has been a topic of much debate for a very long time. In this study, we gather a very large data set from GitHub (729 projects, 80 Million SLOC, 29,000 authors, 1.5 million commits, in 17 languages) in an attempt to shed some empirical light on this question. This reasonably large sample size allows us to use a mixed-methods approach, combining multiple regression modeling with visualization and text analytics, to study the effect of language features such as static v.s. dynamic typing, strong v.s. weak typing on software quality. By triangulating findings from different methods, and controlling for confounding effects such as team size, project size, and project history, we report that language design does have a significant, but modest effect on software quality. Most notably, it does appear that strong typing is modestly better than weak typing, and among functional languages, static typing is also somewhat better than dynamic typing. We also find that functional languages are somewhat better than procedural languages. It is worth noting that these modest effects arising from language design are overwhelmingly dominated by the process factors such as project size, team size, and commit size. However, we hasten to caution the reader that even these modest effects might quite possibly be due to other, intangible process factors, e.g., the preference of certain personality types for functional, static and strongly typed languages.

Summary

The authors looked at the 50 most starred repos on github for each of the 20 most popular languages plus TypeScript (minus CSS, shell, and vim). For each of these projects, they looked at the languages used. The text in the body of the study doesn't support the strong claims made in the abstract. Additionally, the study appears to use a fundamentally flawed methodology that's not capable of revealing much information. Even if the methodology were sound, the study uses bogus data and has what Pinker calls the igon value problem.

As Gary Bernhardt points out, the authors of the study seem to confuse memory safety and implicit coercion and make other strange statements, such as

Advocates of dynamic typing may argue that rather than spend a lot of time correcting annoying static type errors arising from sound, conservative static type checking algorithms in compilers, it’s better to rely on strong dynamic typing to catch errors as and when they arise.

The study uses the following language classification scheme

Table of classifications

These classifications seem arbitrary, and many people would disagree with some of them. Since the results are based on aggregating results with respect to these categories, and the authors have chosen arbitrary classifications, this already makes the aggregated results suspect: the authors have a number of degrees of freedom here and they've made some odd choices.

In order to get the language level results, the authors looked at commit/PR logs to determine how many bugs there were for each language used. As far as I can tell, open issues with no associated fix don't count towards the bug count. Only commits that are detected by their keyword search technique were counted. With this methodology, the number of bugs found will depend at least as strongly on the bug reporting culture as it does on the actual number of bugs found.

After determining the number of bugs, the authors ran a regression, controlling for project age, number of developers, number of commits, and lines of code.

Defect rate correlations

There are enough odd correlations here that, even if the methodology wasn't known to be flawed, I'd be skeptical that authors have captured a causal relationship. If you don't find it odd that Perl and Ruby are as reliable as each other and significantly more reliable than Erlang and Java (which are also equally reliable), which are significantly more reliable than Python, PHP, and C (which are similarly reliable), and that TypeScript is the safest language surveyed, then maybe this passes the sniff test for you, but even without reading further, this looks suspicious.

For example, Erlang and Go are rated as having a lot of concurrency bugs, whereas Perl and CoffeeScript are rated as having few concurrency bugs. Is it more plausible that Perl and CoffeeScript are better at concurrency than Erlang and Go or that people tend to use Erlang and Go more when they need concurrency? The authors note that Go might have a lot of concurrency bugs because there's a good tool to detect concurrency bugs in Go, but they don't explore reasons for most of the odd intermediate results.

As for TypeScript, Eirenarch has pointed out that the three projects they list as example TypeScript projects, which they call the "top three" TypeScript projects, are bitcoin, litecoin, and qBittorrent. These are C++ projects. So the intermediate result appears to not be that TypeScript is reliable, but that projects mis-identified as TypeScript are reliable. Those projects are reliable because Qt translation files are identified as TypeScript and it turns out that, per line of code, giant dumps of config files from another project don't cause a lot of bugs. It's like saying that a project has few bugs per line of code because it has a giant README. This is the most blatant classification error, but it's far from the only one.

For example, of what they call the "top three" perl projects, one is showdown, a javascript project, and one is rails-dev-box, a shell script and a vagrant file used to launch a Rails dev environment. Without knowing anything about the latter project, one might expect it's not a perl project from its name, rails-dev-box, which correctly indicates that it's a rails related project.

Since this study uses Github's notoriously inaccurate code classification system to classify repos, it is, at best, a series of correlations with factors that are themselves only loosely correlated with actual language usage.

There's more analysis, but much of it is based on aggregating the table above into categories based on language type. Since I'm skeptical of these results, I'm at least as skeptical of any results based on aggregating these results. This section barely even scratches the surface of this study. Even with just a light skim, we see multiple serious flaws, any one of which would invalidate the results, plus numerous igon value problems. It appears that the authors didn't even look at the tables they put in the paper, since if they did, it would jump out that (just for example) they classified a project called "rails-dev-box" as one of the three biggest perl projects (it's a 70-line shell script used to spin up ruby/rails dev environments).

Do Static Type Systems Improve the Maintainability of Software Systems? An Empirical Study Kleinschmager, S.; Hanenberg, S.; Robbes, R.; Tanter, E.; Stefik, A.

Abstract

Static type systems play an essential role in contemporary programming languages. Despite their importance, whether static type systems influence human software development capabilities remains an open question. One frequently mentioned argument for static type systems is that they improve the maintainability of software systems - an often used claim for which there is no empirical evidence. This paper describes an experiment which tests whether static type systems improve the maintainability of software systems. The results show rigorous empirical evidence that static types are indeed beneficial to these activities, except for fixing semantic errors.

Summary

While the abstract talks about general classes of languages, the study uses Java and Groovy.

Subjects were given classes in which they had to either fix errors in existing code or fill out stub methods. Static classes for Java, dynamic classes for Groovy. In cases of type errors (and their respective no method errors), developers solved the problem faster in Java. For semantic errors, there was no difference.

The study used a within-subject design, with randomized task order over 33 subjects.

A notable limitation is that the study avoided using “complicated control structures”, such as loops and recursion, because those increase variance in time-to-solve. As a result, all of the bugs are trivial bugs. This can be seen in the median time to solve the tasks, which are in the hundreds of seconds. Tasks can include multiple bugs, so the time per bug is quite low.

Groovy is both better and worse than Java

This paper mentions that its results contradict some prior results, and one of the possible causes they give is that their tasks are more complex than the tasks from those other papers. The fact that the tasks in this paper don't involve loops or recursion, because those were considered too complicated, should give you an idea of the complexity of the tasks involved in most of these papers.

Other limitations in this experiment were that the variables were artificially named such that there was no type information encoded in any of the names, that there were no comments, and that there was zero documentation on the APIs provided. That's an unusually hostile environment to find bugs in, and it's not clear how the results generalize if any form of documentation is provided.

Additionally, even though the authors specifically picked trivial tasks in order to minimize the variance between programmers, the variance between programmers was still much greater than the variance between languages in all but two tasks. Those two tasks were both cases of a simple type error causing a run-time exception that wasn't near the type error.

A controlled experiment to assess the benefits of procedure argument type checking, Prechelt, L.; Tichy, W.F.

Abstract

Type checking is considered an important mechanism for detecting programming errors, especially interface errors. This report describes an experiment to assess the defect-detection capabilities of static, intermodule type checking.

The experiment uses ANSI C and Kernighan & Ritchie (K&R) C. The relevant difference is that the ANSI C compiler checks module interfaces (i.e., the parameter lists of calls to external functions), whereas K&R C does not. The experiment employs a counterbalanced design in which each of the 40 subjects, most of them CS PhD students, writes two nontrivial programs that interface with a complex library (Motif). Each subject writes one program in ANSI C and one in K&R C. The input to each compiler run is saved and manually analyzed for defects.

Results indicate that delivered ANSI C programs contain significantly fewer interface defects than delivered K&R C programs. Furthermore, after subjects have gained some familiarity with the interface they are using, ANSI C programmers remove defects faster and are more productive (measured in both delivery time and functionality implemented)

Summary

The “nontrivial” tasks are the inversion of a 2x2 matrix (with GUI) and a file “browser” menu that has two options, select file and display file. Docs for motif were provided, but example code was deliberately left out.

There are 34 subjects. Each subjects solves one problem with the K&R C compiler (which doesn't typecheck arguments) and one with the ANSI C compiler (which does).

The authors note that the distribution of results is non-normal, with highly skewed outliers, but they present their results as box plots, which makes it impossible to see the distribution. They do some statistical significance tests on various measures, and find no difference in time to completion on the first task, a significant difference on the second task, but no difference when the tasks are pooled.

ANSI C is better, except when it's worse

In terms of how the bugs are introduced during the programming process, they do a significance test against the median of one measure of defects (which finds a significant difference in the first task but not the second), and a significance test against the 75%-quantile of another measure (which finds a significant difference in the second task but not the first).

In terms of how many and what sort of bugs are in the final program, they define a variety of measures and find that some differences on the measures are statistically significant and some aren't. In the table below, bolded values indicate statistically significant differences.

Breakdown by various metrics

Note that here, first task refers to whichever task the subject happened to perform first, which is randomized, which makes the results seem rather arbitrary. Furthermore, the numbers they compare are medians (except where indicated otherwise), which also seems arbitrary.

Despite the strong statement in the abstract, I'm not convinced this study presents strong evidence for anything in particular. They have multiple comparisons, many of which seem arbitrary, and find that some of them are significant. They also find that many of their criteria don't have significant differences. Furthermore, they don't mention whether or not they tested any other arbitrary criteria. If they did, the results are much weaker than they look, and they already don't look strong.

My interpretation of this is that, if there is an effect, the effect is dwarfed by the difference between programmers, and it's not clear whether there's any real effect at all.

An empirical comparison of C, C++, Java, Perl, Python, Rexx, and Tcl, Prechelt, L.

Abstract

80 implementations of the same set of requirements are compared for several properties, such as run time, memory consumption, source text length, comment density, program structure, reliability, and the amount of effort required for writing them. The results indicate that, for the given programming problem, which regards string manipulation and search in a dictionary, “scripting languages” (Perl, Python, Rexx, Tcl) are more productive than “conventional languages” (C, C++, Java). In terms of run time and memory consumption, they often turn out better than Java and not much worse than C or C++. In general, the differences between languages tend to be smaller than the typical differences due to different programmers within the same language.

Summary

The task was to read in a list of phone numbers and return a list of words that those phone numbers could be converted to, using the letters on a phone keypad.

This study was done in two phases. There was a controlled study for the C/C++/Java group, and a self-timed implementation for the Perl/Python/Rexx/Tcl group. The former group consisted of students while the latter group consisted of respondents from a newsgroup. The former group received more criteria they should consider during implementation, and had to implement the program when they received the problem description, whereas some people in the latter group read the problem description days or weeks before implementation.

C and C++ are fast; C, C++, and Java are slow to write

If you take the results at face value, it looks like the class of language used imposes a lower bound on both implementation time and execution time, but that the variance between programmers is much larger than the variance between languages.

However, since the scripting language group had a significantly different (and easier) environment than the C-like language group, it's hard to say how much of the measured difference in implementation time is from flaws in the experimental design and how much is real.

Static type systems (sometimes) have a positive impact on the usability of undocumented software; Mayer, C.; Hanenberg, S.; Robbes, R.; Tanter, E.; Stefik, A.

Abstract

Static and dynamic type systems (as well as more recently gradual type systems) are an important research topic in programming language design. Although the study of such systems plays a major role in research, relatively little is known about the impact of type systems on software development. Perhaps one of the more common arguments for static type systems is that they require developers to annotate their code with type names, which is thus claimed to improve the documentation of software. In contrast, one common argument against static type systems is that they decrease flexibility, which may make them harder to use. While positions such as these, both for and against static type systems, have been documented in the literature, there is little rigorous empirical evidence for or against either position. In this paper, we introduce a controlled experiment where 27 subjects performed programming tasks on an undocumented API with a static type system (which required type annotations) as well as a dynamic type system (which does not). Our results show that for some types of tasks, programmers were afforded faster task completion times using a static type system, while for others, the opposite held. In this work, we document the empirical evidence that led us to this conclusion and conduct an exploratory study to try and theorize why.

Summary

The experimental setup is very similar to the previous Hanenberg paper, so I'll just describe the main difference, which is that subjects used either Java, or a restricted subset of Groovy that was equivalent to dynamically typed Java. Subjects were students who had previous experience in Java, but not Groovy, giving some advantage for the Java tasks.

Groovy is both better and worse than Java

Task 1 was a trivial warm-up task. The authors note that it's possible that Java is superior on task 1 because the subjects had prior experience in Java. The authors speculate that, in general, Java is superior to untyped Java for more complex tasks, but they make it clear that they're just speculating and don't have enough data to conclusively support that conclusion.

How Do API Documentation and Static Typing Affect API Usability? Endrikat, S.; Hanenberg, S.; Robbes, Romain; Stefik, A.

Abstract

When developers use Application Programming Interfaces (APIs), they often rely on documentation to assist their tasks. In previous studies, we reported evidence indicating that static type systems acted as a form of implicit documentation, benefiting developer productivity. Such implicit documentation is easier to maintain, given it is enforced by the compiler, but previous experiments tested users without any explicit documentation. In this paper, we report on a controlled experiment and an exploratory study comparing the impact of using documentation and a static or dynamic type system on a development task. Results of our study both confirm previous findings and show that the benefits of static typing are strengthened with explicit documentation, but that this was not as strongly felt with dynamically typed languages.

There's an earlier study in this series with the following abstract:

In the discussion about the usefulness of static or dynamic type systems there is often the statement that static type systems improve the documentation of software. In the meantime there exists even some empirical evidence for this statement. One of the possible explanations for this positive influence is that the static type system of programming languages such as Java require developers to write down the type names, i.e. lexical representations which potentially help developers. Because of that there is a plausible hypothesis that the main benefit comes from the type names and not from the static type checks that are based on these names. In order to argue for or against static type systems it is desirable to check this plausible hypothesis in an experimental way. This paper describes an experiment with 20 participants that has been performed in order to check whether developers using an unknown API already benefit (in terms of development time) from the pure syntactical representation of type names without static type checking. The result of the study is that developers do benefit from the type names in an API's source code. But already a single wrong type name has a measurable significant negative impact on the development time in comparison to APIs without type names.

The languages used were Java and Dart. The university running the tests teaches in Java, so subjects had prior experience in Java. The task was one “where participants use the API in a way that objects need to be configured and passed to the API”, which was chosen because the authors thought that both types and documentation should have some effect. “The challenge for developers is to locate all the API elements necessary to properly configure [an] object”. The documentation was free-form text plus examples.

Documentation is only helpful with types and vice versa?

Taken at face value, it looks like types+documentation is a lot better than having one or the other, or neither. But since the subjects were students at a school that used Java, it's not clear how much of the effect is from familiarity with the language and how much is from the language. Moreover, the task was a single task that was chosen specifically because it was the kind of task where both types and documentation were expected to matter.

An Experiment About Static and Dynamic Type Systems; Hanenberg, S.

Abstract

Although static type systems are an essential part in teaching and research in software engineering and computer science, there is hardly any knowledge about what the impact of static type systems on the development time or the resulting quality for a piece of software is. On the one hand there are authors that state that static type systems decrease an application's complexity and hence its development time (which means that the quality must be improved since developers have more time left in their projects). On the other hand there are authors that argue that static type systems increase development time (and hence decrease the code quality) since they restrict developers to express themselves in a desired way. This paper presents an empirical study with 49 subjects that studies the impact of a static type system for the development of a parser over 27 hours working time. In the experiments the existence of the static type system has neither a positive nor a negative impact on an application's development time (under the conditions of the experiment).

Summary

This is another Hanenberg study with a basically sound experimental design, so I won't go into details about the design. Some unique parts are that, in order to control for familiarity and other things that are difficult to control for with existing languages, the author created two custom languages for this study.

The author says that the language has similarities to Smalltalk, Ruby, and Java, and that the language is a class-based OO language with single implementation inheritance and late binding.

The students had 16 hours of training in the new language before starting. The author argues that this was sufficient because “the language, its API as well as its IDE was kept very simple”. An additional 2 hours was spent to explain the type system for the static types group.

There were two tasks, a “small” one (implementing a scanner) and a “large” one (implementing a parser). The author found a statistically significant difference in time to complete the small task (the dynamic language was faster) and no difference in the time to complete the large task.

There are a number of reasons this result may not be generalizable. The author is aware of them and there's a long section on ways this study doesn't generalize as well as a good discussion on threats to validity.

Work In Progress: an Empirical Study of Static Typing in Ruby; Daly, M; Sazawal, V; Foster, J.

Abstract

In this paper, we present an empirical pilot study of four skilled programmers as they develop programs in Ruby, a popular, dynamically typed, object-oriented scripting language. Our study compares programmer behavior under the standard Ruby interpreter versus using Diamondback Ruby (DRuby), which adds static type inference to Ruby. The aim of our study is to understand whether DRuby's static typing is beneficial to programmers. We found that DRuby's warnings rarely provided information about potential errors not already evident from Ruby's own error messages or from presumed prior knowledge. We hypothesize that programmers have ways of reasoning about types that compensate for the lack of static type information, possibly limiting DRuby's usefulness when used on small programs.

Summary

Subjects came from a local Ruby user's group. Subjects implemented a simplified Sudoku solver and a maze solver. DRuby was randomly selected for one of the two problems for each subject. There were four subjects, but the authors changed the protocol after the first subject. Only three subjects had the same setup.

The authors find no benefit to having types. This is one of the studies that the first Hanenberg study mentions as a work their findings contradict. That first paper claimed that it was because their tasks were more complex, but it seems to me that this paper has a more complex task. One possible reason they found contradictory results is that the effect size is small. Another is that the specific type systems used matter, and that a DRuby v. Ruby study doesn't generalize to Java v. Groovy. Another is that the previous study attempted to remove anything hinting at type information from the dynamic implementation, including names that indicate types and API documentation. The participants of this study mention that they get a lot of type information from API docs, and the authors note that the participants encode type information in their method names.

This study was presented in a case study format, with selected comments from the participants and an analysis of their comments. The authors note that participants regularly think about types, and check types, even when programming in a dynamic language.

Haskell vs. Ada vs. C++ vs. Awk vs. ... An Experiment in Software Prototyping Productivity; Hudak, P; Jones, M.

Abstract

We describe the results of an experiment in which several conventional programming languages, together with the functional language Haskell, were used to prototype a Naval Surface Warfare Center (NSWC) requirement for a Geometric Region Server. The resulting programs and development metrics were reviewed by a committee chosen by the Navy. The results indicate that the Haskell prototype took significantly less time to develop and was considerably more concise and easier to understand than the corresponding prototypes written in several different imperative languages, including Ada and C++.

Summary

Subjects were given an informal text description for the requirements of a geo server. The requirements were behavior oriented and didn't mention performance. The subjects were “expert” programmers in the languages they used. They were asked to implement a prototype and track metrics such as dev time, lines of code, and docs. Metrics were all self reported, and no guidelines were given as to how they should be measured, so metrics varied between subjects. Also, some, but not all, subjects attended a meeting where additional information was given on the assignment.

Due to the time-frame and funding requirements, the requirements for the server were extremely simple; the median implementation was a couple hundred lines of code. Furthermore, the panel that reviewed the solutions didn't have time to evaluate or run the code; they based their findings on the written reports and oral presentations of the subjects.

Table of LOC, dev time, and lines of code

This study hints at a very interesting result, but considering all of its limitations, the fact that each language (except Haskell) was only tested once, and that other studies show much larger intra-group variance than inter-group variance, it's hard to conclude much from this study alone.

Unit testing isn't enough. You need static typing too; Farrer, E

Abstract

Unit testing and static type checking are tools for ensuring defect free software. Unit testing is the practice of writing code to test individual units of a piece of software. By validating each unit of software, defects can be discovered during development. Static type checking is performed by a type checker that automatically validates the correct typing of expressions and statements at compile time. By validating correct typing, many defects can be discovered during development. Static typing also limits the expressiveness of a programming language in that it will reject some programs which are ill-typed, but which are free of defects.

Many proponents of unit testing claim that static type checking is an insufficient mechanism for ensuring defect free software; and therefore, unit testing is still required if static type checking is utilized. They also assert that once unit testing is utilized, static type checking is no longer needed for defect detection, and so it should be eliminated.

The goal of this research is to explore whether unit testing does in fact obviate static type checking in real world examples of unit tested software.

Summary

The author took four Python programs and translated them to Haskell. Haskell's type system found some bugs. Unlike academic software engineering research, this study involves something larger than a toy program and looks at a type system that's more expressive than Java's type system. The programs were the NMEA Toolkit (9 bugs), MIDITUL (2 bugs), GrapeFruit (0 bugs), and PyFontInfo (6 bugs).

As far as I can tell, there isn't an analysis of the severity of the bugs. The programs were 2324, 2253, 2390, and 609 lines long, respectively, so the bugs found / LOC were 17 / 7576 = 1 / 446. For reference, in Code Complete, Steve McConnell estimates that 15-50 bugs per 1kLOC is normal. If you believe that estimate applies to this codebase, you'd expect that this technique caught between 4% and 15% of the bugs in this code. There's no particular reason to believe the estimate should apply, but we can keep this number in mind as a reference in order to compare to a similarly generated number from another study that we'll get to later.

The author does some analysis on how hard it would have been to find the bugs through testing, but only considers line coverage directed unit testing; the author counts a bug as one that unit testing might not have caught if it could be missed even with 100% line coverage. This seems artificially weak — it's generally well accepted that line coverage is a very weak notion of coverage and that testing merely to get high line coverage isn't sufficient. In fact, it is generally considered insufficient to even test merely to get high path coverage, which is a much stronger notion of coverage than line coverage.

Gradual Typing of Erlang Programs: A Wrangler Experience; Sagonas, K; Luna, D

Abstract

Currently most Erlang programs contain no or very little type information. This sometimes makes them unreliable, hard to use, and difficult to understand and maintain. In this paper we describe our experiences from using static analysis tools to gradually add type information to a medium sized Erlang application that we did not write ourselves: the code base of Wrangler. We carefully document the approach we followed, the exact steps we took, and discuss possible difficulties that one is expected to deal with and the effort which is required in the process. We also show the type of software defects that are typically brought forward, the opportunities for code refactoring and improvement, and the expected benefits from embarking in such a project. We have chosen Wrangler for our experiment because the process is better explained on a code base which is small enough so that the interested reader can retrace its steps, yet large enough to make the experiment quite challenging and the experiences worth writing about. However, we have also done something similar on large parts of Erlang/OTP. The result can partly be seen in the source code of Erlang/OTP R12B-3.

Summary

This is somewhat similar to the study in “Unit testing isn't enough”, except that the authors of this study created a static analysis tool instead of translating the program into another language. The authors note that they spent about half an hour finding and fixing bugs after running their tool. They also point out some bugs that would be difficult to find by testing. They explicitly state “what's interesting in our approach is that all these are achieved without imposing any (restrictive) static type system in the language.” The authors have a follow-on paper, “Static Detection of Race Conditions in Erlang”, which extends the approach.

The list of papers that find bugs using static analysis without explicitly adding types is too long to list. This is just one typical example.

0install: Replacing Python; Leonard, T., pt2, pt3

Abstract

No abstract because this is a series of blog posts.

Summary

This compares ATS, C#, Go, Haskell, OCaml, Python and Rust. The author assigns scores to various criteria, but it's really a qualitative comparison. But it's interesting reading because it seriously considers the effect of language on a non-trivial codebase (30kLOC).

The author implemented parts of 0install in various languages and then eventually decided on OCaml and ported the entire thing to OCaml. There are some great comments about why the author chose OCaml and what the author gained by using OCaml over Python.

Verilog vs. VHDL design competition; Cooley, J

Abstract

No abstract because it's a usenet posting.

Summary

Subjects were given 90 minutes to create a small chunk of hardware, a synchronous loadable 9-bit increment-by-3 decrement-by-5 up/down counter that generated even parity, carry and borrow, with the goal of optimizing for cycle time of the synthesized result. For the software folks reading this, this is something you'd expect to be able to do in 90 minutes if nothing goes wrong, or maybe if only a few things go wrong.

Subjects were judged purely by how optimized their result was, as long as it worked. Results that didn't pass all tests were disqualified. Although the task was quite simple, it was made substantially more complicated by the strict optimization goal. For any software readers out there, this task is approximately as complicated as implementing the same thing in assembly, where your assembler takes 15-30 minutes to assemble something.

Subjects could use Verilog (unityped) or VHDL (typed). 9 people chose Verilog and 5 chose VHDL.

During the experiment, there were a number of issues that made things easier or harder for some subjects. Overall, Verilog users were affected more negatively than VHDL users. The license server for the Verilog simulator crashed. Also, four of the five VHDL subjects were accidentally given six extra minutes. The author had manuals for the wrong logic family available, and one Verilog user spent 10 minutes reading the wrong manual before giving up and using his intuition. One of the Verilog users noted that they passed the wrong version of their code along to be tested and failed because of that. One of the VHDL users hit a bug in the VHDL simulator.

Of the 9 Verilog users, 8 got something synthesized before the 90 minute deadline; of those, 5 had a design that passed all tests. None of the VHDL users were able to synthesize a circuit in time.

Two of the VHDL users complained about issues with types: “I can't believe I got caught on a simple typing error. I used IEEE std_logic_arith, which requires use of unsigned & signed subtypes, instead of std_logic_unsigned.”, and "I ran into a problem with VHDL or VSS (I'm still not sure.) This case statement doesn't analyze: ‘subtype two_bits is unsigned(1 downto 0); case two_bits'(up & down)...' But what worked was: ‘case two_bits'(up, down)...' Finally I solved this problem by assigning the concatenation first to a[n] auxiliary variable."

Comparing mathematical provers; Wiedijk, F

Abstract

We compare fifteen systems for the formalizations of mathematics with the computer. We present several tables that list various properties of these programs. The three main dimensions on which we compare these systems are: the size of their library, the strength of their logic and their level of automation.

Summary

The author compares the type systems and foundations of various theorem provers, and comments on their relative levels of proof automation.

Type systems of provers

Foundations of provers

Graph of automation level of provers

The author looked at one particular problem (proving the irrationality of the square root of two) and examined how different systems handle the problem, including the style of the proof and its length. There's a table of lengths, but it doesn't match the updated code examples provided here. For instance, that table claims that the ACL2 proof is 206 lines long, but there's a 21 line ACL2 proof here.

The author has a number of criteria for determining how much automation each prover provides, but he freely admits that it's highly subjective. The author doesn't provide the exact rubric used for scoring, but he mentions that a more automated interaction style, user automation, powerful built-in automation, and the Poincare principle (basically whether the system lets you write programs to solve proofs algorithmically) all count towards being more automated, and that a more powerful logic (e.g., first-order v. higher-order), logical framework dependent types, and the de Bruijn criterion (having a small guaranteed kernel) count towards being more mathematical.

Do Programming Languages Affect Productivity? A Case Study Using Data from Open Source Projects; Delory, D; Knutson, C; Chun, S

Abstract

Brooks and others long ago suggested that on average computer programmers write the same number of lines of code in a given amount of time regardless of the programming language used. We examine data collected from the CVS repositories of 9,999 open source projects hosted on SourceForge.net to test this assumption for 10 of the most popular programming languages in use in the open source community. We find that for 24 of the 45 pairwise comparisons, the programming language is a significant factor in determining the rate at which source code is written, even after accounting for variations between programmers and projects.

Summary

The authors say “our goal is not to construct a predictive or explanatory model. Rather, we seek only to develop a model that sufficiently accounts for the variation in our data so that we may test the significance of the estimated effect of programming language.” and that's what they do. They get some correlations, but it's hard to conclude much of anything from them.

The Unreasonable Effectiveness of Dynamic Typing for Practical Programs; Smallshire, R

Abstract

Some programming language theorists would have us believe that the one true path to working systems lies in powerful and expressive type systems which allow us to encode rich constraints into programs at the time they are created. If these academic computer scientists would get out more, they would soon discover an increasing incidence of software developed in languages such as Python, Ruby and Clojure which use dynamic, albeit strong, type systems. They would probably be surprised to find that much of this software—in spite of their well-founded type-theoretic hubris—actually works, and is indeed reliable out of all proportion to their expectations. This talk—given by an experienced polyglot programmer who once implemented Hindley Milner static type inference for “fun”, but who now builds large and successful systems in Python—explores the disconnect between the dire outcomes predicted by advocates of static typing versus the near absence of type errors in real world systems built with dynamic languages: Does diligent unit testing more than make up for the lack of static typing? Does the nature of the type system have only a low-order effect on reliability compared to the functional or imperative programming paradigm in use? How often is the dynamism of the type system used anyway? How much type information can JITs exploit at runtime? Does the unwarranted success of dynamically typed languages get up the nose of people who write Haskell?

Summary

The speaker used data from Github to determine that approximately 2.7% of Python bugs are type errors. Python's TypeError, AttributeError, and NameError were classified as type errors. The speaker rounded 2.7% down to 2% and claimed that 2% of errors were type related. The speaker mentioned that on a commercial codebase he worked with, 1% of errors were type related, but that could be rounded down from anything less than 2%. The speaker mentioned looking at the equivalent errors in Ruby, Clojure, and other dynamic languages, but didn't present any data on those other languages.

This data might be good but it's impossible to tell because there isn't enough information about the methodology. Something this has going for it is that the number is in the right ballpark compared to the made-up number we got when we compared the bug rate from Code Complete to the number of bugs found by Farrer. Possibly interesting, but thin.

Summary of summaries

This isn't an exhaustive list. For example, I haven't covered “An Empirical Comparison of Static and Dynamic Type Systems on API Usage in the Presence of an IDE: Java vs. Groovy with Eclipse”, and “Do developers benefit from generic types?: an empirical comparison of generic and raw types in java” because they didn't seem to add much to what we've already seen.

I didn't cover a number of older studies that are in the related work section of almost all the listed studies both because the older studies often cover points that aren't really up for debate anymore and also because the experimental design in a lot of those older papers leaves something to be desired. Feel free to ping me if there's something you think should be added to the list.

Not only is this list not exhaustive, it's not objective and unbiased. If you read the studies, you can get a pretty good handle on how the studies are biased. However, I can't provide enough information for you to decide for yourself how the studies are biased without reproducing most of the text of the papers, so you're left with my interpretation of things, filtered through my own biases. That can't be helped, but I can at least explain my biases so you can discount my summaries appropriately.

I like types. I find ML-like languages really pleasant to program in, and if I were king of the world, we'd all use F# as our default managed language. The situation with unmanaged languages is a bit messier. I certainly prefer C++ to C because std::unique_ptr and friends make C++ feel a lot safer than C. I suspect I might prefer Rust once it's more stable. But while I like languages with expressive type systems, I haven't noticed that they make me more productive or less bug prone0.

Now that you know what my biases are, let me give you my interpretation of the studies. Of the controlled experiments, only three show an effect large enough to have any practical significance: the Prechelt study comparing C, C++, Java, Perl, Python, Rexx, and Tcl; the Endrikat study comparing Java and Dart; and Cooley's experiment with VHDL and Verilog. Unfortunately, they all have issues that make it hard to draw a really strong conclusion.

In the Prechelt study, the populations were different between dynamic and typed languages, and the conditions for the tasks were also different. There was a follow-up study that illustrated the issue by inviting Lispers to come up with their own solutions to the problem, which involved comparing folks like Darius Bacon to random undergrads. A follow-up to the follow-up literally involves comparing code from Peter Norvig to code from random college students.

In the Endrikat study, they specifically picked a task where they thought static typing would make a difference, and they drew their subjects from a population where everyone had taken classes using the statically typed language. They don't comment on whether or not students had experience in the dynamically typed language, but it seems safe to assume that most or all had less experience in the dynamically typed language.

Cooley's experiment was one of the few that drew people from a non-student population, which is great. But, as with all of the other experiments, the task was a trivial toy task. While it seems damning that none of the VHDL (static language) participants were able to complete the task on time, it is extremely unusual to want to finish a hardware design in 1.5 hours anywhere outside of a school project. You might argue that a large task can be broken down into many smaller tasks, but a plausible counterargument is that there are fixed costs using VHDL that can be amortized across many tasks.

As for the rest of the experiments, the main takeaway I have from them is that, under the specific set of circumstances described in the studies, any effect, if it exists at all, is small.

Moving on to the case studies, the two bug finding case studies make for interesting reading, but they don't really make a case for or against types. One shows that transcribing Python programs to Haskell will find a non-zero number of bugs of unknown severity that might not be found through unit testing that's line-coverage oriented. The pair of Erlang papers shows that you can find some bugs that would be difficult to find through any sort of testing, some of which are severe, using static analysis.

As a user, I find it convenient when my compiler gives me an error before I run separate static analysis tools, but that's minor, perhaps even smaller than the effect size of the controlled studies listed above.

I found the 0install case study (that compared various languages to Python and eventually settled on OCaml) to be one of the more interesting things I ran across, but it's the kind of subjective thing that everyone will interpret differently, which you can see by looking at the responses to it.

The prover comparison fits with the impression I have (in my little corner of the world, ACL2, Isabelle/HOL, and PVS are the most commonly used provers, and it makes sense that people would prefer more automation when solving problems in industry), but that's also subjective.

And then there are the studies that mine data from existing projects. Unfortunately, I couldn't find anybody who did anything to determine causation (e.g., find an appropriate instrumental variable), so they just measure correlations. Some of the correlations are unexpected, but there isn't enough information to determine why. The lack of any causal instrument doesn't stop people like Ray et al. from making strong, unsupported claims.

The only data mining study that presents data that's potentially interesting without further exploration is Smallshire's review of Python bugs, but there isn't enough information on the methodology to figure out what his study really means, and it's not clear why he hinted at looking at data for other languages without presenting the data2.

Some notable omissions from the studies are comprehensive studies using experienced programmers, let alone studies that have large populations of "good" or "bad" programmers, looking at anything approaching a significant project (in places I've worked, a three month project would be considered small, but that's multiple orders of magnitude larger than any project used in a controlled study), using "modern" statically typed languages, using gradual/optional typing, using modern mainstream IDEs (like VS and Eclipse), using modern radical IDEs (like LightTable), using old school editors (like Emacs and vim), doing maintenance on a non-trivial codebase, doing maintenance with anything resembling a realistic environment, doing maintenance on a codebase you're already familiar with, etc.

If you look at the internet commentary on these studies, most of them are passed around to justify one viewpoint or another. The Prechelt study on dynamic vs. static and the follow-ups on Lisp are perennial favorites of dynamic language advocates, and the GitHub mining study has recently become trendy among functional programmers.

One Twitter thread is typical: someone who actually read the study notes that other factors dominate language (which is explicitly pointed out by the authors of the study), which prompts someone else to respond 'Why don't you just read the paper? It explains all this. In the abstract, even.'

Other than cherry picking studies to confirm a long-held position, the most common response I've heard to these sorts of studies is that the effect isn't quantifiable by a controlled experiment. However, I've yet to hear a specific reason that doesn't also apply to any other field that empirically measures human behavior. Compared to a lot of those fields, it's easy to run controlled experiments or do empirical studies. It's true that controlled studies only tell you something about a very limited set of circumstances, but the fix to that isn't to dismiss them, but to fund more studies. It's also true that it's tough to determine causation from ex-post empirical studies, but the solution isn't to ignore the data, but to do more sophisticated analysis. For example, econometric methods are often able to make a case for causation with data that's messier than the data we've looked at here.

The next most common response is that their viewpoint is still valid because their specific language or use case isn't covered. Maybe, but if the strongest statement you can make for your position is that there's no empirical evidence against the position, that's not much of a position.

If you've managed to read this entire thing without falling asleep, you might be interested in my opinion on tests.

Responses

Here are the responses I've gotten from people mentioned in this post. Robert Smallshire said "Your review article is very good. Thanks for taking the time to put it together." On my comment about the F# "mistake" vs. trolling, his reply was "Neither. That torque != energy is obviously solved by modeling quantities not dimensions. The point being that this modeling of quantities with types takes effort without necessarily delivering any value." Not having done much with units myself, I don't have an informed opinion on this, but my natural bias is to try to encode the information in types if at all possible.

Bartosz Milewski said "Guilty as charged!". Wow. Much Respect. But notice that, as of this update, the correction has been retweeted 1/25th as often as the original tweet. People want to believe there's evidence their position is superior. People don't want to believe the evidence is murky, or even possibly against them. Misinformation people want to believe spreads faster than information people don't want to believe.

On a related twitter conversation, Andreas Stefik said "That is not true. It depends on which scientific question. Static vs. Dynamic is well studied.", "Profound rebuttal. I had better retract my peer reviewed papers, given this new insight!", "Take a look at the papers...", and "This is a serious misrepresentation of our studies." I muted the guy since it didn't seem to be going anywhere, but it's possible there was a substantive response buried in some later tweet. It's pretty easy to take twitter comments out of context, so check out the thread yourself if you're really curious.

I have a lot of respect for the folks who do these experiments, which is, unfortunately, not mutual. But the really unfortunate thing is that some of the people who do these experiments think that static v. dynamic is something that is, at present, "well studied". There are plenty of equally difficult to study subfields in the social sciences that have multiple orders of magnitude more research going on, that are considered open problems, but at least some researchers already consider this to be well studied!

Acknowledgements

Thanks to Leah Hanson, Joe Wilder, Robert David Grant, Jakub Wilk, Rich Loveland, Eirenarch, Edward Knight, and Evan Farrer for comments/corrections/discussion.


  1. This was from a talk at Strange Loop this year. The author later clarified his statement with "To me, this follows immediately (a technical term in logic meaning the same thing as “trivially”) from the Curry-Howard Isomorphism we discussed, and from our Types vs. Tests: An Epic Battle? presentation two years ago. If types are theorems (they are), and implementations are proofs (they are), and your SLA is a guarantee of certain behavior of your system (it is), then how can using technology that precludes forbidding undesirable behavior of your system before other people use it (dynamic typing) possibly be anything but unethical?" [return]
  2. Just as an aside, I find the online responses to Smallshire's study to be pretty great. There are, of course, the usual responses about how his evidence is wrong and therefore static types are, in fact, beneficial because there's no evidence against them, and you don't need evidence for them because you can arrive at the proper conclusion using pure reason. The really interesting bit is that, at one point, Smallshire presents an example of an F# program that can't catch a certain class of bug via its type system, and the online response is basically that he's an idiot who should have written his program in a different way so that the type system should have caught the bug. I can't tell if Smallshire's bug was an honest mistake or masterful trolling. [return]

CLWB and PCOMMIT

2014-11-05 08:00:00

The latest version of the Intel manual has a couple of new instructions for non-volatile storage, like SSDs. What's that about?

Before we look at the instructions in detail, let's take a look at the issues that exist with super fast NVRAM. One problem is that next generation storage technologies (PCM, 3D XPoint, etc.) will be fast enough that syscall and other OS overhead can be more expensive than the actual cost of the disk access1. Another is the impedance mismatch between the x86 memory hierarchy and persistent memory. In both cases, it's basically an Amdahl's law problem, where one component has improved so much that other components have to improve to keep up.

There's a good paper by Todor Mollov, Louis Eisner, Arup De, Joel Coburn, and Steven Swanson on the first issue; I'm going to present one of their graphs below.

OS and other overhead for NVRAM operations

Everything says “Moneta” because that's the name of their system (which is pretty cool, BTW; I recommend reading the paper to see how they did it). Their “baseline” case is significantly better than you'll get out of a stock system. They did a number of optimizations (e.g., bypassing Linux's IO scheduler and removing context switches where possible), which reduce latency by 62% over plain old Linux. Despite that, the hardware + DMA cost of the transaction (the white part of the bar) is dwarfed by the overhead. Note that they consider the cost of the DMA to be part of the hardware overhead.

They're able to bypass the OS entirely and reduce a lot of the overhead, but it's still true that the majority of the cost of a write is overhead.

OS bypass speedup for NVRAM operations

Despite not being able to get rid of all of the overhead, they get pretty significant speedups, both on small microbenchmarks and real code. So that's one problem. The OS imposes a pretty large tax on I/O when your I/O device is really fast.

Maybe you can bypass large parts of that problem by just mapping your NVRAM device to a region of memory and committing things to it as necessary. But that runs into another problem, which is the impedance mismatch between how caches interact with the NVRAM region if you want something like transactional semantics.

This is described in more detail in this report by Kumud Bhandari, Dhruva R. Chakrabarti, and Hans-J. Boehm. I'm going to borrow a couple of their figures, too.

Rough memory hierarchy diagram

We've got this NVRAM region which is safe and persistent, but before the CPU can get to it, it has to go through multiple layers with varying ordering guarantees. They give the following example:

Consider, for example, a common programming idiom where a persistent memory location N is allocated, initialized, and published by assigning the allocated address to a global persistent pointer p. If the assignment to the global pointer becomes visible in NVRAM before the initialization (presumably because the latter is cached and has not made its way to NVRAM) and the program crashes at that very point, a post-restart dereference of the persistent pointer will read uninitialized data. Assuming writeback (WB) caching mode, this can be avoided by inserting cache-line flushes for the freshly allocated persistent locations N before the assignment to the global persistent pointer p.
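
Concretely, the idiom from the quote looks something like the sketch below, assuming a write-back-cached NVRAM mapping, a 64-byte cache line, and gcc/clang-style SSE2 intrinsics; the record type and the publish function are made up for illustration.

#include <stddef.h>
#include <emmintrin.h> /* _mm_clflush, _mm_mfence */

struct record { long long payload[8]; };   /* hypothetical persistent object */
struct record *p;                          /* global persistent pointer, also in NVRAM */

/* Initialize a freshly allocated persistent record, flush its cache lines
   to NVRAM, and only then publish it through p, so that a crash can't
   leave p pointing at uninitialized data. */
void publish(struct record *n) {
  for (size_t i = 0; i < sizeof n->payload / sizeof n->payload[0]; i++)
    n->payload[i] = 0;                     /* initialization */
  for (size_t off = 0; off < sizeof *n; off += 64)
    _mm_clflush((char *)n + off);          /* push the new lines out toward NVRAM */
  _mm_mfence();                            /* CLFLUSH is only ordered by MFENCE */
  p = n;                                   /* publish */
  _mm_clflush(&p);                         /* the pointer itself also has to reach NVRAM */
  _mm_mfence();
}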

Inserting CLFLUSH instructions all over the place works, but how much overhead is that?

Persistence overhead on reads and writes

The four memory types they look at (and the four that x86 supports) are writeback (WB), writethrough (WT), write combine (WC), and uncacheable (UC). WB is what you deal with under normal circumstances. Memory can be cached and it's written back whenever it's forced to be. WT allows memory to be cached, but writes have to be written straight through to memory, i.e., memory is kept up to date with the cache. UC simply can't be cached. WC is like UC, except that writes can be coalesced before being sent out to memory.

The R, W, and RW benchmarks are just benchmarks of reading and writing memory. WB is clearly the best, by far (lower is better). If you want to get an intuitive feel for how much better WB is than the other policies, try booting an OS with anything but WB memory.

I've had to do that on occasion because I used to work for a chip company, and when we first got the chip back, we often didn't know which bits we had to disable to work around bugs. The simplest way to make progress is often to disable caches entirely. That “works”, but even minimal OSes like DOS are noticeably slow to boot without WB memory. My recollection is that Win 3.1 takes the better part of an hour, and that Win 95 is a multiple hour process.

The _b benchmarks force writes to be visible to memory. For the WB case, that involves an MFENCE followed by a CLFLUSH. WB with visibility constraints is significantly slower than the other alternatives; it's multiple orders of magnitude slower than WB when writes don't have to be ordered and flushed.

They also run benchmarks on some real data structures, with the constraint that data should be persistently visible.

Persistence overhead on data structure operations

With that constraint, even regular WB memory can be terribly slow: within a factor of 2 of the performance of running without caches. And that's just the overhead around getting out of the cache hierarchy -- that's true even if your persistent storage is infinitely fast.

Now, let's look at how Intel decided to address this. There are two new instructions, CLWB and PCOMMIT.

CLWB acts like CLFLUSH, in that it forces the data to get written out to memory. However, it doesn't force the cache to throw away the data, which makes future reads and writes a lot faster. Also, CLFLUSH is only ordered with respect to MFENCE, but CLWB is also ordered with respect to SFENCE. Here's their description of CLWB:

Writes back to memory the cache line (if dirty) that contains the linear address specified with the memory operand from any level of the cache hierarchy in the cache coherence domain. The line may be retained in the cache hierarchy in non-modified state. Retaining the line in the cache hierarchy is a performance optimization (treated as a hint by hardware) to reduce the possibility of cache miss on a subsequent access. Hardware may choose to retain the line at any of the levels in the cache hierarchy, and in some cases, may invalidate the line from the cache hierarchy. The source operand is a byte memory location.

It should be noted that processors are free to speculatively fetch and cache data from system memory regions that are assigned a memory-type allowing for speculative reads (such as, the WB, WC, and WT memory types). Because this speculative fetching can occur at any time and is not tied to instruction execution, the CLWB instruction is not ordered with respect to PREFETCHh instructions or any of the speculative fetching mechanisms (that is, data can be speculatively loaded into a cache line just before, during, or after the execution of a CLWB instruction that references the cache line).

CLWB instruction is ordered only by store-fencing operations. For example, software can use an SFENCE, MFENCE, XCHG, or LOCK-prefixed instructions to ensure that previous stores are included in the write-back. CLWB instruction need not be ordered by another CLWB or CLFLUSHOPT instruction. CLWB is implicitly ordered with older stores executed by the logical processor to the same address.

Executions of CLWB interact with executions of PCOMMIT. The PCOMMIT instruction operates on certain store-to-memory operations that have been accepted to memory. CLWB executed for the same cache line as an older store causes the store to become accepted to memory when the CLWB execution becomes globally visible.

PCOMMIT is applied to entire memory ranges and ensures that everything in the memory range is committed to persistent storage. Here's their description of PCOMMIT:

The PCOMMIT instruction causes certain store-to-memory operations to persistent memory ranges to become persistent (power failure protected). Specifically, PCOMMIT applies to those stores that have been accepted to memory.

While all store-to-memory operations are eventually accepted to memory, the following items specify the actions software can take to ensure that they are accepted:

Non-temporal stores to write-back (WB) memory and all stores to uncacheable (UC), write-combining (WC), and write-through (WT) memory are accepted to memory as soon as they are globally visible.

If, after an ordinary store to write-back (WB) memory becomes globally visible, CLFLUSH, CLFLUSHOPT, or CLWB is executed for the same cache line as the store, the store is accepted to memory when the CLFLUSH, CLFLUSHOPT or CLWB execution itself becomes globally visible.

If PCOMMIT is executed after a store to a persistent memory range is accepted to memory, the store becomes persistent when the PCOMMIT becomes globally visible. This implies that, if an execution of PCOMMIT is globally visible when a later store to persistent memory is executed, that store cannot become persistent before the stores to which the PCOMMIT applies.

The following items detail the ordering between PCOMMIT and other operations:

A logical processor does not ensure previous stores and executions of CLFLUSHOPT and CLWB (by that logical processor) are globally visible before commencing an execution of PCOMMIT. This implies that software must use appropriate fencing instruction (e.g., SFENCE) to ensure the previous stores-to-memory operations and CLFLUSHOPT and CLWB executions to persistent memory ranges are globally visible (so that they are accepted to memory), before executing PCOMMIT.

A logical processor does not ensure that an execution of PCOMMIT is globally visible before commencing subsequent stores. Software that requires that such stores not become globally visible before PCOMMIT (e.g., because the younger stores must not become persistent before those committed by PCOMMIT) can ensure by using an appropriate fencing instruction (e.g., SFENCE) between PCOMMIT and the later stores.

An execution of PCOMMIT is ordered with respect to executions of SFENCE, MFENCE, XCHG or LOCK-prefixed instructions, and serializing instructions (e.g., CPUID).

Executions of PCOMMIT are not ordered with respect to load operations. Software can use MFENCE to order loads with PCOMMIT.

Executions of PCOMMIT do not serialize the instruction stream.
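
Putting the quoted rules together, persisting a dirty line looks something like the sketch below. This is only a sketch: no compiler emitted these instructions when this was written, so it assumes a toolchain that exposes _mm_clwb (e.g., gcc/clang with -mclwb) and it emits PCOMMIT as raw bytes using the encoding given in the manual.

#include <immintrin.h> /* _mm_clwb (needs -mclwb), _mm_sfence */

/* Follow the sequence described above: CLWB so the store is accepted to
   memory, SFENCE so the CLWB is globally visible before PCOMMIT, PCOMMIT
   to make the accepted stores persistent, and a final SFENCE so younger
   stores can't become persistent ahead of the commit. */
static inline void persist_line(void *line) {
  _mm_clwb(line);    /* write the line back without (necessarily) evicting it */
  _mm_sfence();      /* order the CLWB before PCOMMIT */
  /* PCOMMIT has no intrinsic here; 0x66 0x0F 0xAE 0xF8 is its encoding. */
  __asm__ __volatile__(".byte 0x66, 0x0f, 0xae, 0xf8" ::: "memory");
  _mm_sfence();      /* keep later stores from passing the commit */
}

Compared to the CLFLUSH version above, the line can stay in the cache, and SFENCE, rather than MFENCE, is enough to order the write-back.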

How much CLWB and PCOMMIT actually improve performance will be up to their implementations. It will be interesting to benchmark these and see how they do. In any case, this is an attempt to solve the WB/NVRAM impedance mismatch issue. It doesn't directly address the OS overhead issue, but that can, to a large extent, be worked around without extra hardware.

If you liked this post, you'll probably also enjoy reading about cache partitioning in Broadwell and newer Intel server parts.

Thanks to Eric Bron for spotting this in the manual and pointing it out, and to Leah Hanson, Nate Rowe, and 'unwind' for finding typos.

If you haven't had enough of papers, Zvonimir Bandic pointed out a paper by Dejan Vučinić, Qingbo Wang, Cyril Guyot, Robert Mateescu, Filip Blagojević, Luiz Franca-Neto, Damien Le Moal, Trevor Bunker, Jian Xu, and Steven Swanson on getting 1.4 us latency and 700k IOPS out of a type of NVRAM.

If you liked this post, you might also like this related post on "new" CPU features.


  1. this should sound familiar to HPC and HFT folks with InfiniBand networks. [return]

Caches: LRU v. random

2014-11-03 08:00:00

Once upon a time, my computer architecture professor mentioned that using a random eviction policy for caches really isn't so bad. That random eviction isn't bad can be surprising — if your cache fills up and you have to get rid of something, choosing the least recently used (LRU) is an obvious choice, since you're more likely to use something if you've used it recently. If you have a tight loop, LRU is going to be perfect as long as the loop fits in cache, but it's going to cause a miss every time if the loop doesn't fit. A random eviction policy degrades gracefully as the loop gets too big.

In practice, on real workloads, random tends to do worse than other algorithms. But what if we take two random choices (2-random) and just use LRU between those two choices?
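
As a sketch, here's what that decision looks like for a single set of a set-associative cache (the per-way timestamps and the use of rand() are simplifications for illustration, not how the simulations below work):

#include <stdint.h>
#include <stdlib.h>

#define WAYS 8   /* 8-way associative, matching the simulated caches below */

struct way { uint64_t tag; uint64_t last_used; };   /* per-line metadata */

/* 2-random: pick two distinct random ways and evict whichever of the two
   was used least recently. Plain random eviction would just return the
   first pick; full LRU would scan all WAYS ways for the oldest. */
static int pick_victim_2random(const struct way set[WAYS]) {
  int a = rand() % WAYS;
  int b = rand() % WAYS;
  while (b == a)
    b = rand() % WAYS;
  return set[a].last_used <= set[b].last_used ? a : b;
}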

Here are the relative miss rates we get for SPEC CPU1 with a Sandy Bridge-like cache (8-way associative, 64k, 256k, and 2MB L1, L2, and L3 caches, respectively). These are ratios (algorithm miss rate : random miss rate); lower is better. Each cache uses the same policy at all levels of the cache.

Policy L1 (64k) L2 (256k) L3 (2MB)
2-random 0.91 0.93 0.95
FIFO 0.96 0.97 1.02
LRU 0.90 0.90 0.97
random 1.00 1.00 1.00

Random and FIFO are both strictly worse than either LRU or 2-random. LRU and 2-random are pretty close, with LRU edging out 2-random for the smaller caches and 2-random edging out LRU for the larger caches.

To see if anything odd is going on in any individual benchmark, we can look at the raw results on each sub-benchmark. The L1, L2, and L3 miss rates are all plotted in the same column for each benchmark, below:

Cache miss rates for Sandy Bridge-like cache

As we might expect, LRU does worse than 2-random when the miss rates are high, and better when the miss rates are low.

At this point, it's not clear if 2-random is beating LRU in L3 cache miss rates because it does better when the caches are large or because it's the third level in a hierarchical cache. Since a cache line that's being actively used in the L1 or L2 isn't touched in the L3, the L3 can evict it (which forces an eviction from both the L1 and L2), because, as far as the L3 is concerned, that line hasn't been used recently. This makes it less obvious that LRU is a good eviction policy for L3 cache.

To separate out the effects, let's look at the relative miss rates for a non-hierarchical (single level) cache vs. hierarchical caches at various sizes2. For the hierarchical cache, the L1 and L2 sizes are as above, 64k and 256k, and only the L3 cache size varies. Below, we've got the geometric means of the ratios3 of how each policy does (over all SPEC sub-benchmarks, compared to random eviction). A possible downside to this metric is that if we have some very low miss rates, those could dominate the mean since small fluctuations will have a large effect on the ratio, but we can look at the distribution of results to see if that's the case.

Cache miss ratios for cache sizes between 64K and 16M

L3 cache miss ratios for cache sizes between 512K and 16M

Sizes below 512k are missing for the hierarchical case because of the 256k L2 — we're using an inclusive L3 cache here, so it doesn't really make sense to have an L3 that's smaller than the L2. Sizes above 16M are omitted because cache miss rates converge when the cache gets too big, which is uninteresting.

Looking at the single cache case, it seems that LRU works a bit better than 2-random for smaller caches (lower miss ratio is better), while 2-random edges out LRU as the cache gets bigger. The story is similar in the hierarchical case, except that we don't really look at the smaller cache sizes where LRU is superior.

Comparing the two cases, the results are different, but similar enough that it looks like our original results weren't only an artifact of looking at the last level of a hierarchical cache.

Below, we'll look at the entire distribution so we can see if the mean of the ratios is being skewed by tiny results.

L3 cache miss ratios for cache sizes between 512K and 16M

L3 cache miss ratios for cache sizes between 512K and 16M

It looks like, for a particular cache size (one column of the graph), the randomized algorithms do better when miss rates are relatively high and worse when miss rates are relatively low, so, if anything, they're disadvantaged when we just look at the geometric mean — if we were to take the arithmetic mean, the result would be dominated by the larger results, where 2 random choices and plain old random do relatively well4.

From what we've seen of the mean ratios, 2-random looks fine for large caches, and from what we've seen of the distribution of the results, that's despite 2-random being penalized by the mean ratio metric, which makes it seem pretty good for large caches.

However, it's common to implement pseudo-LRU policies because LRU can be too expensive to be workable. Since 2-random requires having at least as much information as LRU, let's take a look at what happens when we use pseudo 2-random (approximately 80% accurate), and pseudo 3-random (a two-level tournament, each level of which is approximately 80% accurate).

Since random and FIFO are clearly not good replacement policies, I'll leave them out of the following graphs. Also, since the results were similar in the single cache as well as multi-level cache case, we can just look at the results from the more realistic multi-level cache case.

L3 cache miss ratios for cache sizes between 512K and 16M

Since pseudo 2-random acts like random 20% of the time and 2-random 80% of the time, we might expect it to fall somewhere between 2-random and random, which is exactly what happens. A simple tweak to try to improve pseudo 2-random is to try pseudo 3-random (evict the least recently used of 3 random choices). While that's still not quite as good as true 2-random, it's pretty close, and it's still better than LRU (and pseudo LRU) for caches larger than 1M.

The one big variable we haven't explored is the set associativity. To see how LRU compares with 2-random across different cache sizes and associativities, let's look at the LRU:2-random miss ratio (higher/red means LRU is better, lower/green means 2-random is better).

Cache miss ratios for cache sizes between 64K and 16M with associativities between and 64

On average, increasing associativity increases the difference between the two policies. As before, LRU is better for small caches and 2-random is better for large caches. Associativities of 1 and 2 aren't shown because they should be identical for both algorithms.

There's still a combinatorial explosion of possibilities we haven't tried yet. One thing to do is to try different eviction policies at different cache levels (LRU for L1 and L2 with 2-random for L3 seems promising). Another thing to do is to try this for different types of caches. I happened to choose CPU caches because it's easy to find simulators and benchmark traces, but in today's “put a cache on it” world, there are a lot of other places 2-random can be applied5.

For any comp arch folks, from this data, I suspect that 2-random doesn't keep up with adaptive policies like DIP (although it might — it's in the right ballpark, but it was characterized on a different workload using a different simulator, so it's not 100% clear). However, a pseudo 2-random policy can be implemented that barely uses more resources than pseudo-LRU policies, which makes this very cheap compared to DIP. Also, we can see that pseudo 3-random is substantially better than pseudo 2-random, which indicates that k-random is probably an improvement over 2-random for the right k. Some k-random policy might be an improvement over DIP.

So we've seen that this works, but why would anyone think to do this in the first place? The Power of Two Random Choices: A Survey of Techniques and Results by Mitzenmacher, Richa, and Sitaraman has a great explanation. The mathematical intuition is that if we (randomly) throw n balls into n bins, the maximum number of balls in any bin is O(log n / log log n) with high probability, which is pretty much just O(log n). But if (instead of choosing randomly) we choose the least loaded of k random bins, the maximum is O(log log n / log k) with high probability, i.e., even with two random choices, it's basically O(log log n) and each additional choice only reduces the load by a constant factor.
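
Here's a quick simulation sketch of that claim (the choice of n, the iteration scheme, and the use of rand() are arbitrary; this is just to make the asymptotics concrete):

#include <stdio.h>
#include <stdlib.h>

/* Throw n balls into n bins, placing each ball in the least loaded of d
   random bins, and return the maximum bin load. d=1 is plain random
   placement; d=2 is the "two random choices" scheme described above. */
static int max_load(int n, int d) {
  int *bins = calloc(n, sizeof *bins);
  int max = 0;
  for (int ball = 0; ball < n; ball++) {
    int best = rand() % n;
    for (int k = 1; k < d; k++) {
      int cand = rand() % n;
      if (bins[cand] < bins[best])
        best = cand;
    }
    if (++bins[best] > max)
      max = bins[best];
  }
  free(bins);
  return max;
}

int main(void) {
  int n = 1000000;
  printf("1 choice:  max load %d\n", max_load(n, 1)); /* roughly log n / log log n */
  printf("2 choices: max load %d\n", max_load(n, 2)); /* roughly log log n */
  return 0;
}

On a typical run, the one-choice maximum load is noticeably larger than the two-choice maximum, and the gap grows as n grows.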

This turns out to have all sorts of applications; things like load balancing and hash distribution are natural fits for the balls and bins model. There are also a lot of applications that aren't obviously analogous to the balls and bins model, like circuit routing and Erdős–Rényi graphs.

Thanks to Jan Edler and Mark Hill for making dinero IV freely available, to Aleksandar Milenkovic for providing SPEC CPU traces, and to Carl Vogel, James Porter, Peter Fraenkel, Katerina Barone-Adesi, Jesse Luehrs, Lea Albaugh, and Kevin Lynagh for advice on plots and plotting packages, to Mindy Preston for finding a typo in the acknowledgments, to Lindsey Kuper for pointing out some terminology stuff, to Tom Wenisch for suggesting that I check out CMP$im for future work, and to Leah Hanson for extensive comments on the entire post.


  1. Simulations were done with dinero IV with SBC traces. These were used because professors and grad students have gotten more protective of simulator code over the past couple decades, making it hard to find a modern open source simulator on GitHub. However, dinero IV supports hierarchical caches with prefetching, so it should give a reasonable first-order approximation.

    Note that 175.vpr and 187.facerec weren't included in the traces, so they're missing from all results in this post.

    [return]
  2. Sizes are limited by dinero IV, which requires cache sizes to be a power of 2. [return]
  3. Why consider the geometric mean of the ratios? We have different “base” miss rates for different benchmarks. For example, 181.mcf has a much higher miss rate than 252.eon. If we're trying to figure out which policy is best, those differences are just noise. Looking at the ratios removes that noise.

    And if we were just comparing those two, we'd like being 2x better on both to be equivalent to being 4x better on one and just 1x on the other, or 8x better on one and 1/2x “better” on the other. Since the geometric mean is the nth-root of the product of the results, it has that property.

    [return]
  4. We can see that 2-choices tends to be better than LRU for high miss rates by looking for the high up clusters of a green triangle, red square, empty diamond, and a blue circle, and seeing that it's usually the case that the green triangle is above the red square. It's too cluttered to really tell what's going on at the lower miss rates. I admit I cheated and looked at some zoomed in plots. [return]
  5. If you know of a cache simulator for some other domain that I can use, please let me know! [return]

Testing v. informal reasoning

2014-11-03 08:00:00

This is an off-the-cuff comment for Hacker School's Paper of the Week Read Along series for Out of the Tar Pit.

I find the idea itself, which is presented in sections 7-10, at the end of the paper, pretty interesting. However, I have some objections to the motivation for the idea, which makes up the first 60% of the paper.

Rather than do one of those blow-by-blow rebuttals that's so common on blogs, I'll limit my comments to one widely circulated idea that I believe is not only mistaken but actively harmful.

There's a claim that “informal reasoning” is more important than “testing”1, based mostly on the strength of this quote from Dijkstra:

testing is hopelessly inadequate....(it) can be used very effectively to show the presence of bugs but never to show their absence.

They go on to make a number of related claims, like “The key problem is that a test (of any kind) on a system or component that is in one particular state tells you nothing at all about the behavior of that system or component when it happens to be in another state.”, with the conclusion that stateless simplicity is the only possible fix. Needless to say, they assume that simplicity is actually possible.

I actually agree with the bit about testing -- there's no way to avoid bugs if you create a system that's too complex to formally verify.

However, there are plenty of real systems with too much irreducible complexity to make simple. Drawing from my own experience, no human can possibly hope to understand a modern high-performance CPU well enough to informally reason about its correctness. That's not only true now, it's been true for decades. It becomes true the moment someone introduces any sort of speculative execution or caching. These things are inherently stateful and complicated. They're so complicated that the only way to model performance (in order to run experiments to design high performance chips) is to simulate precisely what will happen, since the exact results are too complex for humans to reason about and too messy to be mathematically tractable. It's possible to make a simple CPU, but not one that's fast and simple. This doesn't only apply to CPUs -- performance complexity leaks all the way up the stack.

And it's not only high performance hardware and software that's complex. Some domains are just really complicated. The tax code is 73k pages long. It's just not possible to reason effectively about something that complicated, and there are plenty of things that are that complicated.

And then there's the fact that we're human. We make mistakes. Euclid's elements contains a bug in the very first theorem. Andrew Gelman likes to use this example of an "obviously" bogus published probability result (but not obvious to the authors or the peer reviewers). One of the famous Intel CPU bugs allegedly comes from not testing something because they "knew" it was correct. No matter how smart or knowledgeable, humans are incapable of reasoning correctly all of the time.

So what do you do? You write tests! They're necessary for anything above a certain level of complexity. The argument the authors make is that they're not sufficient because the state space is huge and a test of one state tells you literally nothing about a test of any other state.

That's true if you look at your system as some kind of unknowable black box, but it turns out to be untrue in practice. There are plenty of unit testing tools that will do state space reduction based on how similar inputs affect similar states, do symbolic execution, etc. This turns out to work pretty well.

And even without resorting to formal methods, you can see this with plain old normal tests. John Regehr has noted that when Csmith finds a bug, test case reduction often finds a slew of other bugs. Turns out, tests often tell you something about nearby states.

This is not just a theoretical argument. I did CPU design/verification/test for 7.5 years at a company that relied primarily on testing. In that time I can recall two bugs that were found by customers (as opposed to our testing). One was a manufacturing bug that has no software analogue. The software equivalent would be that the software works for years and then after lots of usage at high temperature 1% of customers suddenly can't use their software anymore. Bad, but not a failure of anything analogous to software testing.

The other bug was a legitimate logical bug (in the cache memory hierarchy, of course). It's pretty embarrassing that we shipped samples of a chip with a real bug to customers, but I think that most companies would be pretty happy with one logical bug in seven and a half years.

Testing may not be sufficient to find all bugs, but it can be sufficient to achieve better reliability than pretty much any software company cares to.

Thanks (or perhaps anti-thanks) to David Albert for goading me into writing up this response and to Govert Versluis for catching a typo.


  1. These kinds of claims are always a bit odd to talk about. Like nature v. nurture, we clearly get bad results if we set either quantity to zero, and they interact in a way that makes it difficult to quantify the relative effect of non-zero quantities. [return]

Assembly v. intrinsics

2014-10-19 08:00:00

Every once in a while, I hear how intrinsics have improved enough that it's safe to use them for high performance code. That would be nice. The promise of intrinsics is that you can write optimized code by calling out to functions (intrinsics) that correspond to particular assembly instructions. Since intrinsics act like normal functions, they can be cross platform. And since your compiler has access to more computational power than your brain, as well as a detailed model of every CPU, the compiler should be able to do a better job of micro-optimizations. Despite decade old claims that intrinsics can make your life easier, it never seems to work out.

The last time I tried intrinsics was around 2007; for more on why they were hopeless then, see this exploration by the author of VirtualDub. I gave them another shot recently, and while they've improved, they're still not worth the effort. The problem is that intrinsics are so unreliable that you have to manually check the result on every platform and every compiler you expect your code to be run on, and then tweak the intrinsics until you get a reasonable result. That's more work than just writing the assembly by hand. If you don't check the results by hand, it's easy to get bad results.

For example, as of this writing, the first two Google hits for popcnt benchmark (and 2 out of the top 3 bing hits) claim that Intel's hardware popcnt instruction is slower than a software implementation that counts the number of bits set in a buffer, via a table lookup using the SSSE3 pshufb instruction. This turns out to be untrue, but it must not be obvious, or this claim wouldn't be so persistent. Let's see why someone might have come to the conclusion that the popcnt instruction is slow if they coded up a solution using intrinsics.

One of the top search hits has sample code and benchmarks for both native popcnt as well as the software version using pshufb. Their code requires MSVC, which I don't have access to, but their first popcnt implementation just calls the popcnt intrinsic in a loop, which is fairly easy to reproduce in a form that gcc and clang will accept. Timing it is also pretty simple, since we're just timing a function (that happens to count the number of bits set in some fixed sized buffer).

uint32_t builtin_popcnt(const uint64_t* buf, int len) {
  int cnt = 0;
  for (int i = 0; i < len; ++i) {
    cnt += __builtin_popcountll(buf[i]);
  }
  return cnt;
}

This is slightly different from the code I linked to above, since they use the dword (32-bit) version of popcnt, and we're using the qword (64-bit) version. Since our version gets twice as much done per loop iteration, I'd expect our version to be faster than their version.
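
As for timing, a harness only needs to call the function on a fixed buffer in a loop and compute a rate; below is a minimal sketch (the buffer size, iteration count, and use of clock_gettime, which makes it Linux-flavored, are arbitrary choices, not the setup used for the numbers later in this post).

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

uint32_t builtin_popcnt(const uint64_t* buf, int len); /* definition above */

int main(void) {
  int len = 1 << 17;                        /* 2^17 * 8 bytes = a 1MB buffer */
  uint64_t *buf = malloc(len * sizeof *buf);
  for (int i = 0; i < len; i++)
    buf[i] = 0x0123456789abcdefULL * (i + 1);
  int iters = 10000;
  uint32_t sink = 0;                        /* keep the calls from being optimized out */
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < iters; i++)
    sink += builtin_popcnt(buf, len);
  clock_gettime(CLOCK_MONOTONIC, &t1);
  double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  double gb = (double)iters * len * sizeof *buf / 1e9;
  printf("%.2f GB/s (checksum %u)\n", gb / secs, sink);
  free(buf);
  return 0;
}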

Running clang -O3 -mpopcnt -funroll-loops produces a binary that we can examine. On Macs, we can use otool -tv to get the disassembly. On Linux, there's objdump -d.

_builtin_popcnt:
; address                        instruction
0000000100000b30        pushq   %rbp
0000000100000b31        movq    %rsp, %rbp
0000000100000b34        movq    %rdi, -0x8(%rbp)
0000000100000b38        movl    %esi, -0xc(%rbp)
0000000100000b3b        movl    $0x0, -0x10(%rbp)
0000000100000b42        movl    $0x0, -0x14(%rbp)
0000000100000b49        movl    -0x14(%rbp), %eax
0000000100000b4c        cmpl    -0xc(%rbp), %eax
0000000100000b4f        jge     0x100000bd4
0000000100000b55        movslq  -0x14(%rbp), %rax
0000000100000b59        movq    -0x8(%rbp), %rcx
0000000100000b5d        movq    (%rcx,%rax,8), %rax
0000000100000b61        movq    %rax, %rcx
0000000100000b64        shrq    %rcx
0000000100000b67        movabsq $0x5555555555555555, %rdx
0000000100000b71        andq    %rdx, %rcx
0000000100000b74        subq    %rcx, %rax
0000000100000b77        movabsq $0x3333333333333333, %rcx
0000000100000b81        movq    %rax, %rdx
0000000100000b84        andq    %rcx, %rdx
0000000100000b87        shrq    $0x2, %rax
0000000100000b8b        andq    %rcx, %rax
0000000100000b8e        addq    %rax, %rdx
0000000100000b91        movq    %rdx, %rax
0000000100000b94        shrq    $0x4, %rax
0000000100000b98        addq    %rax, %rdx
0000000100000b9b        movabsq $0xf0f0f0f0f0f0f0f, %rax
0000000100000ba5        andq    %rax, %rdx
0000000100000ba8        movabsq $0x101010101010101, %rax
0000000100000bb2        imulq   %rax, %rdx
0000000100000bb6        shrq    $0x38, %rdx
0000000100000bba        movl    %edx, %esi
0000000100000bbc        movl    -0x10(%rbp), %edi
0000000100000bbf        addl    %esi, %edi
0000000100000bc1        movl    %edi, -0x10(%rbp)
0000000100000bc4        movl    -0x14(%rbp), %eax
0000000100000bc7        addl    $0x1, %eax
0000000100000bcc        movl    %eax, -0x14(%rbp)
0000000100000bcf        jmpq    0x100000b49
0000000100000bd4        movl    -0x10(%rbp), %eax
0000000100000bd7        popq    %rbp
0000000100000bd8        ret

Well, that's interesting. Clang seems to be calculating things manually rather than using popcnt. It seems to be using the approach described here, which is something like

x = x - ((x >> 0x1) & 0x5555555555555555);
x = (x & 0x3333333333333333) + ((x >> 0x2) & 0x3333333333333333);
x = (x + (x >> 0x4)) & 0xF0F0F0F0F0F0F0F;
ans = (x * 0x101010101010101) >> 0x38;

That's not bad for a simple implementation that doesn't rely on any kind of specialized hardware, but that's going to take a lot longer than a single popcnt instruction.

I've got a pretty old version of clang (3.0), so let me try this again after upgrading to 3.4, in case they added hardware popcnt support “recently”.

0000000100001340        pushq   %rbp         ; save frame pointer
0000000100001341        movq    %rsp, %rbp   ; new frame pointer
0000000100001344        xorl    %ecx, %ecx   ; cnt = 0
0000000100001346        testl   %esi, %esi
0000000100001348        jle     0x100001363
000000010000134a        nopw    (%rax,%rax)
0000000100001350        popcntq (%rdi), %rax ; “eax” = popcnt[rdi]
0000000100001355        addl    %ecx, %eax   ; eax += cnt
0000000100001357        addq    $0x8, %rdi   ; increment address by 64-bits (8 bytes)
000000010000135b        decl    %esi         ; decrement loop counter; sets flags
000000010000135d        movl    %eax, %ecx   ;  cnt = eax; does not set flags
000000010000135f        jne     0x100001350  ; examine flags. if esi != 0, goto popcnt
0000000100001361        jmp     0x100001365  ; goto “restore frame pointer”
0000000100001363        movl    %ecx, %eax
0000000100001365        popq    %rbp         ; restore frame pointer
0000000100001366        ret

That's better! We get a hardware popcnt! Let's compare this to the SSSE3 pshufb implementation presented here as the fastest way to do a popcnt. We'll use a table like the one in the link to show speed, except that we're going to show a rate instead of a raw cycle count, so that the relative speed across different sizes is clear. The rate is in GB/s, i.e., how many gigs of buffer we can process per second. We give the function data in chunks (varying from 1 KB to 16 MB); each column is the rate for a different chunk size. If we look at how fast each algorithm is for various buffer sizes, we get the following.

Algorithm     1k    4k   16k   65k  256k    1M    4M   16M
Intrinsic    6.9   7.3   7.4   7.5   7.5   7.5   7.5   7.5
PSHUFB      11.5  13.0  13.3  13.4  13.1  13.4  13.0  12.6

That's not so great. Relative to the benchmark linked above, we're doing better because we're using 64-bit popcnt instead of 32-bit popcnt, but the PSHUFB version is still almost twice as fast[1].

One odd thing is the way cnt gets accumulated. cnt is stored in ecx. But, instead of adding the result of the popcnt to ecx, clang has decided to add ecx to the result of the popcnt. To fix that, clang then has to move that sum into ecx at the end of each loop iteration.

The other noticeable problem is that we only get one popcnt per iteration of the loop, which means the loop isn't getting unrolled, and we're paying the entire cost of the loop overhead for each popcnt. Unrolling the loop can also let the CPU extract more instruction level parallelism from the code, although that's a bit beyond the scope of this blog post.

Using clang, that happens even with -O3 -funroll-loops. Using gcc, we get a properly unrolled loop, but gcc has other problems, as we'll see later. For now, let's try unrolling the loop ourselves by calling __builtin_popcountll multiple times during each iteration of the loop. For simplicity, let's try doing four popcnt operations on each iteration. I don't claim that's optimal, but it should be an improvement.

uint32_t builtin_popcnt_unrolled(const uint64_t* buf, int len) {
  assert(len % 4 == 0);
  int cnt = 0;
  for (int i = 0; i < len; i+=4) {
    cnt += __builtin_popcountll(buf[i]);
    cnt += __builtin_popcountll(buf[i+1]);
    cnt += __builtin_popcountll(buf[i+2]);
    cnt += __builtin_popcountll(buf[i+3]);
  }
  return cnt;
}

The core of our loop now has

0000000100001390        popcntq (%rdi,%rcx,8), %rdx
0000000100001396        addl    %eax, %edx
0000000100001398        popcntq 0x8(%rdi,%rcx,8), %rax
000000010000139f        addl    %edx, %eax
00000001000013a1        popcntq 0x10(%rdi,%rcx,8), %rdx
00000001000013a8        addl    %eax, %edx
00000001000013aa        popcntq 0x18(%rdi,%rcx,8), %rax
00000001000013b1        addl    %edx, %eax

with pretty much the same code surrounding the loop body. We're doing four popcnt operations every time through the loop, which results in the following performance:

Algorithm     1k    4k   16k   65k  256k    1M    4M   16M
Intrinsic    6.9   7.3   7.4   7.5   7.5   7.5   7.5   7.5
PSHUFB      11.5  13.0  13.3  13.4  13.1  13.4  13.0  12.6
Unrolled    12.5  14.4  15.0  15.1  15.2  15.2  15.2  15.2

Between using 64-bit popcnt and unrolling the loop, we've already beaten the allegedly faster pshufb code! But it's close enough that we might get different results with another compiler or some other chip. Let's see if we can do better.

So, what's the deal with this popcnt false dependency bug that's been getting a lot of publicity lately? Turns out, popcnt has a false dependency on its destination register, which means that even though the result of popcnt doesn't depend on its destination register, the CPU thinks that it does and will wait until the destination register is ready before starting the popcnt instruction.

x86 typically has two-operand operations, e.g., addl %eax, %edx adds eax and edx, and then places the result in edx, so it's common for an operation to have a dependency on its output register. In this case, there shouldn't be a dependency, since the result doesn't depend on the contents of the output register, but that's an easy bug to introduce, and a hard one to catch[2].

In this particular case, popcnt has a 3 cycle latency, but it's pipelined such that a popcnt operation can execute each cycle. If we ignore other overhead, that means that a single popcnt will take 3 cycles, 2 will take 4 cycles, 3 will take 5 cycles, and n will take n+2 cycles, as long as the operations are independent. But, if the CPU incorrectly thinks there's a dependency between them, we effectively lose the ability to pipeline the instructions, and that n+2 turns into 3n.
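
To put numbers on that, here's a rough worked example using the 3-cycle latency and one-per-cycle throughput figures above, ignoring loop overhead, for a chunk of four independent popcnts (8 bytes each):

pipelined (no false dependency): n + 2 = 4 + 2 =  6 cycles  ->  32 bytes / 6 cycles  ≈ 5.3 bytes/cycle
serialized (false dependency):   3n    = 3 * 4 = 12 cycles  ->  32 bytes / 12 cycles ≈ 2.7 bytes/cycle

At the 3.4 GHz this was run at (more on that below), those rates are roughly 18 GB/s and 9 GB/s, and with longer runs of independent popcnts the pipelined case approaches the 8 bytes/cycle (27.2 GB/s) ceiling computed below.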

We can work around this by buying a CPU from AMD or VIA, or by putting the popcnt results in different registers. Let's make an array of destinations, which will let us put the result from each popcnt into a different place.

uint32_t builtin_popcnt_unrolled_errata(const uint64_t* buf, int len) {
  assert(len % 4 == 0);
  int cnt[4];
  for (int i = 0; i < 4; ++i) {
    cnt[i] = 0;
  }

  for (int i = 0; i < len; i+=4) {
    cnt[0] += __builtin_popcountll(buf[i]);
    cnt[1] += __builtin_popcountll(buf[i+1]);
    cnt[2] += __builtin_popcountll(buf[i+2]);
    cnt[3] += __builtin_popcountll(buf[i+3]);
  }
  return cnt[0] + cnt[1] + cnt[2] + cnt[3];
}

And now we get

0000000100001420        popcntq (%rdi,%r9,8), %r8
0000000100001426        addl    %ebx, %r8d
0000000100001429        popcntq 0x8(%rdi,%r9,8), %rax
0000000100001430        addl    %r14d, %eax
0000000100001433        popcntq 0x10(%rdi,%r9,8), %rdx
000000010000143a        addl    %r11d, %edx
000000010000143d        popcntq 0x18(%rdi,%r9,8), %rcx

That's better -- we can see that the first popcnt outputs into r8, the second into rax, the third into rdx, and the fourth into rcx. However, this does the same odd accumulation as the original, where instead of adding the result of the popcnt to cnt[i], it does the opposite, which necessitates moving the results back to cnt[i] afterwards.

000000010000133e        movl    %ecx, %r10d
0000000100001341        movl    %edx, %r11d
0000000100001344        movl    %eax, %r14d
0000000100001347        movl    %r8d, %ebx

Well, at least in clang (3.4). Gcc (4.8.2) is too smart to fall for this separate destination thing and “optimizes” the code back to something like our original version.

Algorithm     1k    4k   16k   65k  256k    1M    4M   16M
Intrinsic    6.9   7.3   7.4   7.5   7.5   7.5   7.5   7.5
PSHUFB      11.5  13.0  13.3  13.4  13.1  13.4  13.0  12.6
Unrolled    12.5  14.4  15.0  15.1  15.2  15.2  15.2  15.2
Unrolled 2  14.3  16.3  17.0  17.2  17.2  17.0  16.8  16.7

To get a version that works with both gcc and clang, and doesn't have these extra movs, we'll have to write the assembly by hand[3]:

uint32_t builtin_popcnt_unrolled_errata_manual(const uint64_t* buf, int len) {
  assert(len % 4 == 0);
  uint64_t cnt[4];
  for (int i = 0; i < 4; ++i) {
    cnt[i] = 0;
  }

  for (int i = 0; i < len; i+=4) {
    __asm__(
        "popcnt %4, %4  \n\
        "add %4, %0     \n\t"
        "popcnt %5, %5  \n\t"
        "add %5, %1     \n\t"
        "popcnt %6, %6  \n\t"
        "add %6, %2     \n\t"
        "popcnt %7, %7  \n\t"
        "add %7, %3     \n\t" // +r means input/output, r means intput
        : "+r" (cnt[0]), "+r" (cnt[1]), "+r" (cnt[2]), "+r" (cnt[3])
        : "r"  (buf[i]), "r"  (buf[i+1]), "r"  (buf[i+2]), "r"  (buf[i+3]));
  }
  return cnt[0] + cnt[1] + cnt[2] + cnt[3];
}

Our hand-written assembly translates pretty much directly into the loop:

00000001000013c3        popcntq %r10, %r10
00000001000013c8        addq    %r10, %rcx
00000001000013cb        popcntq %r11, %r11
00000001000013d0        addq    %r11, %r9
00000001000013d3        popcntq %r14, %r14
00000001000013d8        addq    %r14, %r8
00000001000013db        popcntq %rbx, %rbx

Great! The adds are now going the right direction, because we specified exactly what they should do.

Algorithm     1k    4k   16k   65k  256k    1M    4M   16M
Intrinsic    6.9   7.3   7.4   7.5   7.5   7.5   7.5   7.5
PSHUFB      11.5  13.0  13.3  13.4  13.1  13.4  13.0  12.6
Unrolled    12.5  14.4  15.0  15.1  15.2  15.2  15.2  15.2
Unrolled 2  14.3  16.3  17.0  17.2  17.2  17.0  16.8  16.7
Assembly    17.5  23.7  25.3  25.3  26.3  26.3  25.3  24.3

Finally! A version that blows away the PSHUFB implementation. How do we know this should be the final version? We can see from Agner's instruction tables that we can execute, at most, one popcnt per cycle. I happen to have run this on a 3.4 GHz Sandy Bridge, so we've got an upper bound of 8 bytes / cycle * 3.4 G cycles / sec = 27.2 GB/s. That's pretty close to the 26.3 GB/s we're actually getting, which is a sign that we can't make this much faster[4].

In this case, the hand coded assembly version is about 3x faster than the original intrinsic loop (not counting the version from a version of clang that didn't emit a popcnt). It happens that, for the compiler we used, the unrolled loop using the popcnt intrinsic is a bit faster than the pshufb version, but that wasn't true of one of the two unrolled versions when I tried this with gcc.

It's easy to see why someone might have benchmarked the same code and decided that popcnt isn't very fast. It's also easy to see why using intrinsics for performance critical code can be a huge time sink[5].

Thanks to Scott for some comments on the organization of this post, and to Leah for extensive comments on just about everything.

If you liked this, you'll probably enjoy this post about how CPUs have changed since the 80s.


  1. see this for the actual benchmarking code. On second thought, it's an embarrassingly terrible hack, and I'd prefer that you don't look. [return]
  2. If it were the other way around, and the hardware didn't realize there was a dependency when there should be, that would be easy to catch -- any sequence of instructions that was dependent might produce an incorrect result. In this case, some sequences of instructions are just slower than they should be, which is not trivial to check for. [return]
  3. This code is a simplified version of Alex Yee's stackoverflow answer about the popcnt false dependency bug [return]
  4. That's not quite right, since the CPU has TurboBoost, but it's pretty close. Putting that aside, this example is pretty simple, but calculating this stuff by hand can get tedious for more complicated code. Luckily, the Intel Architecture Code Analyzer can figure this stuff out for us. It finds the bottleneck in the code (assuming infinite memory bandwidth at zero latency), and displays how and why the processor is bottlenecked, which is usually enough to determine if there's room for more optimization.

    You might have noticed that the performance decreases as the buffer size becomes larger than our cache. It's possible to do a back-of-the-envelope calculation to find the upper bound imposed by the limits of memory and cache performance, but working through the calculations would take a lot more space than this footnote has available to it. You can see a good example of how to do it for one simple case here. The comments by Nathan Kurz and John McCalpin are particularly good.

    [return]
  5. In the course of running these benchmarks, I also noticed that _mm_cvtsi128_si64 produces bizarrely bad code on gcc (although it's fine in clang). _mm_cvtsi128_si64 is the intrinsic for moving an SSE (SIMD) register to a general purpose register (GPR); there's a short illustrative snippet after these footnotes. The compiler has a lot of latitude over whether or not a variable should live in a register or in memory. Clang realizes that it's probably faster to move the value from an SSE register to a GPR if the result is about to get used. Gcc decides to save a register and move the data from the SSE register to memory, and then have the next instruction operate on memory, if that's possible. In our popcnt example, clang loses about 2x for not unrolling the loop, and the rest comes from not being up to date on a CPU bug, which is understandable. It's hard to imagine why a compiler would do a register-to-memory move when it's about to operate on data unless it either doesn't do optimizations at all, or it has some bug which makes it unaware of the register-to-register version of the instruction. But at least it gets the right result, unlike this version of MSVC.

    icc and armcc are reputed to be better at dealing with intrinsics, but they're non-starters for most open source projects. Downloading icc's free non-commercial version has been disabled for the better part of a year, and even if it comes back, who's going to trust that it won't disappear again? As for armcc, I'm not sure it's ever had a free version?

    [return]
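
To make the _mm_cvtsi128_si64 remark in footnote 5 concrete, here's a minimal illustration (my own sketch, not code from the benchmarks). The intrinsic just exposes the low 64 bits of an SSE register as a 64-bit integer; the question is whether the compiler implements it as a single register-to-register move or bounces the value through memory.

#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>

// Move the low 64 bits of an SSE register into a general purpose register.
// A reasonable compiler emits a single movq from xmm to a GPR here; the gcc
// behavior described in footnote 5 spills to memory and reloads instead.
int64_t low_qword(__m128i v) {
  return _mm_cvtsi128_si64(v);
}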

Google wage fixing, 11-CV-02509-LHK, ORDER DENYING PLAINTIFFS' MOTION FOR PRELIMINARY APPROVAL OF SETTLEMENTS WITH ADOBE, APPLE, GOOGLE, AND INTEL

2014-08-14 08:00:00

UNITED STATES DISTRICT COURT
NORTHERN DISTRICT OF CALIFORNIA
SAN JOSE DIVISION

IN RE: HIGH-TECH EMPLOYEE
ANTITRUST LITIGATION

THIS DOCUMENT RELATES TO:
ALL ACTIONS

Case No.: 11-CV-02509-LHK

ORDER DENYING PLAINTIFFS' MOTION FOR PRELIMINARY APPROVAL OF SETTLEMENTS WITH ADOBE, APPLE, GOOGLE, AND INTEL

Before the Court is a Motion for Preliminary Approval of Class Action Settlement with Defendants Adobe Systems Inc. ("Adobe"), Apple Inc. ("Apple"), Google Inc. ("Google"), and Intel Corp. ("Intel") (hereafter, "Remaining Defendants") brought by three class representatives, Mark Fichtner, Siddharth Hariharan, and Daniel Stover (hereafter, "Plaintiffs"). See ECF No. 920. The Settlement provides for $324.5 million in recovery for the class in exchange for release of antitrust claims. A fourth class representative, Michael Devine ("Devine"), has filed an Opposition contending that the settlement amount is inadequate. See ECF No. 934. Plaintiffs have filed a Reply. See ECF No. 938. Plaintiffs, Remaining Defendants, and Devine appeared at a hearing on June 19, 2014. See ECF No. 940. In addition, a number of Class members have submitted letters in support of and in opposition to the proposed settlement. ECF Nos. 914, 949-51. The Court, having considered the briefing, the letters, the arguments presented at the hearing, and the record in this case, DENIES the Motion for Preliminary Approval for the reasons stated below.

I. BACKGROUND AND PROCEDURAL HISTORY

Michael Devine, Mark Fichtner, Siddharth Hariharan, and Daniel Stover, individually and on behalf of a class of all those similarly situated, allege antitrust claims against their former employers, Adobe, Apple, Google, Intel, Intuit Inc. ("Intuit"), Lucasfilm Ltd. ("Lucasfilm"), and Pixar (collectively, "Defendants"). Plaintiffs allege that Defendants entered into an overarching conspiracy through a series of bilateral agreements not to solicit each other's employees in violation of Section 1 of the Sherman Antitrust Act, 15 U.S.C. § 1, and Section 4 of the Clayton Antitrust Act, 15 U.S.C. § 15. Plaintiffs contend that the overarching conspiracy, made up of a series of six bilateral agreements (Pixar-Lucasfilm, Apple-Adobe, Apple-Google, Apple-Pixar, Google-Intuit, and Google-Intel) suppressed wages of Defendants' employees.

The five cases underlying this consolidated action were initially filed in California Superior Court and removed to federal court. See ECF No. 532 at 5. The cases were related by Judge Saundra Brown Armstrong, who also granted a motion to transfer the related actions to the San Jose Division. See ECF Nos. 52, 58. After being assigned to the undersigned judge, the cases were consolidated pursuant to the parties' stipulation. See ECF No. 64. Plaintiffs filed a consolidated complaint on September 23, 2011, see ECF No. 65, which Defendants jointly moved to dismiss, see ECF No. 79. In addition, Lucasfilm filed a separate motion to dismiss on October 17, 2011. See ECF No. 83. The Court granted in part and denied in part the joint motion to dismiss and denied Lucasfilm's separate motion to dismiss. See ECF No. 119.

On October 1, 2012, Plaintiffs filed a motion for class certification. See ECF No. 187. The motion sought certification of a class of all of the seven Defendants' employees or, in the alternative, a narrower class of just technical employees of the seven Defendants. After full briefing and a hearing, the Court denied class certification on April 5, 2013. See ECF No. 382. The Court was concerned that Plaintiffs' documentary evidence and empirical analysis were insufficient to determine that common questions predominated over individual questions with respect to the issue of antitrust impact. See id. at 33. Moreover, the Court expressed concern that there was insufficient analysis in the class certification motion regarding the class of technical employees. Id. at 29. The Court afforded Plaintiffs leave to amend to address the Court's concerns. See id. at 52.

On May 10, 2013, Plaintiffs filed their amended class certification motion, seeking to certify only the narrower class of technical employees. See ECF No. 418. Defendants filed their opposition on June 21, 2013, ECF No. 439, and Plaintiffs filed their reply on July 12, 2013, ECF No. 455. The hearing on the amended motion was set for August 5, 2013.

On July 12 and 30, 2013, after class certification had been initially denied and while an amended motion was pending, Plaintiffs settled with Pixar, Lucasfilm, and Intuit (hereafter, "Settled Defendants"). See ECF Nos. 453, 489. Plaintiffs filed a motion for preliminary approval of the settlements with Settled Defendants on September 21, 2013. See ECF No. 501. No opposition to the motion was filed, and the Court granted the motion on October 30, 2013, following a hearing on October 21, 2013. See ECF No. 540. The Court held a fairness hearing on May 1, 2014, ECF No. 913, and granted final approval of the settlements and accompanying requests for attorneys' fees, costs, and incentive awards over five objections on May 16, 2014, ECF Nos. 915-16. Judgment was entered as to the Settled Defendants on June 20, 2014. ECF No. 947.

After the Settled Defendants settled, this Court certified a class of technical employees of the seven Defendants (hereafter, "the Class") on October 25, 2013 in an 86-page order granting Plaintiffs' amended class certification motion. See ECF No. 532. The Remaining Defendants petitioned the Ninth Circuit to review that order under Federal Rule of Civil Procedure 23(f). After full briefing, including the filing of an amicus brief by the National and California Chambers of Commerce and the National Association of Manufacturing urging the Ninth Circuit to grant review, the Ninth Circuit denied review on January 15, 2014. See ECF No. 594.

Meanwhile, in this Court, the Remaining Defendants filed a total of five motions for summary judgment and filed motions to strike and to exclude the testimony of Plaintiffs' principal expert on antitrust impact and damages, Dr. Edward Leamer, who opined that the total damages to the Class exceeded $3 billion in wages Class members would have earned in the absence of the anti-solicitation agreements.1 The Court denied the motions for summary judgment on March 28, 2014, and on April 4, 2014, denied the motion to exclude Dr. Leamer and denied in large part the motion to strike Dr. Leamer's testimony. ECF Nos. 777, 788.

On April 24, 2014, counsel for Plaintiffs and counsel for Remaining Defendants sent a joint letter to the Court indicating that they had reached a settlement. See ECF No. 900. This settlement was reached two weeks before the Final Pretrial Conference and one month before the trial was set to commence.2 Upon receipt of the joint letter, the Court vacated the trial date and pretrial deadlines and set a schedule for preliminary approval. See ECF No. 904. Shortly after counsel sent the letter, the media disclosed the total amount of the settlement, and this Court received three letters from individuals, not including Devine, objecting to the proposed settlement in response to media reports of the settlement amount.3 See ECF No. 914. On May 22, 2014, in accordance with this Court's schedule, Plaintiffs filed their Motion for Preliminary Approval. See ECF No. 920. Devine filed an Opposition on June 5, 2014.4 See ECF No. 934. Plaintiffs filed a Reply on June 12, 2014. See ECF No. 938. The Court held a hearing on June 19, 2014. See ECF No. 948. After the hearing, the Court received a letter from a Class member in opposition to the proposed settlement and two letters from Class members in support of the proposed settlement. See ECF Nos. 949-51.

II. LEGAL STANDARD

The Court must review the fairness of class action settlements under Federal Rule of Civil Procedure 23(e). The Rule states that "[t]he claims, issues, or defenses of a certified class may be settled, voluntarily dismissed, or compromised only with the court's approval." The Rule requires the Court to "direct notice in a reasonable manner to all class members who would be bound by the proposal" and further states that if a settlement "would bind class members, the court may approve it only after a hearing and on finding that it is fair, reasonable, and adequate." Fed. R. Civ. P. 23(e)(1)-(2). The principal purpose of the Court's supervision of class action settlements is to ensure "the agreement is not the product of fraud or overreaching by, or collusion between, the negotiating parties." Officers for Justice v. Civil Serv. Comm'n of City & Cnty. of S.F., 688 F.2d 615, 625 (9th Cir. 1982).

District courts have interpreted Rule 23(e) to require a two-step process for the approval of class action settlements: "the Court first determines whether a proposed class action settlement deserves preliminary approval and then, after notice is given to class members, whether final approval is warranted." Nat'l Rural Telecomms. Coop. v. DIRECTV, Inc., 221 F.R.D. 523, 525 (C.D. Cal. 2004). At the final approval stage, the Ninth Circuit has stated that "[a]ssessing a settlement proposal requires the district court to balance a number of factors: the strength of the plaintiffs' case; the risk, expense, complexity, and likely duration of further litigation; the risk of maintaining class action status throughout the trial; the amount offered in settlement; the extent of discovery completed and the stage of the proceedings; the experience and views of counsel; the presence of a governmental participant; and the reaction of the class members to the proposed settlement." Hanlon v. Chrysler Corp., 150 F.3d 1011, 1026 (9th Cir. 1998).

In contrast to these well-established, non-exhaustive factors for final approval, there is relatively scant appellate authority regarding the standard that a district court must apply in reviewing a settlement at the preliminary approval stage. Some district courts, echoing commentators, have stated that the relevant inquiry is whether the settlement "falls within the range of possible approval" or "within the range of reasonableness." In re Tableware Antitrust Litig., 484 F. Supp. 2d 1078, 1079 (N.D. Cal. 2007); see also Cordy v. USS-Posco Indus., No. 12-553, 2013 WL 4028627, at *3 (N.D. Cal. Aug. 1, 2013) ("Preliminary approval of a settlement and notice to the proposed class is appropriate if the proposed settlement appears to be the product of serious, informed, non-collusive negotiations, has no obvious deficiencies, does not improperly grant preferential treatment to class representatives or segments of the class, and falls with the range of possible approval." (internal quotation marks omitted)). To undertake this analysis, the Court "must consider plaintiffs' expected recovery balanced against the value of the settlement offer." In re Nat'l Football League Players' Concussion Injury Litig., 961 F. Supp. 2d 708, 714 (E.D. Pa. 2014) (internal quotation marks omitted).

III. DISCUSSION

Pursuant to the terms of the instant settlement, Class members who have not already opted out and who do not opt out will relinquish their rights to file suit against the Remaining Defendants for the claims at issue in this case. In exchange, Remaining Defendants will pay a total of $324.5 million, of which Plaintiffs' counsel may seek up to 25% (approximately $81 million) in attorneys' fees, $1.2 million in costs, and $80,000 per class representative in incentive payments. In addition, the settlement allows Remaining Defendants a pro rata reduction in the total amount they must pay if more than 4% of Class members opt out after receiving notice.5 Class members would receive an average of approximately $3,750 from the instant settlement if the Court were to grant all requested deductions and there were no further opt-outs.7

The Court finds the total settlement amount falls below the range of reasonableness. The Court is concerned that Class members recover less on a proportional basis from the instant settlement with Remaining Defendants than from the settlement with the Settled Defendants a year ago, despite the fact that the case has progressed consistently in the Class's favor since then. Counsel's sole explanation for this reduced figure is that there are weaknesses in Plaintiffs' case such that the Class faces a substantial risk of non-recovery. However, that risk existed and was even greater when Plaintiffs settled with the Settled Defendants a year ago, when class certification had been denied.

The Court begins by comparing the instant settlement with Remaining Defendants to the settlements with the Settled Defendants, in light of the facts that existed at the time each settlement was reached. The Court then discusses the relative strengths and weaknesses of Plaintiffs' case to assess the reasonableness of the instant settlement.

A. Comparison to the Initial Settlements

1. Comparing the Settlement Amounts

The Court finds that the settlements with the Settled Defendants provide a useful benchmark against which to analyze the reasonableness of the instant settlement. The settlements with the Settled Defendants led to a fund totaling $20 million. See ECF No. 915 at 3. In approving the settlements, the Court relied upon the fact that the Settled Defendants employed 8% of Class members and paid out 5% of the total Class compensation during the Class period. See ECF No. 539 at 16:20-22 (Plaintiffs' counsel's explanation at the preliminary approval hearing with the Settled Defendants that the 5% figure "giv[es] you a sense of how big a slice of the case this settlement is relative to the rest of the case"). If Remaining Defendants were to settle at the same (or higher) rate as the Settled Defendants, Remaining Defendants' settlement fund would need to total at least $380 million. This number results from the fact that Remaining Defendants paid out 95% of the Class compensation during the Class period, while Settled Defendants paid only 5% of the Class compensation during the Class period.8

At the hearing on the instant Motion, counsel for Remaining Defendants suggested that the relevant benchmark is not total Class compensation, but rather is total Class membership. This would result in a benchmark figure for the Remaining Defendants of $230 million0. At a minimum, counsel suggested, the Court should compare the settlement amount to a range of $230 million to $380 million, within which the instant settlement falls. The Court rejects counsel's suggestion, which is contrary to the record. Counsel has provided no basis for why the number of Class members employed by each Defendant is a relevant metric. To the contrary, the relevant inquiry has always been total Class compensation. For example, in both of the settlements with the Settled Defendants and in the instant settlement, the Plans of Allocation call for determining each individual Class member's pay out by dividing the Class member's compensation during the Class period by the total Class compensation during the Class period. ECF No. 809 at 6 (noting that the denominator in the plan of allocation in the settlements with the Settled Defendants is the "total of base salaries paid to all approved Claimants in class positions during the Class period"); ECF No. 920 at 22 (same in the instant settlement); see also ECF No. 539 at 16:20-22 (Plaintiffs' counsel's statement that percent of the total Class compensation was relevant for benchmarking the settlements with the Settled Defendants to the rest of the case). At no point in the record has the percentage of Class membership employed by each Defendant ever been the relevant factor for determining damages exposure. Accordingly, the Court rejects the metric proposed by counsel for Remaining Defendants. Using the Settled Defendants' settlements as a yardstick, the appropriate benchmark settlement for the Remaining Defendants would be at least $380 million, more than $50 million greater than what the instant settlement provides.

Counsel for Remaining Defendants also suggested that benchmarking against the initial settlements would be inappropriate because the magnitude of the settlement numbers for Remaining Defendants dwarfs the numbers at issue in the Settled Defendants' settlements. This argument is premised on the idea that Defendants who caused more damage to the Class and who benefited more by suppressing a greater portion of class compensation should have to pay less than Defendants who caused less damage and who benefited less from the allegedly wrongful conduct. This argument is unpersuasive. Remaining Defendants are alleged to have received 95% of the benefit of the anti-solicitation agreements and to have caused 95% of the harm suffered by the Class in terms of lost compensation. Therefore, Remaining Defendants should have to pay at least 95% of the damages, which, under the instant settlement, they would not.

The Court also notes that had Plaintiffs prevailed at trial on their more than $3 billion damages claim, antitrust law provides for automatic trebling, see 15 U.S.C. § 15(a), so the total damages award could potentially have exceeded $9 billion. While the Ninth Circuit has not determined whether settlement amounts in antitrust cases must be compared to the single damages award requested by Plaintiffs or the automatically trebled damages amount, see Rodriguez v. W. Publ'g Corp., 563 F.3d 948, 964-65 (9th Cir. 2009), the instant settlement would lead to a total recovery of 11.29% of the single damages proposed by Plaintiffs' expert or 3.76% of the treble damages. Specifically, Dr. Leamer has calculated the total damages to the Class resulting from Defendants' allegedly unlawful conduct as $3.05 billion. See ECF No. 856-10. If the Court approves the instant settlements, the total settlements with all Defendants would be $344.5 million. This total would amount to 11.29% of the single damages that Dr. Leamer opines the Class suffered or 3.76% if Dr. Leamer's damages figure had been trebled.

2. Relative Procedural Posture

The discount that Remaining Defendants have received vis-a-vis the Settled Defendants is particularly troubling in light of the changes in the procedural posture of the case between the two settlements, changes that the Court would expect to have increased, rather than decreased, Plaintiffs' bargaining power. Specifically, at the time the Settled Defendants settled, Plaintiffs were at a particularly weak point in their case. Though Plaintiffs had survived Defendants' motion to dismiss, Plaintiffs' motion for class certification had been denied, albeit without prejudice. Plaintiffs had re-briefed the class certification motion, but had no class certification ruling in their favor at the time they settled with the Settled Defendants. If the Court ultimately granted certification, Plaintiffs also did not know whether the Ninth Circuit would grant Federal Rule of Civil Procedure 23(f) review and reverse the certification. Accordingly, at that point, Defendants had significant leverage.

In contrast, the procedural posture of the case swung dramatically in Plaintiffs' favor after the initial settlements were reached. Specifically, the Court certified the Class over the vigorous objections of Defendants. In the 86-page order granting class certification, the Court repeatedly referred to Plaintiffs' evidence as "substantial" and "extensive," and the Court stated that it "could not identify a case at the class certification stage with the level of documentary evidence Plaintiffs have presented in the instant case." ECF No. 531 at 69. Thereafter, the Ninth Circuit denied Defendants' request to review the class certification order under Federal Rule of Civil Procedure 23(f). This Court also denied Defendants' five motions for summary judgment and denied Defendants' motion to exclude Plaintiffs' principal expert on antitrust impact and damages. The instant settlement was reached a mere two weeks before the final pretrial conference and one month before a trial at which damaging evidence regarding Defendants would have been presented.

In sum, Plaintiffs were in a much stronger position at the time of the instant settlement—after the Class had been certified, appellate review of class certification had been denied, and Defendants' dispositive motions and motion to exclude Dr. Leamer's testimony had been denied—than they were at the time of the settlements with the Settled Defendants, when class certification had been denied. This shift in the procedural posture, which the Court would expect to have increased Plaintiffs' bargaining power, makes the more recent settlements for a proportionally lower amount even more troubling.

B. Strength of Plaintiffs' Case

The Court now turns to the strength of Plaintiffs' case against the Remaining Defendants to evaluate the reasonableness of the settlement.

At the hearing on the instant Motion, Plaintiffs' counsel contended that one of the reasons the instant settlement was proportionally lower than the previous settlements is that the documentary evidence against the Settled Defendants (particularly, Lucasfilm and Pixar) is more compelling than the documentary evidence against the Remaining Defendants. As an initial matter, the Court notes that relevant evidence regarding the Settled Defendants would be admissible at a trial against Remaining Defendants because Plaintiffs allege an overarching conspiracy that included all Defendants. Accordingly, evidence regarding the role of Lucasfilm and Pixar in the creation of and the intended effect of the overarching conspiracy would be admissible.

Nonetheless, the Court notes that Plaintiffs are correct that there are particularly clear statements from Lucasfilm and Pixar executives regarding the nature and goals of the alleged conspiracy. Specifically, Edward Catmull (Pixar President) conceded in his deposition that anti-solicitation agreements were in place because solicitation "messes up the pay structure." ECF No. 431-9 at 81. Similarly, George Lucas (former Lucasfilm Chairman of the Board and CEO) stated, "we cannot get into a bidding war with other companies because we don't have the margins for that sort of thing." ECF No. 749-23 at 9.

However, there is equally compelling evidence that comes from the documents of the Remaining Defendants. This is particularly true for Google and Apple, the executives of which extensively discussed and enforced the anti-solicitation agreements. Specifically, as discussed in extensive detail in this Court's previous orders, Steve Jobs (Co-Founder, Former Chairman, and Former CEO of Apple, Former CEO of Pixar), Eric Schmidt (Google Executive Chairman, Member of the Board of Directors, and former CEO), and Bill Campbell (Chairman of Intuit Board of Directors, Co-Lead Director of Apple, and advisor to Google) were key players in creating and enforcing the anti-solicitation agreements. The Court now turns to the evidence against the Remaining Defendants that the finder of fact is likely to find compelling.

There is substantial and compelling evidence that Steve Jobs (Co-Founder, Former Chairman, and Former CEO of Apple, Former CEO of Pixar) was a, if not the, central figure in the alleged conspiracy. Several witnesses, in their depositions, testified to Mr. Jobs' role in the anti-solicitation agreements. For example, Eric Schmidt (Google Executive Chairman, Member of the Board of Directors, and former CEO) stated that Mr. Jobs "believed that you should not be hiring each others', you know, technical people" and that "it was inappropriate in [Mr. Jobs'] view for us to be calling in and hiring people." ECF No. 819-12 at 77. Edward Catmull (Pixar President) stated that Mr. Jobs "was very adamant about protecting his employee force." ECF No. 431-9 at 97. Sergey Brin (Google Co-Founder) testified that "I think Mr. Jobs' view was that people shouldn't piss him off. And I think that things that pissed him off were—would be hiring, you know—whatever." ECF No. 639-1 at 112. There would thus be ample evidence Mr. Jobs was involved in expanding the original anti-solicitation agreement between Lucasfilm and Pixar to the other Defendants in this case. After the agreements were extended, Mr. Jobs played a central role in enforcing these agreements. Four particular sets of evidence are likely to be compelling to the fact-finder.

First, after hearing that Google was trying to recruit employees from Apple's Safari team, Mr. Jobs threatened Mr. Brin, stating, as Mr. Brin recounted, "if you hire a single one of these people that means war." ECF No. 833-15.9 In an email to Google's Executive Management Team as well as Bill Campbell (Chairman of Intuit Board of Directors, Co-Lead Director of Apple, and advisor to Google), Mr. Brin advised: "lets [sic] not make any new offers or contact new people at Apple until we have had a chance to discuss." Id. Mr. Campbell then wrote to Mr. Jobs: "Eric [Schmidt] told me that he got directly involved and firmly stopped all efforts to recruit anyone from Apple." ECF No. 746-5. As Mr. Brin testified in his deposition, "Eric made a—you know, a—you know, at least some kind of—had a conversation with Bill to relate to Steve to calm him down." ECF No. 639-1 at 61. As Mr. Schmidt put it, "Steve was unhappy, and Steve's unhappiness absolutely influenced the change we made in recruiting practice." ECF No. 819-12 at 21. Danielle Lambert (Apple's head of Human Resources) reciprocated to maintain Apple's end of the anti-solicitation agreements, instructing Apple recruiters: "Please add Google to your 'hands-off list. We recently agreed not to recruit from one another so if you hear of any recruiting they are doing against us, please be sure to let me know." ECF No. 746-15.

Second, other Defendants' CEOs maintained the anti-solicitation agreements out of fear of and deference to Mr. Jobs. For example, in 2005, when considering whether to enter into an anti-solicitation agreement with Apple, Bruce Chizen (former Adobe CEO), expressed concerns about the loss of "top talent" if Adobe did not enter into an anti-solicitation agreement with Apple, stating, "if I tell Steve it's open season (other than senior managers), he will deliberately poach Adobe just to prove a point. Knowing Steve, he will go after some of our top Mac talent like Chris Cox and he will do it in a way in which they will be enticed to come (extraordinary packages and Steve wooing)."10 ECF No. 297-15.

This was the genesis of the Apple-Adobe agreement. Specifically, after Mr. Jobs complained to Mr. Chizen on May 26, 2005 that Adobe was recruiting Apple employees, ECF No. 291-17, Mr. Chizen responded by saying, "I thought we agreed not to recruit any senior level employees . . . . I would propose we keep it that way. Open to discuss. It would be good to agree." Id. Mr. Jobs was not satisfied, and replied by threatening to send Apple recruiters after Adobe's employees: "OK, I'll tell our recruiters that they are free to approach any Adobe employee who is not a Sr. Director or VP. Am I understanding your position correctly?" Id. Mr. Chizen immediately gave in: "I'd rather agree NOT to actively solicit any employee from either company . . . . If you are in agreement I will let my folks know." Id. (emphasis in original). The next day, Theresa Townsley (Adobe Vice President Human Resources) announced to her recruiting team, "Bruce and Steve Jobs have an agreement that we are not to solicit ANY Apple employees, and vice versa." ECF No. 291-18 (emphasis in original). Adobe then placed Apple on its "[c]ompanies that are off limits" list, which instructed Adobe employees not to cold call Apple employees. ECF No. 291-11.

Google took even more drastic actions in response to Mr. Jobs. For example, when a recruiter from Google's engineering team contacted an Apple employee in 2007, Mr. Jobs forwarded the message to Mr. Schmidt and stated, "I would be very pleased if your recruiting department would stop doing this." ECF No. 291-23. Google responded by making a "public example" out of the recruiter and "terminat[ing] [the recruiter] within the hour." Id. The aim of this public spectacle was to "(hopefully) prevent future occurrences." Id. Once the recruiter was terminated, Mr. Schmidt emailed Mr. Jobs, apologizing and informing Mr. Jobs that the recruiter had been terminated. Mr. Jobs forwarded Mr. Schmidt's email to an Apple human resources official and stated merely, ":)." ECF No. 746-9.

A year prior to this termination, Google similarly took seriously Mr. Jobs' concerns. Specifically, in 2006, Mr. Jobs emailed Mr. Schmidt and said, "I am told that Googles [sic] new cell phone software group is relentlessly recruiting in our iPod group. If this is indeed true, can you put a stop to it?" ECF No. 291-24 at 3. After Mr. Schmidt forwarded this to Human Resources professionals at Google, Arnnon Geshuri (Google Recruiting Director) prepared a detailed report stating that an extensive investigation did not find a breach of the anti-solicitation agreement.

Similarly, in 2006, Google scrapped plans to open a Google engineering center in Paris after a Google executive emailed Mr. Jobs to ask whether Google could hire three former Apple engineers to work at the prospective facility, and Mr. Jobs responded "[w]e'd strongly prefer that you not hire these guys." ECF No. 814-2. The whole interaction began with Google's request to Steve Jobs for permission to hire Jean-Marie Hullot, an Apple engineer. The record is not clear whether Mr. Hullot was a current or former Apple employee. A Google executive contacted Steve Jobs to ask whether Google could make an offer to Mr. Hullot, and Mr. Jobs did not timely respond to the Google executive's request. At this point, the Google executive turned to Intuit's Board Chairman Bill Campbell as a potential ambassador from Google to Mr. Jobs. Specifically, the Google executive noted that Mr. Campbell "is on the board at Apple and Google, so Steve will probably return his call." ECF No. 428-6. The same day that Mr. Campbell reached out to Mr. Jobs, Mr. Jobs responded to the Google executive, seeking more information on what exactly the Apple engineer would be working. ECF No. 428-9. Once Mr. Jobs was satisfied, he stated that the hire "would be fine with me." Id. However, two weeks later, when Mr. Hullot and a Google executive sought Mr. Jobs' permission to hire four of Mr. Hullot's former Apple colleagues (three were former Apple employees and one had given notice of impending departure from Apple), Mr. Jobs promptly responded, indicating that the hires would not be acceptable. ECF No. 428-9. Google promptly scrapped the plan, and the Google executive responded deferentially to Mr. Jobs, stating, "Steve, Based on your strong preference that we not hire the ex-Apple engineers, Jean-Marie and I decided not to open a Google Paris engineering center." Id. The Google executive also forwarded the email thread to Mr. Brin, Larry Page (Google Co-Founder), and Mr. Campbell. Id.

Third, Mr. Jobs attempted (unsuccessfully) to expand the anti-solicitation agreements to Palm, even threatening litigation. Specifically, Mr. Jobs called Edward Colligan (former President and CEO of Palm) to ask Mr. Colligan to enter into an anti-solicitation agreement and threatened patent litigation against Palm if Palm refused to do so. ECF No. 293 ¶¶ 6-8. Mr. Colligan responded via email, and told Mr. Jobs that Mr. Jobs' "proposal that we agree that neither company will hire the other's employees, regardless of the individual's desires, is not only wrong, it is likely illegal." Id. at 4-5. Mr. Colligan went on to say that, "We can't dictate where someone will work, nor should we try. I can't deny people who elect to pursue their livelihood at Palm the right to do so simply because they now work for Apple, and I wouldn't want you to do that to current Palm employees." Id. at 5. Finally, Mr. Colligan wrote that "[t]hreatening Palm with a patent lawsuit in response to a decision by one employee to leave Apple is just out of line. A lawsuit would not serve either of our interests, and will not stop employees from migrating between our companies . . . . We will both just end up paying a lot of lawyers a lot of money." Id. at 5-6. Mr. Jobs wrote the following back to Mr. Colligan: "This is not satisfactory to Apple." Id. at 8. Mr. Jobs went on to write that "I'm sure you realize the asymmetry in the financial resources of our respective companies when you say: 'we will both just end up paying a lot of lawyers a lot of money.'" Id. Mr. Jobs concluded: "My advice is to take a look at our patent portfolio before you make a final decision here." Id.

Fourth, Apple's documents provide strong support for Plaintiffs' theory of impact, namely that rigid wage structures and internal equity concerns would have led Defendants to engage in structural changes to compensation structures to mitigate the competitive threat that solicitation would have posed. Apple's compensation data shows that, for each year in the Class period, Apple had a "job structure system," which included categorizing and compensating its workforce according to a discrete set of company-wide job levels assigned to all salaried employees and four associated sets of base salary ranges applicable to "Top," "Major," "National," and "Small" geographic markets. ECF No. 745-7 at 14-15, 52-53; ECF No.517-16 ¶¶ 6, 10 & Ex. B. Every salary range had a "min," "mid," and "max" figure. See id. Apple also created a Human Resources and recruiting tool called "Merlin," which was an internal system for tracking employee records and performance, and required managers to grade employees at one of four pre-set levels. See ECF No. 749-6 at 142-43, 145-46; ECF No. 749-11 at 52-53; ECF No. 749-12 at 33. As explained by Tony Fadell (former Apple Senior Vice President, iPod Division, and advisor to Steve Jobs), Merlin "would say, this is the employee, this is the level, here are the salary ranges, and through that tool we were then—we understood what the boundaries were." ECF No. 749-11 at 53. Going outside these prescribed "guidelines" also required extra approval. ECF No. 749-7 at 217; ECF No. 749-11 at 53 ("And if we were to go outside of that, then we would have to pull in a bunch of people to then approve anything outside of that range.").

Concerns about internal equity also permeated Apple's compensation program. Steven Burmeister (Apple Senior Director of Compensation) testified that internal equity—which Mr. Burmeister defined as the notion of whether an employee's compensation is "fair based on the individual's contribution relative to the other employees in your group, or across your organization"—inheres in some, "if not all," of the guidelines that managers consider in determining starting salaries. ECF No. 745-7 at 61-64; ECF No. 753-12. In fact, as explained by Patrick Burke (former Apple Technical Recruiter and Staffing Manager), when hiring a new employee at Apple, "compar[ing] the candidate" to the other people on the team they would join "was the biggest determining factor on what salary we gave." ECF No. 745-6 at 279.

The evidence against Google is equally compelling. Email evidence reveals that Eric Schmidt (Google Executive Chairman, Member of the Board of Directors, and former CEO) terminated at least two recruiters for violations of anti-solicitation agreements, and threatened to terminate more. As discussed above, there is direct evidence that Mr. Schmidt terminated a recruiter at Steve Jobs' behest after the recruiter attempted to solicit an Apple employee. Moreover, in an email to Bill Campbell (Chairman of Intuit Board of Directors, Co-Lead Director of Apple, and advisor to Google), Mr. Schmidt indicated that he directed a for-cause termination of another Google recruiter, who had attempted to recruit an executive of eBay, which was on Google's do-not-cold-call list. ECF No. 814-14. Finally, as discussed in more detail below, Mr. Schmidt informed Paul Otellini (CEO of Intel and Member of the Google Board of Directors) that Mr. Schmidt would terminate any recruiter who recruited Intel employees.

Furthermore, Google maintained a formal "Do Not Call" list, which grouped together Apple, Intel, and Intuit and was approved by top executives. ECF No. 291-28. The list also included other companies, such as Genentech, Paypal, and eBay. Id. A draft of the "Do Not Call" list was presented to Google's Executive Management Group, a committee consisting of Google's senior executives, including Mr. Schmidt, Larry Page (Google Co-Founder), Sergey Brin (Google Co-Founder), and Shona Brown (former Google Senior Vice President of Business Operations). ECF No. 291-26. Mr. Schmidt approved the list. See id.; see also ECF No. 291-27 (email from Mr. Schmidt stating: "This looks very good."). Moreover, there is evidence that Google executives knew that the anti-solicitation agreements could lead to legal troubles, but nevertheless proceeded with the agreements. When Ms. Brown asked Mr. Schmidt whether he had any concerns with sharing information regarding the "Do Not Call" list with Google's competitors, Mr. Schmidt responded that he preferred that it be shared "verbally[,] since I don't want to create a paper trail over which we can be sued later?" ECF No. 291-40. Ms. Brown responded: "makes sense to do orally. i agree." Id.

Google's response to competition from Facebook also demonstrates the impact of the alleged conspiracy. Google had long been concerned about Facebook hiring's effect on retention. For example, in an email to top Google executives, Mr. Brin in 2007 stated that "the facebook phenomenon creates a real retention problem." ECF No. 814-4. A month later, Mr. Brin announced a policy of making counteroffers within one hour to any Google employee who received an offer from Facebook. ECF No. 963-2.

In March 2008, Arnnon Geshuri (Google Recruiting Director) discovered that non-party Facebook had been cold calling into Google's Site Reliability Engineering ("SRE") team. Mr. Geshuri's first response was to suggest contacting Sheryl Sandberg (Chief Operating Officer for non-party Facebook) in an effort to "ask her to put a stop to the targeted sourcing effort directed at our SRE team" and "to consider establishing a mutual 'Do Not Call' agreement that specifies that we will not cold-call into each other." ECF No. 963-3. Mr. Geshuri also suggested "look[ing] internally and review[ing] the attrition rate for the SRE group," stating, "[w]e may want to consider additional individual retention incentives or team incentives to keep attrition as low as possible in SRE." Id. (emphasis added). Finally, an alternative suggestion was to "[s]tart an aggressive campaign to call into their company and go after their folks—no holds barred. We would be unrelenting and a force of nature." Id. In response, Bill Campbell (Chairman of Intuit Board of Directors, Co-Lead Director of Apple, and advisor to Google), in his capacity as an advisor to Google, suggested "Who should contact Sheryl Sandberg to get a cease fire? We have to get a truce." Id. Facebook refused.

In 2010, Google altered its salary structure with a "Big Bang" in response to Facebook's hiring, which provides additional support for Plaintiffs' theory of antitrust impact. Specifically, after a period in which Google lost a significant number of employees to Facebook, Google began to study Facebook's solicitation of Google employees. ECF No. 190 ¶ 109. One month after beginning this study, Google announced its "Big Bang," which involved an increase to the base salary of all of its salaried employees by 10% and provided an immediate cash bonus of $1,000 to all employees. ECF No. 296-18. Laszlo Bock (Google Senior Vice President of People Operations) explained that the rationale for the Big Bang included: (1) being "responsive to rising attrition;" (2) supporting higher retention because "higher salaries generate higher fixed costs;" and (3) being "very strategic because start-ups don't have the cash flow to match, and big companies are (a) too worried about internal equity and scalability to do this and (b) don't have the margins to do this." ECF No. 296-20.

Other Google documents provide further evidence of Plaintiffs' theory of antitrust impact. For example, Google's Chief Culture Officer stated that "[c]old calling into companies to recruit is to be expected unless they're on our 'don't call' list." ECF No. 291-41. Moreover, Google found that although referrals were the largest source of hires, "agencies and passively sourced candidates offer[ed] the highest yield." ECF No. 780-8. The spread of information between employees had there been active solicitations—which is central to Plaintiffs' theory of impact—is also demonstrated in Google's evidence. For example, one Google employee states that "[i]t's impossible to keep something like this a secret. The people getting counter offers talk, not just to Googlers and ex-Googlers, but also to the competitors where they received their offers (in the hopes of improving them), and those competitors talk too, using it as a tool to recruit more Googlers." ECF No. 296-23.

The wage structure and internal equity concerns at Google also support Plaintiffs' theory of impact. Google had many job families, many grades within job families, and many job titles within grades. See, e.g., ECF No. 298-7, ECF No. 298-8; see also Cisneros Decl., Ex. S (Brown Depo.) at 74-76 (discussing salary ranges utilized by Google); ECF No. 780-4 at 25-26 (testifying that Google's 2007 salary ranges had generally the same structure as the 2004 salary ranges). Throughout the Class period, Google utilized salary ranges and pay bands with minima and maxima and either means or medians. ECF No. 958-1 ¶ 66; see ECF No. 427-3 at 15-17. As explained by Shona Brown (former Google Senior Vice President, Business Operations), "if you discussed a specific role [at Google], you could understand that role was at a specific level on a certain job ladder." ECF No. 427-3 at 27-28; ECF No. 745-11. Frank Wagner (Google Director of Compensation) testified that he could locate the target salary range for jobs at Google through an internal company website. See ECF No. 780-4 at 31-32 ("Q: And if you wanted to identify what the target salary would be for a certain job within a certain grade, could you go online or go to some place . . . and pull up what that was for that job family and that grade? . . . A: Yes."). Moreover, Google considered internal equity to be an important goal. Google utilized a salary algorithm in part for the purpose of "[e]nsur[ing] internal equity by managing salaries within a reasonable range." ECF No. 814-19. Furthermore, because Google "strive[d] to achieve fairness in overall salary distribution," "high performers with low salaries [would] get larger percentage increases than high performers with high salaries." ECF No. 817-1 at 15.

In addition, Google analyzed and compared its equity compensation to Apple, Intel, Adobe, and Intuit, among other companies, each of which it designated as a "peer company" based on meeting criteria such as being a "high-tech company," a "high-growth company," and a "key labor market competitor." ECF No. 773-1. In 2007, based in part on an analysis of Google as compared to its peer companies, Mr. Bock and Dave Rolefson (Google Equity Compensation Manager) wrote that "[o]ur biggest labor market competitors are significantly exceeding their own guidelines to beat Google for talent." Id.

Finally, Google's own documents undermine Defendants' principal theory of lack of antitrust impact, that compensation decisions would be one off and not classwide. Alan Eustace (Google Senior Vice President) commented on concerns regarding competition for workers and Google's approach to counteroffers by noting that, "it sometimes makes sense to make changes in compensation, even if it introduces discontinuities in your current comp, to save your best people, and send a message to the hiring company that we'll fight for our best people." ECF No. 296-23. Because recruiting "a few really good people" could inspire "many, many others [to] follow," Mr. Eustace concluded, "[y]ou can't afford to be a rich target for other companies." Id. According to him, the "long-term . . . right approach is not to deal with these situations as one-offs but to have a systematic approach to compensation that makes it very difficult for anyone to get a better offer." Id. (emphasis added).

Google's impact on the labor market before the anti-solicitation agreements was best summarized by Meg Whitman (former CEO of eBay) who called Mr. Schmidt "to talk about [Google's] hiring practices." ECF No. 814-15. As Eric Schmidt told Google's senior executives, Ms. Whitman said "Google is the talk of the valley because [you] are driving up salaries across the board." Id. A year after this conversation, Google added eBay to its do-not-cold-call list. ECF No. 291-28.

There is also compelling evidence against Intel. Google reacted to requests regarding enforcement of the anti-solicitation agreement made by Intel executives similarly to Google's reaction to Steve Jobs' request to enforce the agreements discussed above. For example, after Paul Otellini (CEO of Intel and Member of the Google Board of Directors) received an internal complaint regarding Google's successful recruiting efforts of Intel's technical employees on September 26, 2007, ECF No. 188-8 ("Paul, I am losing so many people to Google . . . . We are countering but thought you should know."), Mr. Otellini forwarded the email to Eric Schmidt (Google Executive Chairman, Member of the Board of Directors, and former CEO) and stated "Eric, can you pls help here???" Id. Mr. Schmidt obliged and forwarded the email to his recruiting team, who prepared a report for Mr. Schmidt on Google's activities. ECF No. 291-34. The next day, Mr. Schmidt replied to Mr. Otellini, "If we find that a recruiter called into Intel, we will terminate the recruiter," the same remedy afforded to violations of the Apple-Google agreement. ECF No. 531 at 37. In another email to Mr. Schmidt, Mr. Otellini stated, "Sorry to bother you again on this topic, but my guys are very troubled by Google continuing to recruit our key players." See ECF No. 428-8.

Moreover, Mr. Otellini was aware that the anti-solicitation agreement could be legally troublesome. Specifically, Mr. Otellini stated in an email to another Intel executive regarding the Google-Intel agreement: "Let me clarify. We have nothing signed. We have a handshake 'no recruit' between eric and myself. I would not like this broadly known." Id.

Furthermore, there is evidence that Mr. Otellini knew of the anti-solicitation agreements to which Intel was not a party. Specifically, both Sergey Brin (Google Co-Founder) and Mr. Schmidt of Google testified that they would have told Mr. Otellini that Google had an anti-solicitation agreement with Apple. ECF No. 639-1 at 74:15 ("I'm sure that we would have mentioned it[.]"); ECF No. 819-12 at 60 ("I'm sure I spoke with Paul about this at some point."). Intel's own expert testified that Mr. Otellini was likely aware of Google's other bilateral agreements by virtue of Mr. Otellini's membership on Google's board. ECF No. 771 at 4. The fact that Intel was added to Google's do-not-cold-call list on the same day that Apple was added further suggests Intel's participation in an overarching conspiracy. ECF No. 291-28.

Additionally, notwithstanding the fact that Intel and Google were competitors for talent, Mr. Otellini "lifted from Google" a Google document discussing the bonus plans of peer companies including Apple and Intel. Cisneros Decl., Ex. 463. True competitors for talent would not likely share such sensitive bonus information absent agreements not to compete.

Moreover, key documents related to antitrust impact also implicate Intel. Specifically, Intel recognized the importance of cold calling and stated in its "Complete Guide to Sourcing" that "[Cold] [c]alling candidates is one of the most efficient and effective ways to recruit." ECF No. 296-22. Intel also benchmarked compensation against other "tech companies generally considered comparable to Intel," which Intel defined as a "[b]lend of semiconductor, software, networking, communications, and diversified computer companies." ECF No. 754-2. According to Intel, in 2007, these comparable companies included Apple and Google. Id. These documents suggest, as Plaintiffs contend, that the anti-solicitation agreements led to structural, rather than individual, depression of Class members' wages.

Furthermore, Intel had a "compensation structure," with job grades and job classifications. See ECF No. 745-13 at 73 ("[W]e break jobs into one of three categories—job families, we call them—R&D, tech, and nontech, there's a lot more . . . ."). The company assigned employees to a grade level based on their skills and experience. ECF No. 745-11 at 23; see also ECF No. 749-17 at 45 (explaining that everyone at Intel is assigned a "classification" similar to a job grade). Intel standardized its salary ranges throughout the company; each range applied to multiple jobs, and most jobs spanned multiple salary grades. ECF No. 745-16 at 59. Intel further broke down its salary ranges into quartiles, and compensation at Intel followed "a bell-curve distribution, where most of the employees are in the middle quartiles, and a much smaller percentage are in the bottom and top quartiles." Id. at 62-63.

Intel also used a software tool to provide guidance to managers about an employee's pay range which would also take into account market reference ranges and merit. ECF No. 758-9. As explained by Randall Goodwin (Intel Technology Development Manager), "[i]f the tool recommended something and we thought we wanted to make a proposed change that was outside its guidelines, we would write some justification." ECF No. 749-15 at 52. Similarly, Intel regularly ran reports showing the salary range distribution of its employees. ECF No. 749-16 at 64.

The evidence also supports the rigidity of Intel's wage structure. For example, in a 2004 Human Resources presentation, Intel states that, although "[c]ompensation differentiation is desired by Intel's Meritocracy philosophy," "short and long term high performer differentiation is questionable." ECF No. 758-10 at 13. Indeed, Intel notes that "[l]ack of differentiation has existed historically based on an analysis of '99 data." Id. at 19. As key "[v]ulnerability [c]hallenges," Intel identifies: (1) "[m]anagers (in)ability to distinguish at [f]ocal"—"actual merit increases are significantly reduced from system generated increases," "[l]ong term threat to retention of key players"; (2) "[l]ittle to no actual pay differentiation for HPs [high performers]"; and (3) "[n]o explicit strategy to differentiate." Id. at 24 (emphasis added).

In addition, Intel used internal equity "to determine wage rates for new hires and current employees that correspond to each job's relative value to Intel." ECF No. 749-16 at 210-11; ECF No. 961-5. To assist in that process, Intel used a tool that generates an "Internal Equity Report" when making offers to new employees. ECF No. 749-16 at 212-13. In the words of Ogden Reid (Intel Director of Compensation and Benefits), "[m]uch of our culture screams egalitarianism . . . . While we play lip service to meritocracy, we really believe more in treating everyone the same within broad bands." ECF No. 769-8.

An Intel human resources document from 2002—prior to the anti-solicitation agreements—recognized "continuing inequities in the alignment of base salaries/EB targets between hired and acquired Intel employees" and "parallel issues relating to accurate job grading within these two populations." ECF No. 750-15. In response, Intel planned to: (1) "Review exempt job grade assignments for job families with 'critical skills.' Make adjustments, as appropriate"; and (2) "Validate perception of inequities . . . . Scope impact to employees. Recommend adjustments, as appropriate." Id. An Intel human resources document confirms that, in or around 2004, "[n]ew hire salary premiums drove salary range adjustment." ECF No. 298-5 at 7 (emphasis added).

Intel would "match an Intel job code in grade to a market survey job code in grade," ECF No. 749-16 at 89, and use that as part of the process for determining its "own focal process or pay delivery," id. at 23. If job codes fell below the midpoint, plus or minus a certain percent, the company made "special market adjustment[s]." Id. at 90.

Evidence from Adobe also suggests that Adobe was aware of the impact of its anti-solicitation agreements. Adobe personnel recognized that "Apple would be a great target to look into" for the purpose of recruiting, but knew that they could not do so because, "[u]nfortunately, Bruce [Chizen (former Adobe CEO)] and Apple CEO Steve Jobs have a gentleman's agreement not to poach each other's talent." ECF No. 291-13. Adobe executives were also part and parcel of the group of high-ranking executives that entered into, enforced, and attempted to expand the anti-solicitation agreements. Specifically, Mr. Chizen, in response to discovering that Apple was recruiting employees of Macromedia (a separate entity that Adobe would later acquire), helped ensure, through an email to Mr. Jobs, that Apple would honor Apple's pre-existing anti-solicitation agreements with both Adobe and Macromedia after Adobe's acquisition of Macromedia. ECF No. 608-3 at 50.

Adobe viewed Google and Apple to be among its top competitors for talent and expressed concern about whether Adobe was "winning the talent war." ECF No. 296-3. Adobe further considered itself in a "six-horse race from a benefits standpoint," which included Google, Apple, and Intuit as among the other "horses." See ECF No. 296-4. In 2008, Adobe benchmarked its compensation against nine companies including Google, Apple, and Intel. ECF No. 296-4; cf. ECF No. 652-6 (showing that, in 2010, Adobe considered Intuit to be a "direct peer," and considered Apple, Google, and Intel to be "reference peers," though Adobe did not actually benchmark compensation against these latter companies).

Nevertheless, despite viewing other Defendants as competitors, evidence from Adobe suggests that Adobe had knowledge of the bilateral agreements to which Adobe was not a party. Specifically, Adobe shared confidential compensation information with other Defendants, despite the fact that Adobe viewed at least some of the other Defendants as competitors and did not have a bilateral agreement with them. For example, HR personnel at Intuit and at Adobe exchanged information labeled "confidential" regarding how much compensation each firm would give and to which employees that year. ECF No. 652-8. Adobe and Intuit shared confidential compensation information even though the two companies had no bilateral anti-solicitation agreement, and Adobe viewed Intuit as a direct competitor for talent. Such direct competitors for talent would not likely share such sensitive compensation information in the absence of an overarching conspiracy.

Meanwhile, Google circulated an email that expressly discussed how its "budget is comparable to other tech companies" and compared the precise percentage of Google's merit budget increases to that of Adobe, Apple, and Intel. ECF No. 807-13. Google had Adobe's precise percentage of merit budget increases even though Google and Adobe had no bilateral anti-solicitation agreement. Such sharing of sensitive compensation information among competitors is further evidence of an overarching conspiracy.

Adobe recognized that in the absence of the anti-solicitation agreements, pay increases would be necessary, echoing Plaintiffs' theory of impact. For example, out of concern that one employee—a "star performer" due to his technical skills, intelligence, and collaborative abilities—might leave Adobe because "he could easily get a great job elsewhere if he desired," Adobe considered how best to retain him. ECF No. 799-22. In so doing, Adobe expressed concern about the fact that this employee had already interviewed with four other companies and communicated with friends who worked there. Id. Thus, Adobe noted that the employee "was aware of his value in the market" as well as the fact that the employee's friends from college were "making approximately $15k more per year than he [wa]s." Id. In response, Adobe decided to give the employee an immediate pay raise. Id.

Plaintiffs' theory of impact is also supported by evidence that every job position at Adobe was assigned a job title, and every job title had a corresponding salary range within Adobe's salary structure, which included a salary minimum, middle, and maximum. See ECF No. 804-17 at 4, 8, 72, 85-86. Adobe expected that the distribution of its existing employees' salaries would fit "a bell curve." ECF No. 749-5 at 57. To assist managers in staying within the prescribed ranges for setting and adjusting salaries, Adobe had an online salary planning tool as well as salary matrices, which provided managers with guidelines based on market salary data. See ECF No. 804-17 at 29-30 ("[E]ssentially the salary planning tool is populated with employee information for a particular manager, so the employees on their team [sic]. You have the ability to kind of look at their current compensation. It shows them what the range is for the current role that they're in . . . . The tool also has the ability to provide kind of the guidelines that we recommend in terms of how managers might want to think about spending their allocated budget."). Adobe's practice, if employees were below the minimum recommended salary range, was to "adjust them to the minimum as part of the annual review" and "red flag them." Id. at 12. Deviations from the salary ranges would also result in conversations with managers, wherein Adobe's officers explained, "we have a minimum for a reason because we believe you need to be in this range to be competitive." Id.

Internal equity was important at Adobe, as it was at other Defendants. As explained by Debbie Streeter (Adobe Vice President, Total Rewards), Adobe "always look[ed] at internal equity as a data point, because if you are going to go hire somebody externally that's making . . . more than somebody who's an existing employee that's a high performer, you need to know that before you bring them in." ECF No. 749-5 at 175. Similarly, when considering whether to extend a counteroffer, Adobe advised "internal equity should ALWAYS be considered." ECF No. 746-7 at 5.

Moreover, Donna Morris (Adobe Senior Vice President, Global Human Resources Division) expressed concern "about internal equity due to compression (the market driving pay for new hires above the current employees)." ECF No. 298-9 ("Reality is new hires are requiring base pay at or above the midpoint due to an increasingly aggressive market."). Adobe personnel stated that, because of the fixed budget, they may not be able to respond to the problem immediately "but could look at [compression] for FY2006 if market remains aggressive."11 Id.

D. Weaknesses in Plaintiffs' Case

Plaintiffs contend that though this evidence is compelling, there are also weaknesses in Plaintiffs' case that make trial risky. Plaintiffs contend that these risks are substantial. Specifically, Plaintiffs point to the following challenges that they would have faced in presenting their case to a jury: (1) convincing a jury to find a single overarching conspiracy among the seven Defendants in light of the fact that several pairs of Defendants did not have anti-solicitation agreements with each other; (2) proving damages in light of the fact that Defendants intended to present six expert economists that would attack the methodology of Plaintiffs' experts; and (3) overcoming the fact that Class members' compensation has increased in the last ten years despite a sluggish economy and overcoming general anti-tech worker sentiment in light of the perceived and actual wealth of Class members. Plaintiffs also point to outstanding legal issues, such as the pending motions in limine and the pending motion to determine whether the per se or rule of reason analysis should apply, which could have aided Defendants' ability to present a case that the bilateral agreements had a pro-competitive purpose. See ECF No. 938 at 10-14.

The Court recognizes that Plaintiffs face substantial risks if they proceed to trial. Nonetheless, the Court cannot, in light of the evidence above, conclude that the instant settlement amount is within the range of reasonableness, particularly compared to the settlements with the Settled Defendants and the subsequent development of the litigation. The Court further notes that there is evidence in the record that mitigates at least some of the weaknesses in Plaintiffs' case.

As to proving an overarching conspiracy, several pieces of evidence undermine Defendants' contentions that the bilateral agreements were unrelated to each other. Importantly, two individuals, Steve Jobs (Co-Founder, Former Chairman, and Former CEO of Apple) and Bill Campbell (Chairman of Intuit Board of Directors, Co-Lead Director of Apple, and advisor to Google), personally entered into or facilitated each of the bilateral agreements in this case. Specifically, Mr. Jobs and George Lucas (former Chairman and CEO of Lucasfilm), created the initial anti-solicitation agreement between Lucasfilm and Pixar when Mr. Jobs was an executive at Pixar. Thereafter, Apple, under the leadership of Mr. Jobs, entered into an agreement with Pixar, which, as discussed below, Pixar executives compared to the Lucasfilm-Pixar agreement. It was Mr. Jobs again, who, as discussed above, reached out to Sergey Brin (Google Co-Founder) and Eric Schmidt (Google Executive Chairman, Member of the Board of Directors, and former CEO) to create the Apple-Google agreement. This agreement was reached with the assistance of Mr. Campbell, who was Intuit's Board Chairman, a friend of Mr. Jobs, and an advisor to Google. The Apple-Google agreement was discussed at Google Board meetings, at which both Mr. Campbell and Paul Otellini (Chief Executive Officer of Intel and Member of the Google Board of Directors) were present. ECF No. 819-10 at 47. After discussions between Mr. Brin and Mr. Otellini and between Mr. Schmidt and Mr. Otellini, Intel was added to Google's do-not-cold-call list. Mr. Campbell then used his influence at Google to successfully lobby Google to add Intuit, of which Mr. Campbell was Chairman of the Board of Directors, to Google's do-not-cold-call list. See ECF No. 780-6 at 8-9. Moreover, it was a mere two months after Mr. Jobs entered into the Apple-Google agreement that Apple pressured Bruce Chizen (former CEO of Adobe) to enter into an Apple-Adobe agreement. ECF No. 291-17. As this discussion demonstrates, Mr. Jobs and Mr. Campbell were the individuals most closely linked to the formation of each step of the alleged conspiracy, as they were present in the process of forming each of the links.

In light of the overlapping nature of this small group of executives who negotiated and enforced the anti-solicitation agreements, it is not surprising that these executives knew of the other bilateral agreements to which their own firms were not a party. For example, both Mr. Brin and Mr. Schmidt of Google testified that they would have told Mr. Otellini of Intel that Google had an anti-solicitation agreement with Apple. ECF No. 639-1 at 74:15 ("I'm sure we would have mentioned it[.]"); ECF No. 819-12 at 60 ("I'm sure I spoke with Paul about this at some point."). Intel's own expert testified that Mr. Otellini was likely aware of Google's other bilateral agreements by virtue of Mr. Otellini's membership on Google's board. ECF No. 771 at 4. Moreover, Google recruiters knew of the Adobe-Apple agreement. Id. (Google recruiter's notation that Apple has "a serious 'hands-off policy with Adobe"). In addition, Mr. Schmidt of Google testified that it would be "fair to extrapolate" based on Mr. Schmidt's knowledge of Mr. Jobs, that Mr. Jobs "would have extended [anti-solicitation agreements] to others." ECF No. 638-8 at 170. Furthermore, it was this same mix of top executives that successfully and unsuccessfully attempted to expand the agreement to other companies in Silicon Valley, such as eBay, Facebook, Macromedia, and Palm, as discussed above, suggesting that the agreements were neither isolated nor one off agreements.

In addition, the six bilateral agreements contained nearly identical terms, precluding each pair of Defendants from affirmatively soliciting any of each other's employees. ECF No. 531 at 30. Moreover, as discussed above, Defendants recognized the similarity of the agreements. For example, Google lumped together Apple, Intel, and Intuit on Google's "do-not-cold-call" list. Furthermore, Google's "do-not-cold-call" list stated that the Apple-Google agreement and the Intel-Google agreement commenced on the same date. Finally, in an email, Lori McAdams (Pixar Vice President of Human Resources and Administration) explicitly compared the anti-solicitation agreements, stating that "effective now, we'll follow a gentleman's agreement with Apple that is similar to our Lucasfilm agreement." ECF No. 531 at 26.

As to the contention that Plaintiffs would have to rebut Defendants' contentions that the anti-solicitation agreements aided collaborations and were therefore pro-competitive, there is no documentary evidence that links the anti-solicitation agreements to any collaboration. None of the documents that memorialize collaboration agreements mentions the broad anti-solicitation agreements, and none of the documents that memorialize broad anti-solicitation agreements mentions collaborations. Furthermore, even Defendants' experts conceded that those closest to the collaborations did not know of the anti-solicitation agreements. ECF No. 852-1 at 8. In addition, Defendants' top executives themselves acknowledge the lack of any collaborative purpose. For example, Mr. Chizen of Adobe admitted that the Adobe-Apple anti-solicitation agreement was "not limited to any particular projects on which Apple and Adobe were collaborating." ECF No. 962-7 at 42. Moreover, the U.S. Department of Justice ("DOJ") also determined that the anti-solicitation agreements "were not ancillary to any legitimate collaboration," "were broader than reasonably necessary for the formation or implementation of any collaborative effort," and "disrupted the normal price-setting mechanisms that apply in the labor setting." ECF No. 93-1 ¶ 16; ECF No. 93-4 ¶ 7. The DOJ concluded that Defendants entered into agreements that were restraints of trade that were per se unlawful under the antitrust laws. ECF No. 93-1 ¶ 35; ECF No. 93-4 ¶ 3. Thus, despite the fact that Defendants have claimed since the beginning of this litigation that there were pro-competitive purposes related to collaborations for the anti-solicitation agreements and despite the fact that the purported collaborations were central to Defendants' motions for summary judgment, Defendants have failed to produce persuasive evidence that these anti-solicitation agreements related to collaborations or were pro-competitive.

IV. CONCLUSION

This Court has lived with this case for nearly three years, and during that time, the Court has reviewed a significant number of documents in adjudicating not only the substantive motions, but also the voluminous sealing requests. Having done so, the Court cannot conclude that the instant settlement falls within the range of reasonableness. As this Court stated in its summary judgment order, there is ample evidence of an overarching conspiracy between the seven Defendants, including "[t]he similarities in the various agreements, the small number of intertwining high-level executives who entered into and enforced the agreements, Defendants' knowledge about the other agreements, the sharing and benchmarking of confidential compensation information among Defendants and even between firms that did not have bilateral anti-solicitation agreements, along with Defendants' expansion and attempted expansion of the anti-solicitation agreements." ECF No. 771 at 7-8. Moreover, as discussed above and in this Court's class certification order, the evidence of Defendants' rigid wage structures and internal equity concerns, along with statements from Defendants' own executives, is likely to prove compelling in establishing the impact of the anti-solicitation agreements: a Class-wide depression of wages.

In light of this evidence, the Court is troubled by the fact that the instant settlement with Remaining Defendants is proportionally lower than the settlements with the Settled Defendants. This concern is magnified by the fact that the case evolved in Plaintiffs' favor since those settlements. At the time those settlements were reached, Defendants still could have defeated class certification before this Court, Defendants still could have successfully sought appellate review and reversal of any class certification, Defendants still could have prevailed on summary judgment, or Defendants still could have succeeded in their attempt to exclude Plaintiffs' principal expert. In contrast, the instant settlement was reached a mere month before trial was set to commence and after these opportunities for Defendants had evaporated. While the unpredictable nature of trial would have undoubtedly posed challenges for Plaintiffs, the exposure for Defendants was even more substantial, both in terms of the potential of more than $9 billion in damages and in terms of other collateral consequences, including the spotlight that would have been placed on the evidence discussed in this Order and other evidence and testimony that would have been brought to light. The procedural history and proximity to trial should have increased, not decreased, Plaintiffs' leverage from the time the settlements with the Settled Defendants were reached a year ago.

The Court acknowledges that Class counsel have been zealous advocates for the Class and have funded this litigation themselves against extraordinarily well-resourced adversaries. Moreover, there very well may be weaknesses and challenges in Plaintiffs' case that counsel cannot reveal to this Court. Nonetheless, the Court concludes that the Remaining Defendants should, at a minimum, pay their fair share as compared to the Settled Defendants, who resolved their case with Plaintiffs at a stage of the litigation where Defendants had much more leverage over Plaintiffs.

For the foregoing reasons, the Court DENIES Plaintiffs' Motion for Preliminary Approval of the settlements with Remaining Defendants. The Court further sets a Case Management Conference for September 10, 2014 at 2 p.m.

IT IS SO ORDERED.

Dated: August 8, 2014
LUCY H. KOH
United States District Judge

  1. Dr. Leamer was subject to vigorous attack in the initial class certification motion, and this Court agreed with some of Defendants' contentions with respect to Dr. Leamer and thus rejected the initial class certification motion. See ECF No. 382 at 33-43. [return]
  2. Defendants' motions in limine, Plaintiffs' motion to exclude testimony from certain experts, Defendants' motion to exclude testimony from certain experts, a motion to determine whether the per se or rule of reason analysis applied, and a motion to compel were pending at the time the settlement was reached. [return]
  3. Plaintiffs in the instant Motion represent that two of the letters are from non-Class members and that the third letter is from a Class member who may be withdrawing his objection. See ECF No. 920 at 18 n.11. The objection has not been withdrawn at the time of this Order. [return]
  4. Devine stated in his Opposition that the Opposition was designed to supersede a letter that he had previously sent to the Court. See ECF No. 934 n.2. The Court did not receive any letter from Devine. Accordingly, the Court has considered only Devine's Opposition. [return]
  5. Plaintiffs also assert that administration costs for the settlement would be $160,000. [return]
  6. Devine calculated that Class members would receive an average of $3,573. The discrepancy between this number and the Court's calculation may result from the fact that Devine's calculation does not account for the fact that 147 individuals have already opted out of the Class. The Court's calculation resulted from subtracting the requested attorneys' fees ($81,125,000), costs ($1,200,000), incentive awards ($400,000), and estimated administration costs ($160,000) from the settlement amount ($324,500,000) and dividing the resulting number by the total number of remaining class members (64,466). [return]
  7. If the Court were to deny any portion of the requested fees, costs, or incentive payments, this would increase individual Class members' recovery. If less than 4% of the Class were to opt out, that would also increase individual Class members' recovery. [return]
  8. One way to think about this is to set up the simple equation: 5/95 = $20,000,000/x. This equation asks the question of how much 95% would be if 5% were $20,000,000. Solving for x would result in $380,000,000. [return]
  9. On the same day, Mr. Campbell sent an email to Mr. Brin and to Larry Page (Google Co-Founder) stating, "Steve just called me again and is pissed that we are still recruiting his browser guy." ECF No. 428-13. Mr. Page responded "[h]e called a few minutes ago and demanded to talk to me." Id. [return]
  10. Mr. Jobs successfully expanded the anti-solicitation agreements to Macromedia, a company acquired by Adobe, both before and after Adobe's acquisition of Macromedia. [return]
  11. Adobe also benchmarked compensation off external sources, which supports Plaintiffs' theory of Class-wide impact and undermines Defendants' theory that the anti-solicitation agreements had only one off, non-structural effects. For example, Adobe pegged its compensation structure as a "percentile" of average market compensation according to survey data from companies such as Radford. ECF No. 804-17 at 4. Mr. Chizen explained that the particular market targets that Adobe used as benchmarks for setting salary ranges "tended to be software, high-tech, those that were geographically similar to wherever the position existed." ECF No. 962-7 at 22. This demonstrated that the salary structures of the various Defendants were linked, such that the effect of one Defendant's salary structure would ripple across to the other Defendants through external sources like Radford. [return]

Verilog Won & VHDL Lost? — You Be The Judge!

2014-08-14 08:00:00

This is an archived USENET post from John Cooley on a competitive comparison between VHDL and Verilog that was done in 1997.

I knew I hit a nerve. Usually when I publish a candid review of a particular conference or EDA product, I see around 85 replies in my e-mail "in" box. Buried in my review of the recent Synopsys Users Group meeting, I very tersely reported that 8 out of the 9 Verilog designers managed to complete the conference's design contest yet none of the 5 VHDL designers could. I apologized for the terseness and promised to do a detailed report on the design contest at a later date. Since publishing this, my e-mail "in" box has become a veritable Verilog/VHDL Beirut filling up with 169 replies! Once word leaked that the detailed contest write-up was going to be published in the DAC issue of "Integrated System Design" (formerly "ASIC & EDA" magazine), I started getting phone calls from the chairman of VHDL International, Mahendra Jain, and from the president of Open Verilog International, Bill Fuchs. A small army of hired-gun spin doctors (otherwise known as PR agents) followed with more phone calls. I went ballistic when VHDL columnist Larry Saunders approached the Editor-in-Chief of ISD for an advance copy of my design contest report. He felt I was "going to do a hatchet job on VHDL" and wanted to write a rebuttal that would follow my article... and all this was happening before I had even written one damned word of the article!

Because I'm an independent consultant who makes his living training and working in both HDLs, I'd rather not go through a VHDL Salem witch trial where I'm publicly accused of being secretly in league with the Devil to promote Verilog, thank you. Instead, I'm going to present everything that happened at the Design Contest, warts and all, and let you judge! At the end of the court evidence, I'll ask you, the jury, to write an e-mail reply which I can publish in my column in the follow-up "Integrated System Design".

The Unexpected Results

Contestants were given 90 minutes using either Verilog or VHDL to create a gate netlist for the fastest fully synchronous loadable 9-bit increment-by-3 decrement-by-5 up/down counter that generated even parity, carry and borrow.

Of the 9 Verilog designers in the contest, only 1 didn't get to a final gate level netlist because he tried to code a look-ahead parity generator. Of the 8 remaining, 3 had netlists that missed on functional test vectors. The 5 Verilog designers who got fully functional gate-level designs were:

   Larry Fiedler     NVidia               3.90 nsec     1147 gates
   Steve Golson      Trilobyte Systems    4.30 nsec     1909 gates
   Howard Landman    HaL Computer         5.49 nsec     1495 gates
   Mark Papamarcos   EDA Associates       5.97 nsec     1180 gates
   Ed Paluch         Paluch & Assoc.      7.85 nsec     1514 gates

The surprise was that, in the same amount of time, none of the 5 VHDL designers in the contest managed to produce any gate-level designs.

Not VHDL Newbies vs. Verilog Pros

The first reaction I get from the VHDL bigots (who weren't at the competition) is: "Well, this is obviously a case where Verilog veterans whipped some VHDL newbies. Big deal." Well, they're partially right. Many of those Verilog designers are damned good at what they do — but so are the VHDL designers!

I've known Prasad Paranjpe of LSI Logic for years. He has taught and still teaches VHDL with synthesis classes at U.C. Santa Cruz University Extension in the heart of Silicon Valley. He was VP of the Silicon Valley VHDL Local Users Group. He's been a full-time ASIC designer since 1987 and has designed real ASICs since 1990 using VHDL & Synopsys since rev 1.3c. Prasad's home e-mail address is "[email protected]" and his home phone is (XXX) XXX-VHDL. ASIC designer Jan Decaluwe has a history of contributing insightful VHDL and synthesis posts to ESNUG while at Alcatel and later as a founder of Easics, a European ASIC design house. (Their company motto: "Easics - The VHDL Design Company".) Another LSI Logic/VHDL contestant, Vikram Shrivastava, has used the VHDL/Synopsys design approach since 1992. These guys aren't newbies!

Creating The Contest

I followed a double-blind approach to putting together this design contest. That is, not only did I have Larry Saunders (a well-known VHDL columnist) and Yatin Trivedi (a well-known Verilog columnist), both of Seva Technologies, comment on the design contest — unknown to them, I had Ken Nelsen (a VHDL-oriented Methodology Manager from Synopsys) and Jeff Flieder (a Verilog-based designer from Ford Microelectronics) also help check the design contest for any conceptual or implementation flaws.

My initial concern in creating the contest was to not have a situation where the Synopsys Design Compiler could quickly complete the design by just placing down a DesignWare part. Yet, I didn't want to have contestants trying (and failing) to design some fruity, off-the-wall thingy that no one truly understood. Hence, I was restricted to "standard" designs that all engineers knew — but with odd parameters thrown in to keep DesignWare out of the picture. Instead of a simple up/down counter, I asked for an up-by-3 and down-by-5 counter. Instead of 8 bits, everything was 9 bits.

                                  recycled COUNT_OUT [8:0]
                     o---------------<---------------<-------------------o
                     |                                                   |
                     V                                                   |
               -------------                     --------                |
  DATA_IN -->-|   up-by-3   |->-----carry----->-| D    Q |->- CARRY_OUT  |
   [8:0]      |  down-by-5  |->-----borrow---->-| D    Q |->- BORROW_OUT |
              |             |                   |        |               |
       UP -->-|    logic    |                   |        |               |
     DOWN -->-|             |-o------->---------| D[8:0] |               |
               -------------  | new_count [8:0] | Q[8:0] |->-o---->------o
                              |                 |        |   |
                 o------<-----o        CLOCK ---|>       |   o->- COUNT_OUT
                 |                               --------           [8:0]
 new_count [8:0] |     -----------
                 |    |   even    |              --------
                 o-->-|  parity   |->-parity-->-| D    Q |->- PARITY_OUT
                      | generator |   (1 bit)   |        |
                       -----------           o--|>       |
                                             |   --------
                                   CLOCK ----o


Fig. 1) Basic block diagram outlining the design's functionality

The even PARITY, CARRY and BORROW requirements were thrown in to give the contestants some space to make significant architectural trade-offs that could mean the difference between winning and losing.

The counter loaded when UP and DOWN were both "low", and held its state when UP and DOWN were both "high" — exactly the opposite of what 99% of the world's loadable counters traditionally do.

                  UP  DOWN   DATA_IN    |    COUNT_OUT    
                 -----------------------------------------
                   0    0     valid     |   load DATA_IN
                   0    1   don't care  |     (Q - 5)
                   1    0   don't care  |     (Q + 3)
                   1    1   don't care  |   Q unchanged


Fig. 2) Loading and up/down counting specifications.  All I/O events
happen on the rising edge of CLOCK.

To spice things up a bit further, I chose to use the LSI Logic 300K ASIC library because wire loading & wire delay are significant factors in this technology. Having the "home library" advantage, one savvy VHDL designer, Prasad Paranjpe of LSI Logic, cleverly asked if the default wire loading model was required (he wanted to use a zero wire load model to save on timing!). I replied: "Nice try. Yes, the default wire model is required."

To let the focus be on design and not verification, contestants were given equivalent Verilog and VHDL testbenches provided by Yatin Trivedi & Larry Saunders' Seva Technologies. These testbenches threw the same 18 vectors at the Verilog/VHDL source code the contestants were creating, and if it passed, for contest purposes, their design was judged "functionally correct."

For VHDL, contestants had their choice of Synopsys VSS 3.2b and/or Cadence Leapfrog VHDL 2.1.4; for Verilog, contestants had their choice of Cadence Verilog-XL 2.1.2 or Chronologic VCS 2.3.2 plus their respective Verilog/VHDL design environments. (The CEO of Model Technology Inc., Bob Hunter, was too paranoid about the possibility of Synopsys employees seeing his VHDL to allow it in the contest.) LCB 300K rev 3.1A.1.1.101 was the LSI Logic library.

I had a concern that some designers might not know that an XOR reduction tree is how one generates parity — but Larry, Yatin, Ken & Jeff all agreed that any engineer not knowing this shouldn't be helped to win a design contest. As a last minute hint, I put in every contestant's directory an "xor.readme" file that named the two XOR gates available in LSI 300K library (EO and EO3) plus their drive strengths and port lists.
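
To make the hint concrete, here is a minimal behavioral sketch (mine, not contest code; the module and signal names are assumptions) of the two ways a contestant could express the parity function: the one-line XOR reduction, which synthesis is free to restructure, and the same function parenthesized into groups of three as a nudge toward 3-input XOR cells such as the EO3.

      // Hypothetical sketch, not contest code: two equivalent ways to
      // compute the even-parity bit of a 9-bit value.
      module parity9 (
        input  wire [8:0] new_count,
        output wire       parity_flat,   // one-line XOR reduction
        output wire       parity_tree    // grouped to hint 3-input XORs (EO3)
      );
        assign parity_flat = ^new_count;

        assign parity_tree = (new_count[0] ^ new_count[1] ^ new_count[2]) ^
                             (new_count[3] ^ new_count[4] ^ new_count[5]) ^
                             (new_count[6] ^ new_count[7] ^ new_count[8]);
      endmodule

Whether Design Compiler actually maps the grouped form onto EO3 cells depends on the library and constraints, which is exactly the gamble Howard Landman describes later when he mentions using parentheses in hopes of getting 3-input XOR gates.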

To be friendly synthesis-wise, I let the designers keep the unrealistic Synopsys default setting of all inputs having infinite drive strength and all outputs driving zero loads.

The contest took place in three sessions over the same day. To keep things equal, my guiding philosophy throughout these sessions was to conscientiously not fix/improve anything between sessions — no matter how frustrating!

After all that was said & done, Larry & Yatin thought that the design contest would be too easy while Ken & Jeff thought it would have just about the right amount of complexity. I asked all four if they saw any Verilog or VHDL specific "gotchas" with the contest; all four categorically said "no."

Murphy's Law

Once the contest began, Murphy's Law — "that which can go wrong, will go wrong" — prevailed. Because we couldn't get the SUN and HP workstations until a terrifying 3 days before the contest, I lived through a nightmare domino effect on getting all the Verilog, VHDL, Synopsys and LSI libraries in and installed. Nobody could cut keys for the software until the machine ID's were known — and this wasn't until 2 days before the contest! (As it was, I had to drop the HP machines because most of the EDA vendors couldn't cut software keys for HP machines as fast as they could for SUN workstations.)

The LSI 300K Libraries didn't arrive until an hour before the contest began. The Seva guys found and fixed a bug in the Verilog testbench (that didn't exist in the VHDL testbench) some 15 minutes before the contest began.

Some 50 minutes into the first design session, one engineer's machine crashed — which also happened to be the license server for all the Verilog simulation software! (Luckily, by this time all the Verilog designers were deep into the synthesis stage.) Unfortunately, the poor designer who had his machine crash couldn't be allowed to redo the contest in a following session because of his prior knowledge of the design problem. This machine was rebooted and used solely as a license server for the rest of the contest.

The logistics nightmare once again reared its ugly head when two designers innocently asked: "John, where are your Synopsys manuals?" Inside I screamed to myself: "OhMyGod! OhMyGod! OhMyGod!"; outside I calmly replied: "There are no manuals for any software here. You have to use the online docs available."

More little gremlins danced in my head when I realized that six of the eight data books that the LSI lib person brought weren't for the exact LCB 300K library we were using — these data books would be critical for anyone trying to hand build an XOR reduction tree — and one Verilog contestant had just spent ten precious minutes reading a misleading data book! (There were two LCB 300K, one LCA 300K and five LEA 300K databooks.) Verilog designer Howard Landman of HaL Computer noted: "I probably wasted 15 minutes trying to work through this before giving up and just coding functional parity — although I used parentheses in hopes of Synopsys using 3-input XOR gates."

Then, just when it seemed things couldn't get worse, everyone got to discover that when Synopsys's Design Compiler runs for the first time in a new account, it takes a good 10 to 15 minutes to build your very own personal DesignWare cache. Verilog contestant Ed Paluch, a consultant, noted: "I thought that first synthesis run building [expletive deleted] DesignWare caches would never end! It felt like days!"

Although, in my opinion, none of these headaches compromised the integrity of the contest, at the time I had to continually remind myself: "To keep things equal, I can not fix nor improve anything no matter how frustrating."

Judging The Results

Because I didn't want to be in the business of judging source code intent, all judging was based solely on whether the gate level passed the previously described 18 test vectors. Once done, the design was read into the Synopsys Design Compiler and all constraints were removed. Then I applied the command "clocks_at 0, 6, 12 clock" and then took the longest path as determined by "report_timing -path full -delay max -max_paths 12" as the final basis for comparing designs — determining that Verilog designer Larry Fiedler of NVidia won with a 1147 gate design timed at 3.90 nsec.

      reg [9:0] cnt_up, cnt_dn;   reg [8:0] count_nxt;

      always @(posedge clock)
      begin
        cnt_dn = count_out - 3'b 101;  // synopsys label add_dn
        cnt_up = count_out + 2'b 11;   // synopsys label add_up

        case ({up,down})
           2'b 00 : count_nxt = data_in;
           2'b 01 : count_nxt = cnt_dn;
           2'b 10 : count_nxt = cnt_up;
           2'b 11 : count_nxt = 9'bX;  // SPEC NOT MET HERE!!!
          default : count_nxt = 9'bX;  // avoiding ambiguity traps
        endcase

        parity_out  <= ^count_nxt;
        carry_out   <= up & cnt_up[9];
        borrow_out  <= down & cnt_dn[9];
        count_out   <= count_nxt;
      end


Fig. 3) The winning Verilog source code.  (Note that it failed to meet
the spec of holding its state when UP and DOWN were both high.)
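
For comparison, here is a minimal sketch of a version whose 2'b11 case holds state as Fig. 2 requires. It is not any contestant's entry and has not been run against the contest testbench; the port and signal names follow Fig. 3, the module name and port declarations are assumptions, and the carry/borrow handling is kept exactly as in Fig. 3.

      // Hypothetical sketch, not contest code: same structure as Fig. 3,
      // but the UP=1/DOWN=1 case holds the count as Fig. 2 requires.
      module updown_counter (
        input  wire       clock, up, down,
        input  wire [8:0] data_in,
        output reg  [8:0] count_out,
        output reg        parity_out, carry_out, borrow_out
      );
        reg [9:0] cnt_up, cnt_dn;       // extra bit captures carry/borrow
        reg [8:0] count_nxt;

        always @(posedge clock)
        begin
          cnt_dn = {1'b0, count_out} - 10'd5;    // down-by-5
          cnt_up = {1'b0, count_out} + 10'd3;    // up-by-3

          case ({up, down})
            2'b00  : count_nxt = data_in;        // load
            2'b01  : count_nxt = cnt_dn[8:0];    // count down by 5
            2'b10  : count_nxt = cnt_up[8:0];    // count up by 3
            2'b11  : count_nxt = count_out;      // hold (the case Fig. 3 missed)
            default: count_nxt = 9'bx;           // x/z inputs: don't care
          endcase

          parity_out <= ^count_nxt;              // even parity via XOR reduction
          carry_out  <= up   & cnt_up[9];
          borrow_out <= down & cnt_dn[9];
          count_out  <= count_nxt;
        end
      endmodule

The 9'bX in Fig. 3's 2'b11 arm hands synthesis a don't-care, which can only help area and timing, so holding state would likely have cost some gates and possibly some speed; how much is a question for Design Compiler, not for this sketch.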

Since judging was open to any and all who wanted to be there, Kurt Baty, a Verilog contestant and well-respected design consultant, registered a vocal double surprise because he knew his design was of comparable speed but had failed to pass the 18 test vectors. (Kurt's a good friend — I really enjoyed harassing him over this discovery — especially since he had bragged to so many people about how he was going to win this contest!) An on-the-spot investigation revealed that Kurt had accidentally saved the wrong design in the final minute of the contest. Further investigation then revealed that the 18 test vectors didn't cover all of the counter's specified conditions. Larry's "winning" gate-level Verilog-based design had failed to meet the spec of holding its state when UP and DOWN were high — even though his design had successfully passed the 18 test vectors!

If human visual inspection of the Verilog/VHDL source code, to subjectively check for places the test vectors might have missed, had been part of the judging criteria, Verilog designer Steve Golson would have won. Once again, I had to reiterate that all designs which passed the testbench vectors were considered "functionally correct" by definition.

What The Contestants Thought

Despite NASA VHDL designer Jeff Solomon's "I didn't like the idea of taking the traditional concept of counters and warping it to make a contest design problem", the remaining twelve contestants really liked the architectural flexibility of the up-by-3/down-by-5, 9-bit, loadable, synchronous counter with even parity, carry and borrow. Verilog designer Mark Papamarcos summed up the majority opinion with: "I think that the problem was pretty well devised. There was a potential resource sharing problem, some opportunities to schedule some logic to evaluate concurrently with other logic, etc. When I first saw it, I thought it would be very easy to implement and I would have lots of time to tune. I also noticed the 2- and 3-input XORs in the top-level directory, figured that it might be somehow relevant, but quickly dismissed any clever ideas when I ran into problems getting the vectors to match."

Eleven of the contestants were tempted by the apparent correlation between known parity and the adding/subtracting of odd numbers. Only one Verilog designer, Oren Rubinstein of Hewlett-Packard Canada, committed to this strategy but ran out of time. Once home, Kurt Baty helped Oren conceptually finish his design while Prasad Paranjpe helped with the final synthesis. It took about 7 hours of brain time and 8 hours of coding/sim/synth time (15 hours total) to get a final design of 3.05 nsec & 1988 gates. Observing that it took 10x the originally estimated 1.5 hours to get a 22% improvement in speed, Oren commented: "Like real life, it's impossible to create accurate engineering design schedules."

Two of the VHDL designers, Prasad Paranjpe of LSI Logic and Jan Decaluwe of Easics, both complained of having to deal with type conversions in VHDL. Prasad confessed: "I can't believe I got caught on a simple typing error. I used IEEE std_logic_arith, which requires use of unsigned & signed subtypes, instead of std_logic_unsigned." Jan agreed and added: "I ran into a problem with VHDL or VSS (I'm still not sure.) This case statement doesn't analyze: "subtype two_bits is unsigned(1 downto 0); case two_bits'(up & down)..." But what worked was: "case two_bits'(up, down)..." Finally, I solved this problem by assigning the concatenation first to an auxiliary variable."

Verilog competitor Steve Golson outlined the first-get-a-working-design-and-then-tweak-it-in-synthesis strategy that most of the Verilog contestants pursued with: "As I recall I had some stupid typos which held me up; also I had difficulty with parity and carry/borrow. Once I had a correctly functioning baseline design, I began modifying it for optimal synthesis. My basic idea was to split the design into four separate modules: the adder, the 4:1 MUXes, the XOR logic (parity and carry/borrow), and the top counter module which contains only the flops and instances of the other three modules. My strategy was to first compile the three (purely combinational) submodules individually. I used a simple "max_delay 0 all_outputs()" constraint on each of them. The top-level module got the proper clock constraint. Then "dont_touch" these designs, and compile the top counter module (this just builds the flops). Then to clean up I did an "ungroup -all" followed by a "compile -incremental" (which shaved almost 1 nsec off my critical path.)"

Typos and panic hurt the performance of a lot of contestants. Verilog designer Daryoosh Khalilollahi of National Semiconductor said: "I thought I would not be able to finish it on time, but I just made it. I lost some time because I would get a Verilog syntax error that turned up because I had one extra file in my Verilog "include" file (verilog -f include) which was not needed." Also, Verilog designer Howard Landman of HaL Computer never realized he had put both a complete behavioral and a complete hand-instanced parity tree in his source Verilog. (Synopsys Design Compiler just optimized one of Howard's dual parity trees away!)

On average, each Verilog designer managed to get two to five synthesis runs completed before running out of time. Only two VHDL designers, Jeff Solomon and Jan Decaluwe, managed to start (but not complete) one synthesis run. In both cases I disqualified them from the contest for not making the deadline but let their synthesis runs attempt to finish. Jan arrived a little late, so we gave Jan's run some added time before disqualifying him. His unfinished run had to be killed after 21 minutes because another group of contestants were arriving. (Incidentally, I had accidentally given the third session an extra 6 design minutes because of a goof on my part. No Verilog designers were in this session, but VHDL designers Jeff Solomon, Prasad Paranjpe, Vikram Shrivastava plus Ravi Srinivasan of Texas Instruments all benefited from this mistake.) Since Jeff was in the last session, I gave him all the time needed for his run to complete. After an additional 17 minutes (total) he produced a gate-level design that timed out to 15.52 nsec. After a total of 28 more minutes he got the timing down to 4.46 nsec but his design didn't pass functional vectors. He had an error somewhere in his VHDL source code.

Failed Verilog designer Kurt Baty closed with: "John, I look forward to next year's design contest in whatever form or flavor it takes, and a chance to redeem my honor."

Closing Arguments To The Jury

Closing arguments the VHDL bigots may make in this trial might be: "What 14 engineers do isn't statistically significant. Even the guy who ran this design contest admitted all sorts of last-minute goofs with it. You had a workstation crash, no manuals & misleading LSI databooks. The test vectors were incomplete. One key VHDL designer ran into a Synopsys VHDL simulator bug after arriving late to his session. The Verilog design which won this contest didn't even meet the spec completely! In addition, this contest wasn't put together to be a referendum on whether Verilog or VHDL is the better language to design in — hence it may miss some major issues."

The Verilog bigots might close with: "No engineers work under the contrived conditions one may want for an ideal comparison of Verilog & VHDL. Fourteen engineers may or may not be statistically significant, but where there's smoke, there's fire. I saw all the classical problems engineers encounter in day-to-day designing here. We've all dealt with workstation crashes, bad revision control, bugs in tools, poor planning and incomplete testing. It's because of these realities I think this design contest was perfect to determine how each HDL measures up in real life. And Verilog won hands down!"

The jury's verdict will be seen in the next "Integrated System Design".

You The Jury...

You, the jury, are now asked to please take ten minutes to think about what you have just read and, in 150 words or less, send your thoughts to me at "[email protected]". Please don't send me "VHDL sucks." or "Verilog must die!!!" — but personal experiences and/or observations that add to the discussion. It's OK to have strong/violent opinions, just back them with something more than hot air. (Since I don't want to be in the business of chasing down permissions, my default setting is that whatever you send me is completely publishable. If you wish to send me letters with a mix of publishable and non-publishable material, CLEARLY indicate which is which.) I will not only be reprinting the replies, I'll also be publishing stats on how many people reported each type of specific opinion/experience.

John Cooley
Part Time EDA Consumer Advocate
Full Time ASIC, FPGA & EDA Design Consultant

P.S. In replying, please indicate your job, your company, whether you use Verilog or VHDL, why, and for how long. Also, please DO NOT copy this article back to me — I know why you're replying! :^)

Data-driven bug finding

2014-04-06 08:00:00

I can't remember the last time I went a whole day without running into a software bug. For weeks, I couldn't invite anyone to Facebook events due to a bug that caused the invite button to not display on the invite screen. Google Maps has been giving me illegal and sometimes impossible directions ever since I moved to a small city. And Google Docs regularly hangs when I paste an image in, giving me a busy icon until I delete the image.

It's understandable that bugs escape testing. Testing is hard. Integration testing is harder. End to end testing is even harder. But there's an easier way. A third of bugs like this – bugs I run into daily – could be found automatically using analytics.

If you think finding bugs with analytics sounds odd, ask a hardware person about performance counters. Whether or not they're user accessible, every ASIC has analytics to allow designers to figure out what changes need to be made for the next generation chip. Because people look at perf counters anyway, they notice when a forwarding path never gets used, when way prediction has a strange distribution, or when the prefetch buffer never fills up. Unexpected distributions in analytics are a sign of a misunderstanding, which is often a sign of a bug1.

Facebook logs all user actions. That can be used to determine user dead ends. Google Maps reroutes after “wrong” turns. That can be used to determine when the wrong turns are the result of bad directions. Google Docs could track all undos2. That could be used to determine when users run into misfeatures or bugs3.

I understand why it might feel weird to borrow hardware practices for software development. For the most part, hardware tools are decades behind software tools. As examples: current hardware tools include simulators on Linux that are only half ported from Windows, resulting in some text boxes requiring forward slashes while others require backslashes; libraries that fail to compile with `default_nettype none`4; and components that come with support engineers because they're expected to be too buggy to work without full-time people supporting any particular use.

But when it comes to testing, hardware is way ahead of software. When I write software, fuzzing is considered a state of the art technique. But in hardware land, fuzzing doesn't have a special name. It's just testing, and why should there be a special name for "testing that uses randomness"? That's like having a name for "testing by running code". Well over a decade ago, I did hardware testing via a tool that used constrained randomness on inputs and symbolic execution, with state reduction via structural analysis. For small units, the tool was able to generate a formal proof of correctness. For larger units, the tool automatically generated coverage statistics and used them to exhaustively search over as diverse a state space as possible. In the case of a bug, a short, easy to debug counterexample would be produced. And hardware testing tools have gotten a lot better since then.
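
For software folks who haven't seen this style of testing, here's a toy Python sketch of the constrained-random-plus-coverage flavor; it is not the tool described above (which also did symbolic execution and formal proofs), and the buggy adder and coverage bins are made up:

import random

def dut_add(a, b):
    # "Device under test": a deliberately buggy 4-bit adder model.
    return 0 if (a, b) == (7, 8) else (a + b) & 0xF

def ref_add(a, b):
    # Reference model to check against.
    return (a + b) & 0xF

covered = set()  # coverage bins: (carry out?, operands equal?)
random.seed(0)
for _ in range(10_000):
    # Constrained random: bias toward boundary values, which tend to expose bugs.
    a = random.choice([0, 1, 7, 8, 15, random.randrange(16)])
    b = random.choice([0, 1, 7, 8, 15, random.randrange(16)])
    covered.add((a + b > 15, a == b))
    if dut_add(a, b) != ref_add(a, b):
        print(f"counterexample: a={a} b={b}")
        break

print(f"coverage bins hit: {len(covered)}/4")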

But in software land, I'm lucky if a random project I want to contribute to has tests at all. When tests exist, they're usually handwritten, with all the limitations that implies. Once in a blue moon, I'm pleasantly surprised to find that a software project uses a test framework which has 1% of the functionality that was standard a decade ago in chip designs.

Considering the relative cost of hardware bugs vs. software bugs, it's not too surprising that a lot more effort goes into hardware testing. But here's a case where there's almost no extra effort. You've already got analytics measuring the conversion rate through all sorts of user funnels. The only new idea here is that clicking on an ad or making a purchase isn't the only type of conversion you should measure. Following directions at an intersection is a conversion, not deleting an image immediately after pasting it is a conversion, and using a modal dialogue box after opening it up is a conversion.

Of course, whether it's ad click conversion rates or cache hit rates, blindly optimizing a single number will get you into a local optimum that will hurt you in the long run, and setting thresholds for conversion rates that should send you an alert is nontrivial. There's a combinatorially large space of user actions, so it takes judicious use of machine learning to figure out reasonable thresholds. That's going to cost time and effort. But think of all the effort you put into optimizing clicks. You probably figured out, years ago, that replacing boring text with giant pancake buttons gives you 3x the clickthrough rate; you're now down to optimizing 1% here and 2% there. That's great, and it's a sign that you've captured all the low hanging fruit. But what do you think the future clickthrough rate is when a user encounters a show-stopping bug that prevents any forward progress on a modal dialogue box?

If this sounds like an awful lot of work, find a known bug that you've fixed, and grep your log data for users who ran into that bug. Alienating those users by providing a profoundly broken product is doing a lot more damage to your clickthrough rate than having a hard to find checkout button, and the exact same process that led you to that gigantic checkout button can solve your other problem, too. Everyone knows that adding 200ms of load time can cause 20% of users to close the window. What do you think the effect is of exposing them to a bug that takes 5,000ms of user interaction to work around?

If that's worth fixing, pull out Scalding, Dremel, Cascalog, or whatever your favorite data processing tool is. Start looking for user actions that don't make sense. Start looking for bugs.

Thanks to Pablo Torres for catching a typo in this post


  1. It's not that all chip design teams do this systematically (although they should), but that people are looking at the numbers anyway, and will see anomalies. [return]
  2. Undos aren't just literal undos; pasting an image in and then deleting it afterwards because it shows a busy icon forever counts, too. [return]
  3. This is worse than it sounds. In addition to producing a busy icon forever in the doc, it disconnects that session from the server, which is another thing that could be detected: it's awfully suspicious if a certain user action is always followed by a disconnection.

    Moreover, both of these failure modes could have been found with fuzzing, since they should never happen. Bugs are hard enough to find that defense in depth is the only reasonable solution.

    [return]
  4. If you talk to a hardware person, call this verification instead of testing, or they'll think you're talking about DFT, testing silicon for manufacturing defects, or some other weird thing with no software analogue. [return]

Editing binaries

2014-03-23 08:00:00

Editing binaries is a trick that comes in handy a few times a year. You don't often need to, but when you do, there's no alternative. When I mention patching binaries, I get one of two reactions: complete shock or no reaction at all. As far as I can tell, this is because most people have one of these two models of the world:

  1. There exists source code. Compilers do something to source code to make it runnable. If you change the source code, different things happen.

  2. There exists a processor. The processor takes some bits and decodes them to make things happen. If you change the bits, different things happen.

If you have the first view, breaking out a hex editor to modify a program is the action of a deranged lunatic. If you have the second view, editing binaries is the most natural thing in the world. Why wouldn't you just edit the binary? It's often the easiest way to get what you need.

For instance, you're forced to do this all the time if you use a non-Intel non-AMD x86 processor. Instead of checking CPUID feature flags, programs will check the CPUID family, model, and stepping to determine features, which results in incorrect behavior on non-standard CPUs. Sometimes you have to do an edit to get the program to use the latest SSE instructions and sometimes you have to do an edit to get the program to run at all. You can try filing a bug, but it's much easier to just edit your binaries.

Even if you're running on a mainstream Intel CPU, these tricks are useful when you run into bugs in closed sourced software. And then there are emergencies.

The other day, a DevOps friend of mine at a mid-sized startup told me about the time they released an internal alpha build externally, which caused their auto-update mechanism to replace everyone's working binary with a buggy experimental version. It only took a minute to figure out what happened. Updates gradually roll out to all users over a couple days, which meant that the bad version had only spread to 1 / (60*24*2) = 0.03% of all users. But they couldn't push the old version into the auto-updater because the client only accepts updates from higher numbered versions. They had to go through the entire build and release process (an hour long endeavor) just to release a version that was identical to their last good version. If it had occurred to anyone to edit the binary to increment the version number, they could have pushed out a good update in a minute instead of an hour, which would have kept the issue from spreading to more than 0.06% of their users, instead of sending 2% of their users a broken update1.

This isn't nearly as hard as it sounds. Let's try an example. If you're going to do this sort of thing regularly, you probably want to use a real disassembler like IDA2. But, you can get by with simple tools if you only need to do this every once in a while. I happen to be on a Mac that I don't use for development, so I'm going to use lldb for disassembly and HexFiend to edit this example. Gdb, otool, and objdump also work fine for quick and dirty disassembly.

Here's a toy code snippet, wat-arg.c, that should be easy to binary edit:

#include <stdio.h>

int main(int argc, char **argv) {
  if (argc > 1) {
    printf("got an arg\n");
  } else {
    printf("no args\n");
  }
}

If we compile this and then launch lldb on the binary and step into main, we can see the following machine code:

$ lldb wat-arg
(lldb) breakpoint set -n main
Breakpoint 1: where = original`main, address = 0x0000000100000ee0
(lldb) run
(lldb) disas -b -p -c 20
;  address       hex opcode            disassembly
-> 0x100000ee0:  55                    pushq  %rbp
   0x100000ee1:  48 89 e5              movq   %rsp, %rbp
   0x100000ee4:  48 83 ec 20           subq   $32, %rsp
   0x100000ee8:  c7 45 fc 00 00 00 00  movl   $0, -4(%rbp)
   0x100000eef:  89 7d f8              movl   %edi, -8(%rbp)
   0x100000ef2:  48 89 75 f0           movq   %rsi, -16(%rbp)
   0x100000ef6:  81 7d f8 01 00 00 00  cmpl   $1, -8(%rbp)
   0x100000efd:  0f 8e 16 00 00 00     jle    0x100000f19               ; main + 57
   0x100000f03:  48 8d 3d 4c 00 00 00  leaq   76(%rip), %rdi            ; "got an arg\n"
   0x100000f0a:  b0 00                 movb   $0, %al
   0x100000f0c:  e8 23 00 00 00        callq  0x100000f34               ; symbol stub for: printf
   0x100000f11:  89 45 ec              movl   %eax, -20(%rbp)
   0x100000f14:  e9 11 00 00 00        jmpq   0x100000f2a               ; main + 74
   0x100000f19:  48 8d 3d 42 00 00 00  leaq   66(%rip), %rdi            ; "no args\n"
   0x100000f20:  b0 00                 movb   $0, %al
   0x100000f22:  e8 0d 00 00 00        callq  0x100000f34               ; symbol stub for: printf

As expected, we load a value, compare it to 1 with cmpl $1, -8(%rbp), and then print got an arg or no args depending on which way we jump as a result of the compare.

$ ./wat-arg
no args
$ ./wat-arg 1
got an arg

If we open up a hex editor and change 81 7d f8 01 00 00 00; cmpl $1, -8(%rbp) to 81 7d f8 06 00 00 00; cmpl $6, -8(%rbp), that should cause the program to check for at least 6 args instead of at least 1:

Replace cmpl $1 with cmpl $6 in the hex editor

$ ./wat-arg
no args
$ ./wat-arg 1
no args
$ ./wat-arg 1 2
no args
$ ./wat-arg 1 2 3 4 5 6 7 8
got an arg
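
If you'd rather script the edit than click around in a hex editor, the same patch is a few lines of Python. This is a sketch that assumes the seven-byte sequence appears exactly once in the binary; on a recent macOS you may also need to re-sign the patched file:

# patch-wat-arg.py: change `cmpl $1, -8(%rbp)` to `cmpl $6, -8(%rbp)`.
old = bytes.fromhex("817df801000000")
new = bytes.fromhex("817df806000000")

data = open("wat-arg", "rb").read()
assert data.count(old) == 1, "expected exactly one match"
open("wat-arg-patched", "wb").write(data.replace(old, new))
# Remember to chmod +x wat-arg-patched before running it.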

Simple! If you do this a bit more, you'll soon get in the habit of patching in 903 to overwrite things with NOPs. For example, if we replace 0f 8e 16 00 00 00; jle and e9 11 00 00 00; jmpq with 90, we get the following:

   0x100000ee1:  48 89 e5              movq   %rsp, %rbp
   0x100000ee4:  48 83 ec 20           subq   $32, %rsp
   0x100000ee8:  c7 45 fc 00 00 00 00  movl   $0, -4(%rbp)
   0x100000eef:  89 7d f8              movl   %edi, -8(%rbp)
   0x100000ef2:  48 89 75 f0           movq   %rsi, -16(%rbp)
   0x100000ef6:  81 7d f8 01 00 00 00  cmpl   $1, -8(%rbp)
   0x100000efd:  90                    nop
   0x100000efe:  90                    nop
   0x100000eff:  90                    nop
   0x100000f00:  90                    nop
   0x100000f01:  90                    nop
   0x100000f02:  90                    nop
   0x100000f03:  48 8d 3d 4c 00 00 00  leaq   76(%rip), %rdi            ; "got an arg\n"
   0x100000f0a:  b0 00                 movb   $0, %al
   0x100000f0c:  e8 23 00 00 00        callq  0x100000f34               ; symbol stub for: printf
   0x100000f11:  89 45 ec              movl   %eax, -20(%rbp)
   0x100000f14:  90                    nop
   0x100000f15:  90                    nop
   0x100000f16:  90                    nop
   0x100000f17:  90                    nop
   0x100000f18:  90                    nop
   0x100000f19:  48 8d 3d 42 00 00 00  leaq   66(%rip), %rdi            ; "no args\n"
   0x100000f20:  b0 00                 movb   $0, %al
   0x100000f22:  e8 0d 00 00 00        callq  0x100000f34               ; symbol stub for: printf

Note that since we replaced a couple of multi-byte instructions with single byte instructions, the program now has more total instructions.

$ ./wat-arg
got an arg
no args

Other common tricks include patching in cc (int 3) to cause a debug breakpoint, cd n to invoke an arbitrary interrupt handler, knowing which bit to change to flip the polarity of a compare or jump, etc. These things are all detailed in the Intel architecture manuals, but the easiest way to learn these is to develop the muscle memory for them one at a time.

Have fun!


  1. I don't actually recommend doing this in an emergency if you haven't done it before. Pushing out a known broken binary that leaks details from future releases is bad, but pushing out an update that breaks your updater is worse. You'll want, at a minimum, a few people who create binary patches in their sleep to code review the change to make sure it looks good, even after running it on a test client.

    Another solution, not quite as "good", but much less dangerous, would have been to disable the update server until the new release was ready.

    [return]
  2. If you don't have $1000 to spare, r2 is a nice, free, tool with IDA-like functionality. [return]
  3. on x86 [return]

That bogus gender gap article

2014-03-09 08:00:00

Last week, Quartz published an article titled “There is no gender gap in tech salaries”. That resulted in linkbait copycat posts all over the internet, from obscure livejournals to Smithsonian.com. The claims are awfully strong, considering that the main study cited only looked at people who graduated with a B.S. exactly one year ago, not to mention the fact that the study makes literally the opposite claim.

Let's look at the evidence from the AAUW study that all these posts cite.

Who are you going to believe, me or your lying eyes?

Looks like women make 88% of what men do in “engineering and engineering technology” and 77% of what men do in “computer and information sciences”.

The study controls for a number of factors to try to find the source of the pay gap. It finds that after controlling for self-reported hours worked, type of employment, and quality of school, “over one-third of the pay gap cannot be explained by any of these factors and appears to be attributable to gender alone”. One-third is not zero, nor is one-third of 12% or 23%. If that sounds small, consider an average raise in the post-2008 economy and how many years of experience that one-third of 23% turns into.

The Quartz article claims that, since the entire gap can be explained by some variables, the gap is by choice. In fact, the study explicitly calls out that view as being false, citing Stender v. Lucky Stores and a related study1, saying that “The case illustrates how discrimination can play a role in the explained portion of the pay gap when employers mistakenly assume that female employees prefer lower-paid positions traditionally held by women and --intentionally or not--place men and women into different jobs, ensuring higher pay for men and lower pay for women”. Women do not, in fact, just want lower paying jobs; this is, once again, diametrically opposed to the claims in the Quartz article.

Note that the study selectively controls for factors that reduce the pay gap, but not for factors that increase it. For instance, the study notes that “Women earn higher grades in college, on average, than men do, so academic achievement does not help us understand the gender pay gap”. Adjusting for grades would increase the pay gap; adjusting for all possible confounding factors, not only the factors that reduce the gap, would only make the adjusted pay gap larger.

The AAUW study isn't the only evidence the Quartz post cites. To support the conclusion that “Despite strong evidence suggesting gender pay equality, there is still a general perception that women earn less than men do”, the Quartz author cites three additional pieces of evidence. First, the BLS figure that, “when measured hourly, not annually, the pay gap between men and women is 14% not 23%”; 14% is not 0%. Second, a BLS report that indicates that men make more than women, cherry picking a single figure where women do better than men (“women who work between 30 and 39 hours a week … see table 4”); this claim is incorrect2. Third, a study from the 80s which is directly contradicted by the AAUW report from 2012; the older study indicates that cohort effects are responsible for the gender gap, but the AAUW report shows a gender gap despite studying only a single cohort.

The Smithsonian Mag published a correction in response to criticism about their article, but most of the misinformed articles remain uncorrected.

It's clear that the author of the Quartz piece had an agenda in mind, picked out evidence that supported that agenda, and wrote a blog post. A number of bloggers picked up the post and used its thesis as link bait to drive hits to their sites, without reading any of the cited evidence. If this is how “digitally native news” works, I'm opting out.

If you liked reading this, you might also enjoy this post on the interaction of markets with discrimination, and this post, which has a very partial explanation of why so many people drop out of science and engineering.

Updates

Update: A correction! I avoided explicitly linking to the author of the original article, because I find the sort of twitter insults and witch hunts that often pop up to be unconstructive, and this is really about what's right and not who's right. The author obviously disagrees because I saw no end of insults until I blocked the author.

Charlie Clarke was kind enough to wade through the invective and decode the author's one specific claim about my supposed illiteracy: that the 111% figure in the footnote was not rounded from 110.3 to 111. It turns out that, instead of rounding 110.3 to 111, the author of the article cited the wrong source entirely, and the other source just happened to have a number that was similar to 111.


  1. There's plenty of good news in this study. The gender gap has gotten much smaller over the past forty years. There's room for a nuanced article that explores why things improved, and why certain aspects have improved while others have remained stubbornly stuck in the 70s. I would love to read that article. [return]
  2. The Quartz article claims that “women who work 30 to 39 hours per week make 111% of what men make (see table 4)”. Table 4 is a breakdown of part-time workers. There is no 111% anywhere in the table, unless 110.3% is rounded to 111%; perhaps the author is referring to the racial breakdown in the table, which indicates that among Asian part-time workers, women earn 110.3% of what men do per hour. Note that Table 3, showing a breakdown of full-time workers (who are the vast majority of workers) indicates that women earn much less than men when working full time. To find a figure that supports the author's agenda, the author had to not only look at part time workers, but only look at part-time Asian women, and then round .3% up to 1%. [return]

That time Oracle tried to have a professor fired for benchmarking their database

2014-03-05 08:00:00

In 1983, at the University of Wisconsin, Dina Bitton, David DeWitt, and Carolyn Turbyfill created a database benchmarking framework. Some of their results included (lower is better):

Join without indices

system joinAselB joinABprime joinCselAselB
U-INGRES 10.2 9.6 9.4
C-INGRES 1.8 2.6 2.1
ORACLE > 300 > 300 > 300
IDMnodac > 300 > 300 > 300
IDMdac > 300 > 300 > 300
DIRECT 10.2 9.5 5.6
SQL/DS 2.2 2.2 2.1

Join with indices, primary (clustered) index

system joinAselB joinABprime joinCselAselB
U-INGRES 2.11 1.66 9.07
C-INGRES 0.9 1.71 1.07
ORACLE 7.94 7.22 13.78
IDMnodac 0.52 0.59 0.74
IDMdac 0.39 0.46 0.58
DIRECT 10.21 9.47 5.62
SQL/DS 0.92 1.08 1.33

Join with indices, secondary (non-clustered) index

system joinAselB joinABprime joinCselAselB
U-INGRES 4.49 3.24 10.55
C-INGRES 1.97 1.80 2.41
ORACLE 8.52 9.39 18.85
IDMnodac 1.41 0.81 1.81
IDMdac 1.19 0.59 1.47
DIRECT 10.21 9.47 5.62
SQL/DS 1.62 1.4 2.66

Projection (duplicate tuples removed)

system 100/10000 1000/10000
U-INGRES 64.6 236.8
C-INGRES 26.4 132.0
ORACLE 828.5 199.8
IDMnodac 29.3 122.2
IDMdac 22.3 68.1
DIRECT 2068.0 58.0
SQL/DS 28.8 28.0

Aggregate without indices

system MIN scalar MIN agg fn 100 parts SUM agg fun 100 parts
U-INGRES 40.2 176.7 174.2
C-INGRES 34.0 495.0 484.4
ORACLE 145.8 1449.2 1487.5
IDMnodac 32.0 65.0 67.5
IDMdac 21.2 38.2 38.2
DIRECT 41.0 227.0 229.5
SQL/DS 19.8 22.5 23.5

Aggregate with indices

system MIN scalar MIN agg fn 100 parts SUM agg fun 100 parts
U-INGRES 41.2 186.5 182.2
C-INGRES 37.2 242.2 254.0
ORACLE 160.5 1470.2 1446.5
IDMnodac 27.0 65.0 66.8
IDMdac 21.2 38.0 38.0
DIRECT 41.0 227.0 229.5
SQL/DS 8.5 22.8 23.8

Selection without indices

system 100/10000 1000/10000
U-INGRES 53.2 64.4
C-INGRES 38.4 53.9
ORACLE 194.2 230.6
IDMnodac 31.7 33.4
IDMdac 21.6 23.6
DIRECT 43.0 46.0
SQL/DS 15.1 38.1

Selection with indices

system 100/10000 clustered 1000/10000 clustered 100/10000 non-clustered 1000/10000 non-clustered
U-INGRES 7.7 27.8 59.2 78.9
C-INGRES 3.9 18.9 11.4 54.3
ORACLE 16.3 130.0 17.3 129.2
IDMnodac 2.0 9.9 3.8 27.6
IDMdac 1.5 8.7 3.3 23.7
DIRECT 43.0 46.0 43.0 46.0
SQL/DS 3.2 27.5 12.3 39.2

In case you're not familiar with the database universe of 1983: at the time, INGRES was a research project by Stonebraker and Wong at Berkeley that had been commercialized. C-INGRES is the commercial version and U-INGRES is the university version. IDM* are the IDM/500 database machine, the first widely used commercial database machine; dac is with a "database accelerator" and nodac is without. DIRECT was a research project in database machines that was started by DeWitt in 1977.

In Bitton et al.'s work, Oracle's performance stood out as unusually poor.

Larry Ellison wasn't happy with the results and it's said that he tried to have DeWitt fired. Given how difficult it is to fire professors when there's actual misconduct, the probability of Ellison successfully getting someone fired for doing legitimate research in their field was pretty much zero. It's also said that, after DeWitt's non-firing, Larry banned Oracle from hiring Wisconsin grads and Oracle added a term to their EULA forbidding the publication of benchmarks. Over the years, many major commercial database vendors added a license clause that made benchmarking their database illegal.

Today, Oracle hires from Wisconsin, but Oracle still forbids benchmarking of their database. Oracle's shockingly poor performance and Larry Ellison's response have gone down in history; anti-benchmarking clauses are now often known as "DeWitt Clauses", and they've spread from databases to all software, from compilers to cloud offerings1.

Meanwhile, Bitcoin users have created anonymous markets for assassinations -- users can put money into a pot that gets paid out to the assassin who kills a particular target.

Anonymous assassination markets appear to be a joke, but how about anonymous markets for benchmarks? People who want to know what kind of performance a database offers under a certain workload put money into a pot that gets paid out to whoever runs the benchmark.

With things as they are now, you often see comments and blog posts about how someone was using Postgres until management made them switch to "some commercial database" which had much worse performance, and it's hard to tell whether the terrible database was Oracle, MS SQL Server, or perhaps another database.

If we look at major commercial databases today, two out of the three big names in commercial databases forbid publishing benchmarks. Microsoft's SQL Server EULA says:

You may not disclose the results of any benchmark test ... without Microsoft’s prior written approval

Oracle says:

You may not disclose results of any Program benchmark tests without Oracle’s prior consent

IBM is notable for actually allowing benchmarks:

Licensee may disclose the results of any benchmark test of the Program or its subcomponents to any third party provided that Licensee (A) publicly discloses the complete methodology used in the benchmark test (for example, hardware and software setup, installation procedure and configuration files), (B) performs Licensee's benchmark testing running the Program in its Specified Operating Environment using the latest applicable updates, patches and fixes available for the Program from IBM or third parties that provide IBM products ("Third Parties"), and (C) follows any and all performance tuning and "best practices" guidance available in the Program's documentation and on IBM's support web sites for the Program...

This gives people ammunition for a meta-argument that IBM probably delivers better performance than either Oracle or Microsoft, since they're the only company that's not scared of people publishing benchmark results, but it would be nice if we had actual numbers.

Thanks to Leah Hanson and Nathan Wailes for comments/corrections/discussion.


  1. There's at least one cloud service that disallows not only publishing benchmarks, but even "competitive benchmarking", running benchmarks to see how well the competition does. As a result, there's a product I'm told I shouldn't use to avoid even the appearance of impropriety because I work in an office with people who work on cloud related infrastructure.

    An example of a clause like this is the following term in the Salesforce agreement:

    You may not access the Services for purposes of monitoring their availability, performance or functionality, or for any other benchmarking or competitive purposes.

    If you ever wondered why uptime "benchmarking" services like cloudharmony don't include Salesforce, this is probably why. You will sometimes see speculation that Salesforce and other companies with these terms know that their service is so poor that it would be worse to have public benchmarks than to have it be known that they're afraid of public benchmarks.

    [return]

Why don't schools teach debugging?

2014-02-08 08:00:00

In the fall of 2000, I took my first engineering class: ECE 352, an entry-level digital design class for first-year computer engineers. It was standing room only, filled with waitlisted students who would find seats later in the semester as people dropped out. We had been warned in orientation that half of us wouldn't survive the year. In class, we were warned again that half of us were doomed to fail, and that ECE 352 was the weed-out class that would be responsible for much of the damage.

The class moved briskly. The first lecture wasted little time on matters of the syllabus, quickly diving into the real course material. Subsequent lectures built on previous lectures; anyone who couldn't grasp one had no chance at the next. Projects began after two weeks, and also built upon their predecessors; anyone who didn't finish one had no hope of doing the next.

A friend of mine and I couldn't understand why some people were having so much trouble; the material seemed like common sense. The Feynman Method was the only tool we needed.

  1. Write down the problem
  2. Think real hard
  3. Write down the solution

The Feynman Method failed us on the last project: the design of a divider, a real-world-scale project an order of magnitude more complex than anything we'd been asked to tackle before. On the day he assigned the project, the professor exhorted us to begin early. Over the next few weeks, we heard rumors that some of our classmates worked day and night without making progress.

But until 6pm the night before the project was due, my friend and I ignored all this evidence. It didn't surprise us that people were struggling because half the class had trouble with all of the assignments. We were in the half that breezed through everything. We thought we'd start the evening before the deadline and finish up in time for dinner.

We were wrong.

An hour after we thought we'd be done, we'd barely started; neither of us had a working design. Our failures were different enough that we couldn't productively compare notes. The lab, packed with people who had been laboring for weeks alongside those of us who waited until the last minute, was full of bad news: a handful of people had managed to produce a working division unit on the first try, but no one had figured out how to convert an incorrect design into something that could do third-grade arithmetic.

I proceeded to apply the only tool I had: thinking really hard. That method, previously infallible, now yielded nothing but confusion because the project was too complex to visualize in its entirety. I tried thinking about the parts of the design separately, but that only revealed that the problem was in some interaction between the parts; I could see nothing wrong with each individual component. Thinking about the relationship between pieces was an exercise in frustration, a continual feeling that the solution was just out of reach, as concentrating on one part would push some other critical piece of knowledge out of my head. The following semester I would acquire enough experience in managing complexity and thinking about collections of components as black-box abstractions that I could reason about a design another order of magnitude more complicated without problems — but that was three long winter months of practice away, and this night I was at a loss for how to proceed.

By 10pm, I was starving and out of ideas. I rounded up people for dinner, hoping to get a break from thinking about the project, but all we could talk about was how hopeless it was. How were we supposed to finish when the only approach was to flawlessly assemble thousands of parts without a single misstep? It was a tedious version of a deranged Atari game with no lives and no continues. Any mistake was fatal.

A number of people resolved to restart from scratch; they decided to work in pairs to check each other's work. I was too stubborn to start over and too inexperienced to know what else to try. After getting back to the lab, now half empty because so many people had given up, I resumed staring at my design, as if thinking about it for a third hour would reveal some additional insight.

It didn't. Nor did the fourth hour.

And then, just after midnight, a number of our newfound buddies from dinner reported successes. Half of those who started from scratch had working designs. Others were despondent, because their design was still broken in some subtle, non-obvious way. As I talked with one of those students, I began poring over his design. And after a few minutes, I realized that the Feynman method wasn't the only way forward: it should be possible to systematically apply a mechanical technique repeatedly to find the source of our problems. Beneath all the abstractions, our projects consisted purely of NAND gates (woe to those who dug around our toolbox enough to uncover dynamic logic), and a NAND gate outputs a 0 only when both inputs are 1. If the correct output is 0, both inputs should be 1. If the output is, incorrectly, 1, then at least one of the inputs must incorrectly be 0. The same logic can then be applied with the opposite polarity. We did this recursively, finding the source of all the problems in both our designs in under half an hour.
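
In code, the trace we stumbled onto looks something like the sketch below. The netlist and signal values are made up; "golden" is what a correct design would produce and "observed" is what the broken design actually produced, which generalizes the NAND-polarity reasoning above:

# Each gate maps its output net to its two input nets; primary inputs have no entry.
netlist = {"f": ("c", "d"), "c": ("a", "b"), "d": ("b", "e")}
golden = {"a": 1, "b": 1, "e": 0, "c": 0, "d": 1, "f": 1}    # correct values
observed = {"a": 1, "b": 1, "e": 0, "c": 1, "d": 1, "f": 0}  # what the buggy design does

def trace(net):
    # Walk backwards from a wrong net to the gate where the error originates.
    if net not in netlist:                  # primary input: the error starts here
        return [net]
    bad_inputs = [i for i in netlist[net] if observed[i] != golden[i]]
    if not bad_inputs:                      # inputs right, output wrong:
        return [net]                        # this gate (or its wiring) is the bug
    sources = []
    for i in bad_inputs:
        sources += trace(i)
    return sources

print(trace("f"))  # -> ['c']: the gate driving c is where things first go wrong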

We excitedly explained our newly discovered technique to those around us, walking them through a couple steps. No one had trouble; not even people who'd struggled with every previous assignment. Within an hour, the group of folks within earshot of us had finished, and we went home.

I understand now why half the class struggled with the earlier assignments. Without an explanation of how to systematically approach problems, anyone who didn't intuitively grasp the correct solution was in for a semester of frustration. People who were, like me, above average but not great, skated through most of the class and either got lucky or wasted a huge chunk of time on the final project. I've even seen people talented enough to breeze through the entire degree without ever running into a problem too big to intuitively understand; those people have a very bad time when they run into a 10 million line codebase in the real world. The more talented the engineer, the more likely they are to hit a debugging wall outside of school.

What I don't understand is why schools don't teach systematic debugging. It's one of the most fundamental skills in engineering: start at the symptom of a problem and trace backwards to find the source. It takes, at most, half an hour to teach the absolute basics – and even that little bit would be enough to save a significant fraction of those who wash out and switch to non-STEM majors. Using the standard engineering class sequence of progressively more complex problems, a focus on debugging could expand to fill up to a semester, which would be enough to cover an obnoxious real-world bug: perhaps there's a system that crashes once a day when a Blu-ray DVD is repeatedly played using hardware acceleration with a specific video card while two webcams record something with significant motion, as long as an obscure benchmark from 1994 is running1.

This dynamic isn't unique to ECE 352, or even Wisconsin – I saw the same thing when I TA'ed EE 202, a second year class on signals and systems at Purdue. The problems were FFTs and Laplace transforms instead of dividers and Boolean2, but the avoidance of teaching fundamental skills was the same. It was clear, from the questions students asked me in office hours, that those who were underperforming weren't struggling with the fundamental concepts in the class, but with algebra: the problems were caused by not having an intuitive understanding of, for example, the difference between f(x+a) and f(x)+a.

When I suggested to the professor3 that he spend half an hour reviewing algebra for those students who never had the material covered cogently in high school, I was told in no uncertain terms that it would be a waste of time because some people just can't hack it in engineering. I was told that I wouldn't be so naive once the semester was done, because some people just can't hack it in engineering. I was told that helping students with remedial material was doing them no favors; they wouldn't be able to handle advanced courses anyway because some students just can't hack it in engineering. I was told that Purdue has a loose admissions policy and that I should expect a high failure rate, because some students just can't hack it in engineering.

I agreed that a few students might take an inordinately large amount of help, but it would be strange if people who were capable of the staggering amount of memorization required to pass first year engineering classes plus calculus without deeply understanding algebra couldn't then learn to understand the algebra they had memorized. I'm no great teacher, but I was able to get all but one of the office hour regulars up to speed over the course of the semester. An experienced teacher, even one who doesn't care much for teaching, could have easily taught the material to everyone.

Why do we leave material out of classes and then fail students who can't figure out that material for themselves? Why do we make the first couple years of an engineering major some kind of hazing ritual, instead of simply teaching people what they need to know to be good engineers? For all the high-level talk about how we need to plug the leaks in our STEM education pipeline, not only are we not plugging the holes, we're proud of how fast the pipeline is leaking.

Thanks to Kelley Eskridge, @brcpo9, and others for comments/corrections.

Elsewhere


  1. This is an actual CPU bug I saw that took about a month to track down. And this is the easy form of the bug, with a set of ingredients that causes the fail to be reproduced about once a day - the original form of the bug only failed once every few days. I'm not picking this example because it's particularly hard, either: I can think of plenty of bugs that took longer to track down and had stranger symptoms, including a disastrous bug that took six months for our best debugger to understand.

    For ASIC post-silicon debug folks out there, this chip didn't have anything close to full scan, and our only method of dumping state out of the chip perturbed the state of the chip enough to make some bugs disappear. Good times. On the bright side, after dealing with non-deterministic hardware bugs with poor state visibility, software bugs seem easy. At worst, they're boring and tedious because debugging them is a matter of tracing things backwards to the source of the issue.

    [return]
  2. A co-worker of mine told me about a time at Cray when a high-level PM referred to the lack of engineering resources by saying that the project “needed more Boolean.” Ever since, I've thought of digital designers as people who consume caffeine and produce Boolean. I'm still not sure what analog magicians produce. [return]
  3. When I TA'd EE 202, there were two separate sections taught by two different professors. The professor who told me that students who fail just can't hack it was the professor who was more liked by students. He's affable and charismatic and people like him. Grades in his section were also lower than grades under the professor who people didn't like because he was thought to be mean. TA'ing this class taught me quite a bit: that people have no idea who's doing a good job and who's helping them, and also basic signals and systems (I took signals and systems I as an undergrad to fulfill a requirement and showed up to exams and passed them without learning any of the material, so to walk students through signals and systems II, I had to actually learn the material from both signals and systems I and II; before TA'ing the course, I told the department I hadn't taken the class and should probably TA a different class, but they didn't care, which taught another good life lesson). [return]

Do programmers need math?

2014-01-09 08:00:00

Dear David,

I'm afraid my off the cuff response the other day wasn't too well thought out; when you talked about taking calc III and linear algebra, and getting resistance from one of your friends because "wolfram alpha can do all of that now," my first reaction was horror-- which is why I replied that while I've often regretted not taking a class seriously because I've later found myself in a situation where I could have put the skills to good use, I've never said to myself "what a waste of time it was to learn that fundamental mathematical concept and use it enough that I truly understand it."

But could this be selection bias? It's easier to recall the math that I use than the math I don't. To check, let's look at the nine math classes I took as an undergrad. If I exclude the jobs I've had that are obviously math oriented (pure math and CS theory, plus femtosecond optics), and consider only whether I've used math skills in non-math-oriented work, here's what I find: three classes whose material I've used daily for months or years on end (Calc I/II, Linear Algebra, and Calc III); three classes that have been invaluable for short bursts (Combinatorics, Error Correcting Codes, and Computational Learning Theory); one course I would have had use for had I retained any of the relevant information when I needed it (Graduate Level Matrix Analysis); one class whose material I've only relied on once (Mathematical Economics); and only one class I can't recall directly applying to any non-math-y work (Real Analysis). Here's how I ended up using these:

Calculus I/II1: critical for dealing with real physical things as well as physically inspired algorithms. Moreover, one of my most effective tricks is substituting a Taylor or Remez series (or some other approximation function) for a complicated function, where the error bounds aren't too high and great speed is required.

Linear Algebra: although I've gone years without, it's hard to imagine being able to dodge linear algebra for the rest of my career because of how general matrices are.

Calculus III: same as Calc I/II.

Combinatorics: useful for impressing people in interviews, if nothing else. Most of my non-interview use of combinatorics comes from seeing simplifications of seemingly complicated problems; combines well with probability and randomized algorithms.

Error Correcting Codes: there's no substitute when you need ECC. More generally, information theory is invaluable.

Graduate Level Matrix Analysis: had a decade long gap between learning this and working on something where the knowledge would be applicable. Still worthwhile, though, for the same reason Linear Algebra is important.

Real Analysis: can't recall any direct applications, although this material is useful for understanding topology and measure theory.

Computational Learning Theory: useful for making the parts of machine learning people think are scary quite easy, and for providing an intuition for areas of ML that are more alchemy than engineering.

Mathematical Economics: Lagrange multipliers have come in handy sometimes, but more for engineering than programming.

Seven out of nine. Not bad. So I'm not sure how to reconcile my experience with the common sentiment that, outside of a handful of esoteric areas like computer graphics and machine learning, there is no need to understand textbook algorithms, let alone more abstract concepts like math.

Part of it is selection bias in the jobs I've landed; companies that do math-y work are more likely to talk to me. A couple weeks ago, I had a long discussion with a group of our old Hacker School friends, who now do a lot of recruiting at career fairs; a couple of them, whose companies don't operate at the intersection of research and engineering, mentioned that they politely try to end the discussion when they run into someone like me because they know that I won't take a job with them2.

But it can't all be selection bias. I've gotten a lot of mileage out of math even in jobs that are not at all mathematical in nature. Even in low-level systems work that's as far removed from math as you can get, it's not uncommon to find a simple combinatorial proof to show that a solution that seems too stupid to be correct is actually optimal, or correct with high probability; even when doing work that's far outside the realm of numerical methods, it sometimes happens that the bottleneck is a function that can be more quickly computed using some freshman level approximation technique like a Taylor expansion or Newton's method.
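
To make that last trick concrete, here's the kind of substitution I mean, as a rough sketch: the cutoff and the order of the polynomial depend entirely on your error budget, and in Python the builtin will win, but in a hot loop in C, or on hardware without fast transcendentals, this kind of substitution pays off.

import math

def exp_small(x):
    # exp(x) ~= 1 + x + x^2/2 + x^3/6 for |x| << 1, written in Horner form.
    return 1.0 + x * (1.0 + x * (0.5 + x / 6.0))

x = 0.03
print(math.exp(x), exp_small(x))  # agree to roughly 7 decimal places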

Looking back at my career, I've gotten more bang for the buck from understanding algorithms and computer architecture than from understanding math, but I really enjoy math and I'm glad that knowing a bit of it has biased my career towards more mathematical jobs, and handed me some mathematical interludes in profoundly non-mathematical jobs.

All things considered, my real position is a bit more relaxed than I thought: if you enjoy math, taking more classes for the pure joy of solving problems is worthwhile, but math classes aren't the best use of your time if your main goal is to transition from an academic career to programming.



Cheers,
Dan

Russian translation available here


  1. A brilliant but mad lecturer crammed both semesters of the theorem/proof-oriented Apostol text into two months and then started lecturing about complex analysis when we ran out of book. I didn't realize that math is fun until I took this class. This footnote really ought to be on the class name, but rdiscount doesn't let you put a footnote on or in bolded text. [return]
  2. This is totally untrue, by the way. It would be super neat to see what a product oriented role is like. As it is now, I'm five teams removed from any actual customer. Oh well. I'm one step closer than I was in my last job. [return]

Data alignment and caches

2014-01-02 08:00:00

Here's the graph of a toy benchmark1 of page-aligned vs. mis-aligned accesses; it shows a ratio of performance between the two at different working set sizes. If this benchmark seems contrived, it actually comes from a real world example of the disastrous performance implications of using nice power of 2 alignment, or page alignment in an actual system2.

Graph of Sandy Bridge performance / Graph of Westmere performance

Except for very small working sets (1-8), the unaligned version is noticeably faster than the page-aligned version, and there's a large region up to a working set size of 512 where the ratio in performance is somewhat stable, but more so on our Sandy Bridge chip than our Westmere chip.

To understand what's going on here, we have to look at how caches organize data. By way of analogy, consider a 1,000 car parking garage that has 10,000 permits. With a direct mapped scheme (which you could call 1-way associative3), each of the ten permits that has the same 3 least significant digits would be assigned the same spot, i.e., permits 0618, 1618, 2618, and so on, are only allowed to park in spot 618. If you show up at your spot and someone else is in it, you kick them out and they have to drive back home. The next time they get called in to work, they have to drive all the way back to the parking garage.

Instead, if each car's permit allows it to park in a set that has ten possible spaces, we'll call that a 10-way set associative scheme, which gives us 100 sets of ten spots. Each set is now defined by the last 2 significant digits instead of the last 3. For example, with permit 2618, you can park in any spot from the set {018, 118, 218, …, 918}. If all of them are full, you kick out one unlucky occupant and take their spot, as before.

Let's move out of analogy land and back to our benchmark. The main differences are that there isn't just one garage-cache, but a hierarchy of them, from the L14, which is the smallest (and hence, fastest) to the L2 and L3. Each seat in a car corresponds to an address. On x86, each address points to a particular byte. In the Sandy Bridge chip we're running on, we've got a 32kB L1 cache with a 64-byte line size, 64 sets, and 8-way set associativity. In our analogy, a line size of 64 would correspond to a car with 64 seats. We always transfer things in 64-byte chunks and the bottom log₂(64) = 6 bits of an address refer to a particular byte offset in a cache line. The next log₂(64) = 6 bits determine which set an address falls into5. Each of those sets can contain 8 different things, so we have 64 sets * 8 lines/set * 64 bytes/line = 32kB. If we use the cache optimally, we can store 32,768 items. But, since we're accessing things that are page (4k) aligned, we effectively lose the bottom log₂(4k) = 12 bits, which means that every access falls into the same set, and we can only loop through 8 things before our working set is too large to fit in the L1! But if we'd misaligned our data to different cache lines, we'd be able to use 8 * 64 = 512 locations effectively.
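
To see the aliasing concretely, here's a small sketch that computes which L1 set each access falls into under the 64-set, 64-byte-line geometry above (the addresses are made up):

LINE_BITS = 6   # 64-byte lines: bits 0-5 are the byte offset within a line
SET_BITS = 6    # 64 sets: bits 6-11 pick the set

def l1_set(addr):
    return (addr >> LINE_BITS) & ((1 << SET_BITS) - 1)

page_aligned = [i * 4096 for i in range(16)]           # every pointer page aligned
line_offset  = [i * 4096 + i * 64 for i in range(16)]  # nudged by one line each

print({l1_set(a) for a in page_aligned})  # {0}: every access collides in one set
print({l1_set(a) for a in line_offset})   # 16 distinct sets

With only 8 ways per set, the page-aligned pointers start evicting each other once the working set passes 8, which matches the small-working-set region called out above.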

Similarly, our chip has a 512 set L2 cache, of which 8 sets are useful for our page aligned accesses, and a 12288 set L3 cache, of which 192 sets are useful for page aligned accesses, giving us 8 sets * 8 lines / set = 64 and 192 sets * 8 lines / set = 1536 useful cache lines, respectively. For data that's misaligned by a cache line, we have an extra 6 bits of useful address, which means that our L2 cache now has 32,768 useful locations.

In the Sandy Bridge graph above, there's a region of stable relative performance between 64 and 512, as the page-aligned version is running out of the L3 cache and the unaligned version is running out of the L1. When we pass a working set of 512, the relative ratio gets better for the aligned version because it's now an L2 access vs. an L3 access. Our graph for Westmere looks a bit different because its L3 is only 3072 sets, which means that the aligned version can only stay in the L3 up to a working set size of 384. After that, we can see the terrible performance we get from spilling into main memory, which explains why the two graphs differ in shape above 384.

For a visualization of this, you can think of a 32 bit pointer looking like this to our L1 and L2 caches:

TTTT TTTT TTTT TTTT TTTT SSSS SSXX XXXX

TTTT TTTT TTTT TTTT TSSS SSSS SSXX XXXX

The bottom 6 bits are ignored, the next bits determine which set we fall into, and the top bits are a tag that let us know what's actually in that set. Note that page aligning things, i.e., setting the address to

???? ???? ???? ???? ???? 0000 0000 0000

was just done for convenience in our benchmark. Not only will aligning to any large power of 2 cause a problem, generating addresses with a power of 2 offset from each other will cause the same problem.

Nowadays, the importance of caches is well understood enough that, when I'm asked to look at a cache related performance bug, it's usually due to the kind of thing we just talked about: conflict misses that prevent us from using our full cache effectively6. This isn't the only way for that to happen -- bank conflicts and false dependencies are also common problems, but I'll leave those for another blog post.

Resources

For more on caches and memory, see What Every Programmer Should Know About Memory. For something with more breadth, see this blog post for something "short", or Modern Processor Design for something book length. For even more breadth (those two links above focus on CPUs and memory), see Computer Architecture: A Quantitative Approach, which talks about the whole system up to the datacenter level.


  1. The Sandy Bridge is an i7 3930K and the Westmere is a mobile i3 330M [return]
  2. Or anyone who aligned their data too nicely on a calculation with two source arrays and one destination when running on a chip with a 2-way associative or direct mapped cache. This is surprisingly common when you set up your arrays in some nice way in order to do cache blocking, if you're not careful. [return]
  3. Don't call it that. People will look at you funny, the same way they would if you pronounced SQL as squeal or squll. [return]
  4. In this post, L1 refers to the l1d. Since we're only concerned with data, the l1i isn't relevant. Apologies for the sloppy use of terminology. [return]
  5. If it seems odd that the least significant available address bits are used for the set index, that's because of the cardinal rule of computer architecture, make the common case fast -- Google Instant completes “make the common” to “make the common case fast”, “make the common case fast mips”, and “make the common case fast computer architecture”. The vast majority of accesses are close together, so moving the set index bits upwards would cause more conflict misses. You might be able to get away with a hash function that isn't simply the least significant bits, but most proposed schemes hurt about as much as they help while adding extra complexity. [return]
  6. Cache misses are often described using the 3C model: conflict misses, which are caused by the type of aliasing we just talked about; compulsory misses, which are caused by the first access to a memory location; and capacity misses, which are caused by having a working set that's too large for a cache, even without conflict misses. Page-aligned accesses like these also make compulsory misses worse, because prefetchers won't prefetch beyond a page boundary. But if you have enough data that you're aligning things to page boundaries, you probably can't do much about that anyway. [return]

PCA is not a panacea

2013-12-13 08:00:00

Earlier this year, I interviewed with a well-known tech startup, one of the hundreds of companies that claims to have harder interviews, more challenging work, and smarter employees than Google1. My first interviewer, John, gave me the standard tour: micro-kitchen stocked with a combination of healthy snacks and candy; white male 20-somethings gathered around a foosball table; bright spaces with cutesy themes; a giant TV set up for video games; and the restroom. Finally, he showed me a closet-sized conference room and we got down to business.

After the usual data structures and algorithms song and dance, we moved on to the main question: how would you design a classification system for foo2? We had a discussion about design tradeoffs, but the key disagreement was about the algorithm. I said, if I had to code something up in an interview, I'd use a naive matrix factorization algorithm, but that I didn't expect that I would get great results because not everything can be decomposed easily. John disagreed – he was adamant that PCA was the solution for any classification problem.

We discussed the mathematical underpinnings for twenty-five minutes – half the time allocated for the interview – and it became clear that neither of us was going to convince the other with theory. I switched gears and tried the empirical approach, referring to an old result on classifying text with LSA (which can only capture pairwise correlations between words)3 vs. deep learning4. Here's what you get with LSA:

2-d LSA

Each color represents a different type of text, projected down to two dimensions; you might not want to reduce to the dimensionality that much, but it's a good way to visualize what's going on. There's some separation between the different categories; the green dots tend to be towards the bottom right, the black dots are a lot denser in the top half of the diagram, etc. But any classification based on that is simply not going to be very good when documents are similar and the differences between them are nuanced.
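
For reference, this kind of LSA-style projection is only a few lines with off-the-shelf tools; here's a sketch, assuming scikit-learn and some made-up documents in place of the real labeled corpus:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["grep and sed one liners", "awk and sed one liners",
        "nand gates and flip flops", "cache lines and associativity"]

# LSA is just a truncated SVD of a term-document (here tf-idf) matrix.
X = TfidfVectorizer().fit_transform(docs)
coords = TruncatedSVD(n_components=2).fit_transform(X)
print(coords)  # one (x, y) point per document; plot and color by category

Because the decomposition only captures co-occurrence structure, nuanced differences between similar documents mostly wash out, which is what the plot above shows.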

Here's what we get with a deep autoencoder:

2-d deep autoencoder

It's not perfect, but the results are a lot better.

Even after the example, it was clear that I wasn't going to come to an agreement with my interviewer, so I asked if we could agree to disagree and move on to the next topic. No big deal, since it was just an interview. But I see this sort of misapplication of bog standard methods outside of interviews at least once a month, usually with the conviction that all you need to do is apply this linear technique for any problem you might see.

Engineers are the first to complain when consultants with generic business knowledge come in, charge $500/hr and dispense common sense advice while making a mess of the details. But data science is new and hot enough that people get a pass when they call themselves data scientists instead of technology consultants. I don't mean to knock data science (whatever that means), or even linear methods5. They're useful. But I keep seeing people try to apply the same four linear methods to every problem in sight.

In fact, as I was writing this, my girlfriend was in the other room taking a phone interview with the data science group of a big company, where they're attempting to use multivariate regression to predict the performance of their systems, decomposing resource utilization down to the application and query level from the regression coefficients, giving you results like "4000 QPS of foobar uses 18% of the CPU". The question they posed to her, which they're currently working on, was: how do you speed up the regression so that you can push their test system to web scale?

The real question is, why would you want to? There's a reason pretty much every intro grad level computer architecture course involves either writing or modifying a simulator; real system performance is full of non-linear cliffs, the sort of thing where you can't just apply a queuing theory model, let alone a linear regression model. But when all you have are linear hammers, non-linear screws look a lot like nails.

In response to this, John Myles White made the good point that linear vs. non-linear isn't really the right framing, and that there really isn't a good vocabulary for talking about this sort of thing. Sorry for being sloppy with terminology. If you want to be more precise, you can replace each mention of "linear" with "mumble mumble objective function" or maybe "simple".


  1. When I was in college, the benchmark was MS. I wonder who's going to be next. [return]
  2. I'm not disclosing the exact problem because they asked to keep the interview problems a secret, so I'm describing a similar problem where matrix decomposition has the same fundamental problems. [return]
  3. If you're familiar with PCA and not LSA, you can think of LSA as something PCA-like [return]
  4. http://www.sciencemag.org/content/313/5786/504.abstract, http://www.cs.toronto.edu/~amnih/cifar/talks/salakhut_talk.pdf. In a strict sense, this work was obsoleted by a slew of papers from 2011 which showed that you can achieve similar results to this 2006 result with "simple" algorithms, but it's still true that current deep learning methods are better than the best "simple" feature learning schemes, and this paper was the first example that came to mind. [return]
  5. It's funny that I'm writing this blog post because I'm a huge fan of using the simplest thing possible for the job. That's often a linear method. Heck, one of my most common tricks is to replace a complex function with a first order Taylor expansion. [return]

Why hardware development is hard

2013-11-10 08:00:00

In CPU design, most successful teams have a fairly long lineage and rely heavily on experienced engineers. When we look at CPU startups, teams that have a successful exit often have a core team that's been together for decades. For example, PA Semi's acquisition by Apple was a moderately successful exit, but where did that team come from? They were the SiByte team, which left after SiByte was acquired by Broadcom, and SiByte was composed of many people from DEC who had been working together for over a decade. My old company was similar: an IBM fellow who had been a very early Dell employee and then exec (back when Dell still did interesting design work) collected the best people he'd worked with at IBM, then split off to create a chip startup. There have been quite a few CPU startups that have raised tens to hundreds of millions and leaned heavily on inexperienced labor: fresh PhDs and hardware engineers with only a few years of experience. Every single such startup I know of failed1.

This is in stark contrast to software startups, where it's common to see successful startups founded by people who are just out of school (or who dropped out of school). Why should microprocessors be any different? It's unheard of for a new, young team to succeed at making a high-performance microprocessor, although this hasn't stopped people from funding these efforts.

In software, it's common to hear disdain for experience, such as Zuckerberg's comment, "I want to stress the importance of being young and technical. Young people are just smarter." Even when people don't explicitly devalue experience, they often don't value it either. As of this writing, Joel Spolsky's "Smart and gets things done" is probably the most influential piece of writing on software hiring. Note that it doesn't say "smart, experienced, and gets things done". Just "smart and gets things done" appears to be enough; no experience required. If you lean more towards the Paul Graham camp than the Joel Spolsky camp, there will be a lot of differences in how you hire, but Paul's advice is the same in that experience doesn't rank as one of his most important criteria, except as a diss.

Let's say you wanted to hire a plumber or a carpenter. What would you choose: "smart and gets things done" or "experienced and effective"? Ceteris paribus, I'll go for "experienced and effective", doubly so if it's an emergency.

Physical work isn't the kind of thing you can derive from first principles, no matter how smart you are. Consider South Korea after WWII. Its GDP per capita was lower than Ghana's or Kenya's, and just barely above the Congo's. For various reasons, the new regime didn't have to deal with legacy institutions, and they wanted Korea to become a first-world nation.

The story I've heard is that the government started by subsidizing concrete. After many years making concrete, they wanted to move up the chain and start more complex manufacturing. They eventually got to building ships, because shipping was a critical part of the export economy they wanted to create.

They pulled some of their best business people who had learned skills like management and operations in other manufacturing. Those people knew they didn't have the expertise to build ships themselves, so they contracted it out. They made the choice to work with Scottish firms, because Scotland has a long history of shipbuilding. Makes sense, right?

It didn't work. For historical and geographic reasons, Scotland's shipyards weren't full-sized; they built their ships in two halves and then assembled them. That worked fine for them, because they'd been doing it at scale since the 1800s and had world-renowned expertise by the 1900s. But when the unpracticed Koreans tried to build ships using Scottish plans and detailed step-by-step directions, the result was two ship halves that didn't quite fit together and sank when assembled.

The Koreans eventually managed to start a shipbuilding industry by hiring foreign companies to come and build ships locally, showing people how it's done. And it took decades to get what we would consider basic manufacturing working smoothly, even though one might think that all of the requisite knowledge existed in books, was taught in university courses, and could be had from experts for a small fee. Now, their manufacturing industries are world class, e.g., according to Consumer Reports, Hyundai and Kia produce reliable cars. Going from producing unreliable econoboxes to reliable cars you can buy took over a decade, like it did for Toyota when they did it decades earlier. If there's a shortcut to quality other than hiring a lot of people who've done it before, no one's discovered it yet.

Today, any programmer can take Geoffrey Hinton's course on neural networks and deep learning, and start applying state of the art machine learning techniques. In software land, you can fix minor bugs in real time. If it takes a whole day to run your regression test suite, you consider yourself lucky because it means you're in one of the few environments that takes testing seriously. If the architecture is fundamentally flawed, you pull out your copy of Feathers' “Working Effectively with Legacy Code” and repeatedly apply fixes.

This isn't to say that software isn't hard, but there are a lot of valuable problems that don't need a decade of hard-won experience to attack. But if you want to build a ship, and you "only" have a decade of experience with carpentry, milling, metalworking, etc., well, good luck. You're going to need it. With a large ship, “minor” fixes can take days or weeks, and a fundamental flaw means that your ship sinks and you've lost half a year of work and tens of millions of dollars. By the time you get to something with the complexity of a modern high-performance microprocessor, a minor bug discovered in production costs three months and millions of dollars. A fundamental flaw in the architecture will cost you five years and hundreds of millions of dollars2.

Physical mistakes are costly. There's no undo and editing isn't simply a matter of pressing some keys; changes consume real, physical resources. You need enough wisdom and experience to avoid common mistakes entirely – especially the ones that can't be fixed.

CPU internals series

2021 comments

In retrospect, I think that I was too optimistic about software in this post. If we're talking about product-market fit and success, I don't think the attitude in the post is wrong, and people with little to no experience often do create hits. But now that I've been in the industry for a while and talked to numerous people about infra at various startups as well as large companies, I think creating high quality software infra requires no less experience than creating high quality physical items. Companies that decided this wasn't the case and hired a bunch of smart folks from top schools to build their infra have ended up with low quality, unreliable, expensive, and difficult to operate infrastructure. It just turns out that, if you have very good product-market fit, you don't need your infra to work. Your company can survive and even thrive while having infra that has 2 9s of uptime and costs an order of magnitude more than your competitor's infra, or if your product's architecture means that it can't possibly work correctly. You'll make less money than you would've otherwise, but the high order bits are all on the product side. If you contrast that with chip companies staffed by inexperienced engineers that didn't produce a working product, well, you can't really sell a product that doesn't work even if you try. If you get very lucky, like if you happened to start a deep learning chip company at the right time, you might get a big company to acquire your non-working product. But it's much harder to get an exit like that for a microprocessor.


  1. Comparing my old company to another x86 startup founded within the same year is instructive. Both started at around the same time. Both had great teams of smart people. Our competitor even had famous software and business people on their side. But it's notable that their hardware implementers weren't a core team of multi-decade industry veterans who had worked together before. It took us about two years to get a working x86 chip, on top of $15M in funding. Our goal was to produce a low-cost chip and we nailed it. It took them five years, with over $250M in funding. Their original goal was to produce a high performance low-power processor, but they missed their performance target so badly that they were forced into the low-cost space. They ended up with worse performance than us, with a chip that was 50% bigger (and hence, cost more than 50% more to produce), using a team four times our size. They eventually went under, because there's no way they could survive with 4x our burn rate and weaker performance. But, not before burning through $969M in funding (including $230M from patent lawsuits). [return]
  2. A funny side effect of the importance of experience is that age discrimination doesn't affect the areas I've worked in. At 30, I'm bizarrely young for someone who's done microprocessor design. The core folks at my old place were in their 60s. They'd picked up some younger folks along the way, but 30? Freakishly young. People are much younger at the new gig: I'm surrounded by ex-supercomputer folks from Cray and SGI, who are barely pushing 50, along with a couple kids from Synplify and DESRES who, at 40, are unusually young. Not all hardware folks are that old. In another arm of the company, there are folks who grew up in the FPGA world, which is a lot more forgiving. In that group, I think I met someone who's only a few years older than me. Kidding aside, you'll see younger folks doing RTL design on complex projects at large companies that are willing to spend a decade mentoring folks. But, at startups and on small hardware teams that move fast, it's rare to hire someone into design who doesn't have a decade of experience.

    There's a crowd that's even younger than the FPGA folks, even younger than me, working on Arduinos and microcontrollers, doing hobbyist electronics and consumer products. I'm genuinely curious how many of those folks will decide to work on large-scale systems design. In one sense, it's inevitable, as the area matures, and solutions become more complex. The other sense is what I'm curious about: will the hardware renaissance spark an interest in supercomputers, microprocessors, and warehouse-scale computers?

    [return]

How to discourage open source contributions

2013-10-27 08:00:00

What's the first thing you do when you find a bug or see a missing feature in an open source project? Check out the project page and submit a patch!

Send us a pull request! (116 open pull requests)

Oh. Maybe their message is so encouraging that they get hundreds of pull requests a week, and the backlog isn't that bad.

Multiple people ask why this bug fix is being ignored. No response.

Maybe not. Giant sucker that I am, I submitted a pull request even after seeing that. All things considered, I should consider myself lucky that it's possible to submit pull requests at all. If I'm really lucky, maybe they'll get around to looking at it one day.

I don't mean to pick on this particular project. I can understand how this happens. You're a dev who can merge pull requests, but you're not in charge of triaging bugs and pull requests; you have a day job, projects that you own, and a life outside of coding. Maybe you take a look at the repo every once in a while, merge in good pull requests, and make comments on the ones that need more work, but you don't look at all 116 open pull requests; who has that kind of time?

This behavior, eminently reasonable on the part of any individual, results in a systemic failure, a tax on new open source contributors. I often get asked how to get started with open source. It's easy for me to forget that getting started can be hard because the first projects I contributed to have a response time measured in hours for issues and pull requests1. But a lot of people have experiences which aren't so nice. They contribute a few patches to a couple projects that get ignored, and have no idea where to go from there. It doesn't take egregious individual behavior to create a hostile environment.

That's kept me from contributing to some projects. At my last job, I worked on making a well-known open source project production quality, fixing hundreds of bugs over the course of a couple months. When I had some time, I looked into pushing the changes back to the open source community. But when I looked at the mailing list for the project, I saw a wasteland of good patches that were completely ignored, where the submitter would ping the list a couple times and then give up. Did it seem worth spending a week to disentangle our IP from the project in order to submit a set of patches that would, in all likelihood, get ignored? No.

If you have commit access to a project that has this problem, please own the process for incoming pull requests (or don't ask for pull requests in your repo description). It doesn't have to be permanent; just until you have a system in place2. Not only will you get more contributors to your project, you'll help break down one barrier to becoming an open source contributor.

joewiz replies to a month-old comment. Asks for review months later. No reply.

For an update on the repo featured in this post, check out this response to a breaking change.


  1. Props to OpenBlas, Rust, jslinux-deobfuscated, and np for being incredibly friendly to new contributors. [return]
  2. I don't mean to imply that this is trivial. It can be hard, if your project doesn't have an accepting culture, but there are popular, high traffic projects that manage to do it. If all else fails, you can always try the pull request hack. [return]

Randomize HN

2013-10-04 08:00:00

You ever notice that there's this funny threshold for getting to the front page on sites like HN? The exact threshold varies depending on how much traffic there is, but, for articles that aren't wildly popular, there's this moment when the article is at N-1 votes. There is, perhaps, a 60% chance that the vote will come and the article will get pushed to the front page, where it will receive a slew of votes. There is, maybe, a 40% chance it will never get the vote that pushes it to the front page, causing it to languish in obscurity forever.

It's non-optimal that an article that will receive 50 votes in expectation has a 60% chance of getting 100+ votes, and a 40% chance of getting 2 votes. Ideally, each article would always get its expected number of votes and stay on the front page for the expected amount of time, giving readers exposure to the article in proportion to its popularity. Instead, by random happenstance, plenty of interesting content never makes it to the front page, and as a result, the content that does make it gets a higher than optimal level of exposure.

You also see the same problem, with the sign bit flipped, on low traffic sites that push things to the front page the moment they're posted, like lobste.rs and the smaller sub-reddits: they displace links that most people would be interested in by putting links that almost no one cares about on the front page just so that the few things people do care about get enough exposure to be upvoted. On reddit, users "fix" this problem by heavily downvoting most submissions, pushing them off the front page, resulting in a problem that's fundamentally the same as the problem HN has.

Instead of implementing something simple and easy to optimize, sites pile on ad hoc rules. Reddit implemented the rising page, but it fails to solve the problem. On low-traffic subreddits, like r/programming, the threshold is so high that it's almost always empty. On high-traffic sub-reddits, anything that's upvoted enough to make it to the rising page is already wildly successful, and whether or not an article becomes successful is heavily dependent on whether or not the first couple voters happen to be people who upvote the post instead of downvoting it, i.e., the problem of getting onto the rising page is no different than the problem of getting to the top normally.

HN tries to solve the problem by manually penalizing certain domains and keywords. That doesn't solve the problem for the 95% of posts that aren't penalized. For posts that don't make it to the front page, the obvious workaround is to delete and re-submit your post if it doesn't make the front page the first time around, but that's now a ban worthy offense. Of course, people are working around that, and HN has a workaround for the workaround, and so on. It's endless. That's the problem with "simple" ad hoc solutions.

There's an easy fix, but it's counter-intuitive. By adding a small amount of random noise to the rank of an article, we can smooth out the discontinuity between making it onto the front page and languishing in obscurity. The math is simple, but the intuition is even simpler1. Imagine a vastly oversimplified model where, for each article, every reader upvotes with a fixed probability and the front page gets many more eyeballs than the new page. The result follows. If you like, you can work through the exercise with a more realistic model, but the result is the same2.
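Here's a minimal sketch of what that could look like, using the scoring formula quoted in the footnote below; the shape and scale of the noise are made-up tuning parameters, not anything HN actually used.

import random

def rank(votes, age_hours, noise_scale=0.2):
    """Score an article per the formula in footnote 2, plus a bit of noise.

    The noise term means an article sitting just below the front-page
    threshold still has some chance of being shown there on any given
    page load, smoothing out the all-or-nothing discontinuity.
    noise_scale is a made-up tuning parameter.
    """
    base = (votes - 1) / (age_hours + 2) ** 1.5
    return base + random.gauss(0, noise_scale * max(base, 0.01))

# Two articles near the threshold: without noise, the second one would
# never be ranked above the first; with noise, it sometimes is.
front_page = sorted(
    [("article A", 5, 1.0), ("article B", 4, 1.0)],
    key=lambda a: rank(a[1], a[2]),
    reverse=True,
)
print([title for title, _, _ in front_page])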

Adding noise to smooth out a discontinuity is a common trick when you can settle for an approximate result. I recently employed it to work around the classic floating point problem, where adding a tiny number to a large number results in no change, which is a problem when adding many small numbers to some large numbers3. For a simple example of applying this, consider keeping a reduced precision counter that uses loglog(n) bits to store the value. Let countVal(x) = 2^x and inc(x) = if (rand(2^x) == 0) x++4. Like understanding when to apply Taylor series, this is a simple trick that people are often impressed by if they haven't seen it before5.
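Here's that counter as a short sketch (the class and names are mine, just for illustration):

import random

class ApproxCounter:
    """Reduced-precision counter from the example above: store x, where the
    estimated count is 2^x, and x is incremented with probability 1/2^x."""

    def __init__(self):
        self.x = 0

    def inc(self):
        # rand(2^x) returns an integer in [0, 2^x - 1]; increment x only
        # when it comes up 0, i.e., with probability 1/2^x.
        if random.randrange(2 ** self.x) == 0:
            self.x += 1

    def value(self):
        return 2 ** self.x

c = ApproxCounter()
for _ in range(1_000_000):
    c.inc()
# Prints a power of two; individual runs are noisy, but the estimate is
# correct in expectation (up to an off-by-one).
print(c.value())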

Update: HN tried this! Dan Gackle tells me that it didn't work very well (it resulted in a lot of low quality junk briefly hitting the front page and then disappearing). I think that might be fixable by tweaking some parameters, but the solution that HN settled on, having a human (or multiple humans) put submissions that are deemed to be good or interesting into a "second chance queue" that boosts the submission onto the front page, works better than a simple randomized algorithm with no direct human input could with any amount of parameter tweaking. I think this is also true of moderation, where the "new" dang/sctb moderation regime has resulted in a marked increase in comment quality, probably better than anything that could be done with an automated ML-based solution today — Google and FB have some of the most advanced automated systems in the world, and the quality of the result is much worse than what we see on HN.

Also, at the time this post was written (2013), the threshold to get onto the front page was often 2-3 votes, making the marginal impact of a random passerby who happens to like a submission checking the new page very large. Even during off peak times now (in 2019), the threshold seems to be much higher, reducing the amount of randomness. Additionally, the rise in the popularity of HN increased the sheer volume of low quality content that languishes on the new page, which would reduce the exposure that any particular "good" submission would get if it were among the 30 items on the new page that would randomly get boosted onto the front page. That doesn't mean there aren't still problems with the current system: most people seem to upvote and comment based on the title of the article and not the content (to check this, read the comments of articles that are mistitled before someone calls this out for a particular post — it's generally quite clear that most commenters haven't even skimmed the article, let alone read it), but that's a topic for a different post.


  1. Another way to look at it is that it's A/B testing for upvotes (though, to be pedantic, it's actually closer to a multi-armed bandit). Another is that the distribution of people reading the front page and the new page aren't the same, and randomizing the front page prevents the clique that reads the new page from having undue influence. [return]
  2. If you want to do the exercise yourself, pg once said the formula for HN is: (votes - 1) / (time + 2)^1.5. It's possible the power of the denominator has been tweaked, but as long as it's greater than 1.0, you'll get a reasonable result. [return]
  3. Kahan summation wasn't sufficient, for the same fundamental reason it won't work for the simplified example I gave above. [return]
  4. Assume we use a rand function that returns a non-negative integer between 0 and n-1, inclusive. With x = 0, we start counting from 1, as God intended. inc(0) will definitely increment, so we'll increment and correctly count to countVal(1) = 2^1 = 2. Next, we'll increment with probability ½; we'll have to increment twice in expectation to increase x. That works out perfectly because countVal(2) = 2^2 = 4, so we want to increment twice before increasing x. Then we'll increment with probability ¼, and so on and so forth. [return]
  5. See Mitzenmacher for a good introduction to randomized algorithms that also has an explanation of all the math you need to know. If you already apply Chernoff bounds in your sleep, and want something more in-depth, Motwani & Raghavan is awesome. [return]

Writing safe Verilog

2013-09-15 08:00:00

PL troll: a statically typed language with no type declarations. Types are determined entirely using Hungarian notation

Troll? That's how people write Verilog1. At my old company, we had a team of formal methods PhDs who wrote a linter that typechecked our code, based on our naming convention. For our chip (which was small for a CPU), building a model (compiling) took about five minutes, running a single short test took ten to fifteen minutes, and long tests took CPU months. The value of a linter that can run in seconds should be obvious, not even considering the fact that it can take hours of tracing through waveforms to find out why a test failed2.

Let's look at some of the most commonly used naming conventions.

Pipeline stage

When you pipeline hardware, you end up with many versions of the same signal, one for each stage of the pipeline the signal traverses. Even without static checks, you'll want some simple way to differentiate between these, so you might name them foo_s1, foo_s2, and foo_s3, indicating that they originate in the first, second, and third stages, respectively. In any particular stage, a signal is most likely to interact with other signals in the same stage; it's often a mistake when logic from other stages is accessed. There are reasons to access signals from other stages, like bypass paths and control logic that looks at multiple stages, but logic that stays contained within a stage is common enough that it's not too tedious to either “cast” or add a comment that disables the check, when looking at signals from other stages.

Clock domain

Accessing a signal in a different clock domain without synchronization is like accessing a data structure from multiple threads without synchronization. Sort of. But worse. Much worse. Driving combinational logic from a metastable state (where the signal is sitting between a 0 and 1) can burn a massive amount of power3. Here, I'm not just talking about being inefficient. If you took a high-power chip from the late 90s and removed the heat sink, it would melt itself into the socket, even under normal operation. Modern chips have such a high maximum possible power consumption that they would self destruct if you disabled the thermal regulation, even with the heat sink. Logic that's floating at an intermediate value not only uses a lot of power, it bypasses a chip's usual ability to reduce power by slowing down the clock4. Using cross clock domain signals without synchronization is a bad idea, unless you like random errors, high power dissipation, and the occasional literal meltdown.

Module / Region

In high speed designs, it's an error to use a signal that's sourced from another module without registering it first. This will insidiously sneak through simulation; you'll only notice when you look at the timing report. On the last chip I worked on, it took about two days to generate a timing report. If you accidentally reference a signal from a distant module, not only will you not meet your timing budget for that path, the synthesis tool will allocate resources to try to make that path faster, which will slow down everything else5, making the entire timing report worthless6.

PL Trolling

I'd been feeling naked at my new gig, coding Verilog without any sort of static checking. I put off writing my own checker, because static analysis is one of those scary things you need a PhD to do, right? And writing a parser for SystemVerilog is a ridiculously large task7. But, it turns out that you don't need much of a parser, and all the things I've talked about are simple enough that half an hour after starting, I had a tool that found seven bugs, with only two false positives. I expect we'll have 4x as much code by the time we're done, so that's 28 bugs from half an hour of work, not even considering the fact that two of the bugs were in heavily used macros.
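To give a sense of how little machinery this takes, here's a toy sketch of the pipeline-stage check in the same spirit. This is not the tool I wrote; a real checker would also handle the clock domain and module conventions above, plus whatever waiver comment style your team uses.

import re
import sys

# Flag lines that mix signals from different pipeline stages
# (foo_s1, bar_s2, ...) unless the line carries a waiver comment.
STAGE_RE = re.compile(r"\b\w+_s(\d+)\b")
WAIVER = "// lint: cross-stage ok"

def check_stage_mixing(path):
    problems = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if WAIVER in line:
                continue
            stages = set(STAGE_RE.findall(line))
            if len(stages) > 1:
                problems.append((lineno, line.strip(), sorted(stages)))
    return problems

if __name__ == "__main__":
    for lineno, text, stages in check_stage_mixing(sys.argv[1]):
        print(f"{sys.argv[1]}:{lineno}: mixes pipeline stages {stages}: {text}")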

I think I'm done for the day, but there are plenty of other easy things to check that will certainly find bugs (e.g., checking for regs/logic that are declared or assigned, but not used). Whenever I feel like tackling a self-contained challenge, there are plenty of not-so-easy things, too (e.g., checking if things aren't clock gated or power gated when they should be, which isn't hard to do statistically, but is non-trivial statically).

Huh. That wasn't so bad. I've now graduated to junior PL troll.


  1. Well, people usually use suffixes as well as prefixes. [return]
  2. You should, of course, write your own tool to script interaction with your waveform viewer because waveform viewers have such poor interfaces, but that's a whole ‘nother blog post. [return]
  3. In static CMOS there's a network of transistors between power and output, and a dual network between ground and output. As a first-order approximation, only one of the two networks should be on at a time, except when switching, which is why switching logic gates use more power than unchanging gates -- in addition to the power used to discharge the capacitance that the output is driving, there is, briefly, a direct connection from power to ground. If a gate gets stuck in a half-on state, there's a constant connection from power to ground. [return]
  4. In theory, power gating could help, but you can't just power gate some arbitrary part of the chip that's too hot. [return]
  5. There are a number of reasons that this completely destroys the timing report. First, for any high-speed design, there's not enough fast (wide) interconnect to go around. Gates are at the bottom, and wires sit above them. Wires get wider and faster in higher layers, but there's congestion getting to and from the fast wires, and relatively few of them. There are so few of them that people pre-plan where modules should be placed in order to have enough fast interconnect to meet timing demands. If you steal some fast wires to make some slow path fast, anything relying on having a fast path through that region is hosed. Second, the synthesis tool tries to place sources near sinks, to reduce both congestion and delay. If you place a sink on a net that's very far from the rest of the sinks, the source will migrate halfway in between, to try to match the demands of all the sinks. This is recursively bad, and will pull all the second order sources away from their optimal location, and so on and so forth. [return]
  6. With some tools, you can have them avoid optimizing paths that fail timing by more than a certain margin, but there's still always some window where a bad path will destroy your entire timing report, and it's often the case that there are real critical paths that need all the resources the synthesis tool can throw at it to make it across the chip in time. [return]
  7. The SV standard is 1300 pages long, vs 800 for C++, 500 for C, 300 for Java, and 30 for Erlang. [return]

Verilog is weird

2013-09-07 08:00:00

Verilog is the most commonly used language for hardware design in America (VHDL is more common in Europe). Too bad it's so baroque. If you ever browse the Verilog questions on Stack Overflow, you'll find a large number of questions, usually downvoted, asking “why doesn't my code work?”, with code that's not just a little off, but completely wrong.

6 questions, all but one with negative score

Let's look at an example: “Idea is to store value of counter at the time of reset . . . I get DRC violations and the memory, bufreadaddr, bufreadval are all optimized out.”

always @(negedge reset or posedge clk) begin
  if (reset == 0) begin
    d_out <= 16'h0000;
    d_out_mem[resetcount] <= d_out;
    laststoredvalue <= d_out;
  end else begin
    d_out <= d_out + 1'b1;
  end
end

always @(bufreadaddr)
  bufreadval = d_out_mem[bufreadaddr];

We want a counter that keeps track of how many cycles it's been since reset, and we want to store that value in an array-like structure that's indexed by resetcount. If you've read a bit on the semantics of Verilog, this is a perfectly natural way to solve the problem. Our poster knows enough about Verilog to use ‘<=' in state elements, so that all of the elements are updated at the same time. Every time there's a clock edge, we'll increment d_out. When reset is 0, we'll store that value and reset d_out. What could possibly go wrong?

The problem is that Verilog was originally designed as a language to describe simulations, so it has constructs to describe arbitrary interactions between events. When X transitions from 0 to 1, do Y. Great! Sounds easy enough. But then someone had the bright idea of using Verilog to represent hardware. The vast majority of statements you could write down don't translate into any meaningful hardware. Your synthesis tool, which translates from Verilog to hardware, will helpfully pattern match to the closest available thing, or produce nothing, if you write down something untranslatable. If you're lucky, you might get some warnings.

Looking at the code above, the synthesis tool will see that there's something called d_out which should be a clocked element that's set to something when it's not being reset, and is asynchronously reset otherwise. That's a legit hardware construct, so it will produce an N-bit flip-flop and some logic to make it a counter that gets reset to 0. BTW, this paragraph used to contain a link to http://en.wikipedia.org/wiki/Flip-flop(electronics), but ever since I switched to Hugo, my links to URLs with parens in them are broken, so maybe try copy+pasting that URL into your browser window if you want to know what a flip-flop is.

Now, what about the value we're supposed to store on reset? Well, the synthesis tool will see that it's inside a block that's clocked. But it's not supposed to do anything when the clock is active; only when reset is asserted. That's pretty unusual. What's going to happen? Well, that depends on which version of which synthesis tool you're using, and how the programmers of that tool decided to implement undefined behavior.

And then there's the block that's supposed to read out the stored value. It looks like the intent is to create a 64:1 MUX. Putting aside the cycle time issues you'll get with such a wide MUX, the block isn't clocked, so the synthesis tool will have to infer some sort of combinational logic. But, the output is only supposed to change if bufreadaddr changes, and not if d_out_mem changes. It's quite easy to describe that in our simulation language, but the synthesis tool is going to produce something that is definitely not what the user wants here. Not to mention that laststoredvalue isn't meaningfully connected to bufreadval.

How is it possible that a reasonable description of something in Verilog turns into something completely wrong in hardware? You can think of hardware as some state, with pure functions connecting the state elements. This makes it natural to think about modeling hardware in a functional programming language. Another natural way to think about it would be with OO. Classes describe how the hardware works. Instances of the class are actual hardware that will get put onto the chip. Yet another natural way to describe things would be declaratively, where you write down constraints the hardware must obey, and the synthesis tool outputs something that meets those constraints.

Verilog does none of these things. To write Verilog that will produce correct hardware, you have to first picture the hardware you want to produce. Then, you have to figure out how to describe that in this weird C-like simulation language. That will then get synthesized into something like what you were imagining in the first step.

As a software engineer, how would you feel if 99% of valid Java code ended up being translated to something that produced random results, even though tests pass on the untranslated Java code? And, by the way, to run tests on the translated Java code you have to go through a multi-day long compilation process, after which your tests will run 200 million times slower than code runs in production. If you're thinking of testing on some sandboxed production machines, sure, go ahead, but it costs 8 figures to push something to any number of your production machines, and it takes 3 months. But, don't worry, you can run the untranslated code only 2 million times slower than in production 1. People used to statically typed languages often complain that you get run-time errors about things that would be trivial to statically check in a language with stronger types. We hardware folks are so used to the vast majority of legal Verilog constructs producing unsynthesizable garbage that we don't find it the least bit surprising that not only do you not get compile-time errors, you don't even get run-time errors, from writing naive Verilog code.

Old school hardware engineers will tell you that it's fine. It's fine that the language is so counter-intuitive that almost all people who initially approach Verilog write code that's not just wrong but nonsensical. "All you have to do is figure out the design and then translate it to Verilog". They'll tell you that it's totally fine that the mental model you have of what's going on is basically unrelated to the constructs the language provides, and that they never make errors now that they're experienced, much like some experienced C programmers will erroneously tell you that they never have security related buffer overflows or double frees or memory leaks now that they're experienced. It reminds me of talking to assembly programmers who tell me that assembly is as productive as a high level language once you get your functions written. Programmers who haven't talked to old school assembly programmers will think I'm making that up, but I know a number of people who still maintain that assembly is as productive as any high level language out there. But people like that are rare and becoming rarer. With hardware, we train up a new generation of people who think that Verilog is as productive as any language could be every few years!

I won't even get into how Verilog is so inexpressive that many companies use an ad hoc tool to embed a scripting language in Verilog or generate Verilog from a scripting language.

There have been a number of attempts to do better than jamming an ad hoc scripting language into Verilog, but they've all fizzled out. As a functional language that's easy to add syntax to, Haskell is a natural choice for Verilog code generation; it spawned ForSyDe, Hydra, Lava, HHDL, and Bluespec. But adoption of ForSyDe, Hydra, Lava, and HHDL is pretty much zero, not because of deficiencies in the language, but because it's politically difficult to get people to use a Haskell based language. Bluespec has done better, but they've done it by making their language look C-like, scrapping the original Haskell syntax and introducing Bluespec SystemVerilog and Bluespec SystemC. The aversion to Haskell is so severe that when we discussed a hardware style at my new gig, one person suggested banning any Haskell based solution, even though Bluespec has been used to good effect in a couple projects within the company.

Scala based solutions look more promising, not for any technical reason, but because Scala is less scary. Scala has managed to bring the modern world (in terms of type systems) to more programmers than ML, Ocaml, Haskell, Agda, etc., combined. Perhaps the same will be true in the hardware world. Chisel is interesting. Like Bluespec, it simulates much more quickly than Verilog, and unsynthesizable representations are syntax errors. It's not as high level, but it's the only hardware description language with a modern type system that I've been able to discuss with hardware folks without people objecting that Haskell is a bad idea.

Commercial vendors are mostly moving in the other direction because C-like languages make people feel all warm and fuzzy. A number of them are pushing high-level hardware synthesis from SystemC, or even straight C or C++. These solutions are also politically difficult to sell, but this time it's the history of the industry, and not the language. Vendors pushing high-level synthesis have a decades long track record of overpromising and underdelivering. I've lost track of the number of times I've heard people dismiss modern offerings with “Why should we believe that they're for real this time?”

What's the future? Locally, I've managed to convince a couple of people on my team that Chisel is worth looking at. At the moment, none of the Haskell based solutions are even on the table. I'm open to suggestions.

CPU internals series

P.S. Dear hardware folks, sorry for oversimplifying so much. I started writing footnotes explaining everything I was glossing over until I realized that my footnotes were longer than the post. The culled footnotes may make it into their own blog posts some day. A very long footnote that I'll briefly summarize is that semantically correct Verilog simulation is inherently slower than something like Bluespec or Chisel because of the complications involved with the event model. EDA vendors have managed to get decent performance out of Verilog, but only by hiring large teams of the best simulation people in the world to hammer at the problem, the same way JavaScript is fast not because of any property of the language, but because there are amazing people working on the VM. It should tell you something when a tiny team working on a shoestring grant-funded budget can produce a language and simulation infrastructure that smokes existing tools.

You may wonder why I didn't mention linters. They're a great idea and, for reasons I don't understand, two of the three companies I've done hardware development for haven't used linters. If you ask around, everyone will agree that they're a good idea, but even though a linter runs in the thousands to tens of thousands of dollars range, and engineers run in the hundreds of thousands of dollars range, it hasn't been politically possible to get a linter even on multi-person teams that have access to tools that cost tens or hundreds of thousands of dollars per license per year. Even though linters are a no-brainer, companies that spend millions to tens of millions a year on hardware development often don't use them, and good SystemVerilog linters are all out of the price range of the people who are asking Stack Overflow questions that get downvoted to oblivion.


  1. Approximate numbers from the last chip I worked on. We had licenses for both major commercial simulators, and we were lucky to get 500Hz, pre-synthesis, on the faster of the two, for a chip that ran at 2GHz in silicon. Don't even get me started on open source simulators. The speed is at least 10x better for most ASIC work. Also, you can probably do synthesis much faster if you don't have timing / parasitic extraction baked into the process. [return]

About danluu.com

2013-09-01 08:00:00

About The Blog

This started out as a way to jot down thoughts on areas that seem interesting but underappreciated. Since then, this site has grown to the point where it gets millions of hits a month and I see that it's commonly cited by professors in their courses and on stackoverflow.

That's flattering, but more than anything else, I view that as a sign that there's a desperate shortage of understandable explanations of technical topics. There's nothing here that most of my co-workers don't know (with the exception of maybe three or four posts where I propose novel ideas). It's just that they don't blog and I do. I'm not going to try to convince you to start writing a blog, since that has to be something you want to do, but I will point out that there's a large gap that's waiting to be filled by your knowledge. When I started writing this blog, I figured almost no one would ever read it; sure, Joel Spolsky and Steve Yegge created widely read blogs, but that was back when almost no one was blogging. Now that there are millions of blogs, there's just no way to start a new blog and get noticed. Turns out that's not true.

This site also archives a few things that have fallen off the internet, like this history of subspace, the 90s video game, the su3su2u1 introduction to physics, the su3su2u1 review of hpmor, Dan Weinreb's history of Symbolics and Lisp machines, this discussion of open vs. closed social networks, this discussion about the differences between SV and Boston, and Stanford and MIT, the comp.programming.threads FAQ, and this presentation about Microsoft culture from 2000.

P.S. If you enjoy this blog, you'd probably enjoy RC, which I've heard called "nerd camp for programmers".

Latency mitigation strategies (by John Carmack)

2013-03-05 08:00:00

This is an archive of an old article by John Carmack which seems to have disappeared off of the internet.

Abstract

Virtual reality (VR) is one of the most demanding human-in-the-loop applications from a latency standpoint. The latency between the physical movement of a user’s head and updated photons from a head mounted display reaching their eyes is one of the most critical factors in providing a high quality experience.

Human sensory systems can detect very small relative delays in parts of the visual or, especially, audio fields, but when absolute delays are below approximately 20 milliseconds they are generally imperceptible. Interactive 3D systems today typically have latencies that are several times that figure, but alternate configurations of the same hardware components can allow that target to be reached.

A discussion of the sources of latency throughout a system follows, along with techniques for reducing the latency in the processing done on the host system.

Introduction

Updating the imagery in a head mounted display (HMD) based on a head tracking sensor is a subtly different challenge than most human / computer interactions. With a conventional mouse or game controller, the user is consciously manipulating an interface to complete a task, while the goal of virtual reality is to have the experience accepted at an unconscious level.

Users can adapt to control systems with a significant amount of latency and still perform challenging tasks or enjoy a game; many thousands of people enjoyed playing early network games, even with 400+ milliseconds of latency between pressing a key and seeing a response on screen.

If large amounts of latency are present in the VR system, users may still be able to perform tasks, but it will be by the much less rewarding means of using their head as a controller, rather than accepting that their head is naturally moving around in a stable virtual world. Perceiving latency in the response to head motion is also one of the primary causes of simulator sickness. Other technical factors that affect the quality of a VR experience, like head tracking accuracy and precision, may interact with the perception of latency, or, like display resolution and color depth, be largely orthogonal to it.

A total system latency of 50 milliseconds will feel responsive, but still subtly lagging. One of the easiest ways to see the effects of latency in a head mounted display is to roll your head side to side along the view vector while looking at a clear vertical edge. Latency will show up as an apparent tilting of the vertical line with the head motion; the view feels “dragged along” with the head motion. When the latency is low enough, the virtual world convincingly feels like you are simply rotating your view of a stable world.

Extrapolation of sensor data can be used to mitigate some system latency, but even with a sophisticated model of the motion of the human head, there will be artifacts as movements are initiated and changed. It is always better to not have a problem than to mitigate it, so true latency reduction should be aggressively pursued, leaving extrapolation to smooth out sensor jitter issues and perform only a small amount of prediction.

Data collection

It is not usually possible to introspectively measure the complete system latency of a VR system, because the sensors and display devices external to the host processor make significant contributions to the total latency. An effective technique is to record high speed video that simultaneously captures the initiating physical motion and the eventual display update. The system latency can then be determined by single stepping the video and counting the number of video frames between the two events.
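As a quick worked example of the arithmetic (the frame numbers here are made up):

# Count the video frames between the frame where the physical motion
# starts and the frame where the display first changes, then convert
# the frame count to milliseconds.
def latency_ms(start_frame, end_frame, camera_fps=240):
    frames = end_frame - start_frame
    return frames * 1000.0 / camera_fps

# e.g. motion first visible at frame 1203, display update at frame 1215
print(latency_ms(1203, 1215))   # 50.0 ms at 240 fps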

In most cases there will be a significant jitter in the resulting timings due to aliasing between sensor rates, display rates, and camera rates, but conventional applications tend to display total latencies in the dozens of 240 fps video frames.

On an unloaded Windows 7 system with the compositing Aero desktop interface disabled, a gaming mouse dragging a window displayed on a 180 hz CRT monitor can show a response on screen in the same 240 fps video frame that the mouse was seen to first move, demonstrating an end to end latency below four milliseconds. Many systems need to cooperate for this to happen: The mouse updates 500 times a second, with no filtering or buffering. The operating system immediately processes the update, and immediately performs GPU accelerated rendering directly to the framebuffer without any page flipping or buffering. The display accepts the video signal with no buffering or processing, and the screen phosphors begin emitting new photons within microseconds.

In a typical VR system, many things go far less optimally, sometimes resulting in end to end latencies of over 100 milliseconds.

Sensors

Detecting a physical action can be as simple as watching a circuit close for a button press, or as complex as analyzing a live video feed to infer position and orientation.

In the old days, executing an IO port input instruction could directly trigger an analog to digital conversion on an ISA bus adapter card, giving a latency on the order of a microsecond and no sampling jitter issues. Today, sensors are systems unto themselves, and may have internal pipelines and queues that need to be traversed before the information is even put on the USB serial bus to be transmitted to the host.

Analog sensors have an inherent tension between random noise and sensor bandwidth, and some combination of analog and digital filtering is usually done on a signal before returning it. Sometimes this filtering is excessive, which can contribute significant latency and remove subtle motions completely.

Communication bandwidth delay on older serial ports or wireless links can be significant in some cases. If the sensor messages occupy the full bandwidth of a communication channel, latency equal to the repeat time of the sensor is added simply for transferring the message. Video data streams can stress even modern wired links, which may encourage the use of data compression, which usually adds another full frame of latency if not explicitly implemented in a pipelined manner.

Filtering and communication are constant delays, but the discretely packetized nature of most sensor updates introduces a variable latency, or “jitter” as the sensor data is used for a video frame rate that differs from the sensor frame rate. This latency ranges from close to zero if the sensor packet arrived just before it was queried, up to the repeat time for sensor messages. Most USB HID devices update at 125 samples per second, giving a jitter of up to 8 milliseconds, but it is possible to receive 1000 updates a second from some USB hardware. The operating system may impose an additional random delay of up to a couple milliseconds between the arrival of a message and a user mode application getting the chance to process it, even on an unloaded system.

Displays

On old CRT displays, the voltage coming out of the video card directly modulated the voltage of the electron gun, which caused the screen phosphors to begin emitting photons a few microseconds after a pixel was read from the frame buffer memory.

Early LCDs were notorious for “ghosting” during scrolling or animation, still showing traces of old images many tens of milliseconds after the image was changed, but significant progress has been made in the last two decades. The transition times for LCD pixels vary based on the start and end values being transitioned between, but a good panel today will have a switching time around ten milliseconds, and optimized displays for active 3D and gaming can have switching times less than half that.

Modern displays are also expected to perform a wide variety of processing on the incoming signal before they change the actual display elements. A typical Full HD display today will accept 720p or interlaced composite signals and convert them to the 1920×1080 physical pixels. 24 fps movie footage will be converted to 60 fps refresh rates. Stereoscopic input may be converted from side-by-side, top-down, or other formats to frame sequential for active displays, or interlaced for passive displays. Content protection may be applied. Many consumer oriented displays have started applying motion interpolation and other sophisticated algorithms that require multiple frames of buffering.

Some of these processing tasks could be handled by only buffering a single scan line, but some of them fundamentally need one or more full frames of buffering, and display vendors have tended to implement the general case without optimizing for the cases that could be done with low or no delay. Some consumer displays wind up buffering three or more frames internally, resulting in 50 milliseconds of latency even when the input data could have been fed directly into the display matrix.

Some less common display technologies have speed advantages over LCD panels; OLED pixels can have switching times well under a millisecond, and laser displays are as instantaneous as CRTs.

A subtle latency point is that most displays present an image incrementally as it is scanned out from the computer, which has the effect that the bottom of the screen changes 16 milliseconds later than the top of the screen on a 60 fps display. This is rarely a problem on a static display, but on a head mounted display it can cause the world to appear to shear left and right, or “waggle” as the head is rotated, because the source image was generated for an instant in time, but different parts are presented at different times. This effect is usually masked by switching times on LCD HMDs, but it is obvious with fast OLED HMDs.

Host processing

The classic processing model for a game or VR application is:

Read user input -> run simulation -> issue rendering commands -> graphics drawing -> wait for vsync -> scanout

I = Input sampling and dependent calculation
S = simulation / game execution
R = rendering engine
G = GPU drawing time
V = video scanout time

All latencies are based on a frame time of roughly 16 milliseconds, a progressively scanned display, and zero sensor and pixel latency.

If the performance demands of the application are well below what the system can provide, a straightforward implementation with no parallel overlap will usually provide fairly good latency values. However, if running synchronized to the video refresh, the minimum latency will still be 16 ms even if the system is infinitely fast. This rate feels good for most eye-hand tasks, but it is still a perceptible lag that can be felt in a head mounted display, or in the responsiveness of a mouse cursor.

Ample performance, vsync:
ISRG------------|VVVVVVVVVVVVVVVV|
.................. latency 16 – 32 milliseconds

Running without vsync on a very fast system will deliver better latency, but only over a fraction of the screen, and with visible tear lines. The impact of the tear lines are related to the disparity between the two frames that are being torn between, and the amount of time that the tear lines are visible. Tear lines look worse on a continuously illuminated LCD than on a CRT or laser projector, and worse on a 60 fps display than a 120 fps display. Somewhat counteracting that, slow switching LCD panels blur the impact of the tear line relative to the faster displays.

If enough frames were rendered such that each scan line had a unique image, the effect would be of a “rolling shutter”, rather than visible tear lines, and the image would feel continuous. Unfortunately, even rendering 1000 frames a second, giving approximately 15 bands on screen separated by tear lines, is still quite objectionable on fast switching displays, and few scenes are capable of being rendered at that rate, let alone 60x higher for a true rolling shutter on a 1080P display.

Ample performance, unsynchronized:
ISRG
VVVVV
..... latency 5 – 8 milliseconds at ~200 frames per second

In most cases, performance is a constant point of concern, and a parallel pipelined architecture is adopted to allow multiple processors to work in parallel instead of sequentially. Large command buffers on GPUs can buffer an entire frame of drawing commands, which allows them to overlap the work on the CPU, which generally gives a significant frame rate boost at the expense of added latency.

CPU:ISSSSSRRRRRR----|
GPU:                |GGGGGGGGGGG----|
VID:                |               |VVVVVVVVVVVVVVVV|
    .................................. latency 32 – 48 milliseconds

When the CPU load for the simulation and rendering no longer fit in a single frame, multiple CPU cores can be used in parallel to produce more frames. It is possible to reduce frame execution time without increasing latency in some cases, but the natural split of simulation and rendering has often been used to allow effective pipeline parallel operation. Work queue approaches buffered for maximum overlap can cause an additional frame of latency if they are on the critical user responsiveness path.

CPU1:ISSSSSSSS-------|
CPU2:                |RRRRRRRRR-------|
GPU :                |                |GGGGGGGGGG------|
VID :                |                |                |VVVVVVVVVVVVVVVV|
     .................................................... latency 48 – 64 milliseconds

Even if an application is running at a perfectly smooth 60 fps, it can still have host latencies of over 50 milliseconds, and an application targeting 30 fps could have twice that. Sensor and display latencies can add significant additional amounts on top of that, so the goal of 20 milliseconds motion-to-photons latency is challenging to achieve.

Latency Reduction Strategies

Prevent GPU buffering

The drive to win frame rate benchmark wars has led driver writers to aggressively buffer drawing commands, and there have even been cases where drivers ignored explicit calls to glFinish() in the name of improved “performance”. Today’s fence primitives do appear to be reliably observed for drawing primitives, but the semantics of buffer swaps are still worryingly imprecise. A recommended sequence of commands to synchronize with the vertical retrace and idle the GPU is:

SwapBuffers();
DrawTinyPrimitive();
InsertGPUFence();
BlockUntilFenceIsReached();

While this should always prevent excessive command buffering on any conformant driver, it could conceivably fail to provide an accurate vertical sync timing point if the driver was transparently implementing triple buffering.

To minimize the performance impact of synchronizing with the GPU, it is important to have sufficient work ready to send to the GPU immediately after the synchronization is performed. The details of exactly when the GPU can begin executing commands are platform specific, but execution can be explicitly kicked off with glFlush() or equivalent calls. If the code issuing drawing commands does not proceed fast enough, the GPU may complete all the work and go idle with a “pipeline bubble”. Because the CPU time to issue a drawing command may have little relation to the GPU time required to draw it, these pipeline bubbles may cause the GPU to take noticeably longer to draw the frame than if it were completely buffered. Ordering the drawing so that larger and slower operations happen first will provide a cushion, as will pushing as much preparatory work as possible before the synchronization point.

Run GPU with minimal buffering:
CPU1:ISSSSSSSS-------|
CPU2:                |RRRRRRRRR-------|
GPU :                |-GGGGGGGGGG-----|
VID :                |                |VVVVVVVVVVVVVVVV|
     ................................... latency 32 – 48 milliseconds

Tile based renderers, as are found in most mobile devices, inherently require a full scene of command buffering before they can generate their first tile of pixels, so synchronizing before issuing any commands will destroy far more overlap. In a modern rendering engine there may be multiple scene renders for each frame to handle shadows, reflections, and other effects, but increased latency is still a fundamental drawback of the technology.

High-end, multiple-GPU systems today are usually configured for AFR, or Alternate Frame Rendering, where each GPU is allowed to take twice as long to render a single frame, but the overall frame rate is maintained because there are two GPUs producing frames.

Alternate Frame Rendering dual GPU:
CPU1:IOSSSSSSS-------|IOSSSSSSS-------|
CPU2:                |RRRRRRRRR-------|RRRRRRRRR-------|
GPU1:                | GGGGGGGGGGGGGGGGGGGGGGGG--------|
GPU2:                |                | GGGGGGGGGGGGGGGGGGGGGGG---------|
VID :                |                |                |VVVVVVVVVVVVVVVV|
     .................................................... latency 48 – 64 milliseconds

Similarly to the case with CPU workloads, it is possible to have two or more GPUs cooperate on a single frame in a way that delivers more work in a constant amount of time, but it increases complexity and generally delivers a lower total speedup.

An attractive direction for stereoscopic rendering is to have each GPU on a dual GPU system render one eye, which would deliver maximum performance and minimum latency, at the expense of requiring the application to maintain buffers across two independent rendering contexts.

The downside to preventing GPU buffering is that throughput performance may drop, resulting in more dropped frames under heavily loaded conditions.

Late frame scheduling

Much of the work in the simulation task does not depend directly on the user input, or would be insensitive to a frame of latency in it. If the user processing is done last, and the input is sampled just before it is needed, rather than stored off at the beginning of the frame, the total latency can be reduced.

It is very difficult to predict the time required for the general simulation work on the entire world, but the work just for the player’s view response to the sensor input can be made essentially deterministic. If this is split off from the main simulation task and delayed until shortly before the end of the frame, it can remove nearly a full frame of latency.

Late frame scheduling:
CPU1:SSSSSSSSS------I|
CPU2:                |RRRRRRRRR-------|
GPU :                |-GGGGGGGGGG-----|
VID :                |                |VVVVVVVVVVVVVVVV|
                    .................... latency 18 – 34 milliseconds

Adjusting the view is the most latency sensitive task; actions resulting from other user commands, like animating a weapon or interacting with other objects in the world, are generally insensitive to an additional frame of latency, and can be handled in the general simulation task the following frame.

The drawback to late frame scheduling is that it introduces a tight scheduling requirement that usually requires busy waiting to meet, wasting power. If your frame rate is determined by the video retrace rather than an arbitrary time slice, assistance from the graphics driver in accurately determining the current scanout position is helpful.
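
A minimal sketch of late frame scheduling, assuming the frame deadline is known and the view update is small and nearly deterministic; SimulateWorld(), SampleHeadSensor(), SetPlayerView(), and SubmitRenderCommands() are hypothetical application hooks, and the 2 ms margin is only an illustration.

#include <chrono>

// Hypothetical application hooks; the names are placeholders, not a real API.
struct HeadPose { float orientation[4]; };
void SimulateWorld();
HeadPose SampleHeadSensor();
void SetPlayerView(const HeadPose &pose);
void SubmitRenderCommands();

using Clock = std::chrono::steady_clock;

// Run the bulk of the simulation early, then busy-wait until just before the
// deadline to sample the sensor and update only the player's view.
void RunFrame(Clock::time_point frameDeadline) {
    SimulateWorld();   // input-insensitive work; can take a variable amount of time

    // The margin must cover the (nearly deterministic) cost of the view update.
    const auto margin = std::chrono::microseconds(2000);
    while (Clock::now() < frameDeadline - margin) {
        // Busy wait. This burns power; a driver-supplied scanline query or a
        // precise timed wait is preferable where available.
    }

    SetPlayerView(SampleHeadSensor());   // freshest possible input, applied last
    SubmitRenderCommands();
}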

View bypass

An alternate way of accomplishing a similar, or slightly greater, latency reduction is to allow the rendering code to modify the parameters delivered to it by the game code, based on a newer sampling of user input.

At the simplest level, the user input can be used to calculate a delta from the previous sampling to the current one, which can be used to modify the view matrix that the game submitted to the rendering code.

Delta processing in this way is minimally intrusive, but there will often be situations where the user input should not affect the rendering, such as cinematic cut scenes or when the player has died. It can be argued that a game designed from scratch for virtual reality should avoid those situations, because a non-responsive view in an HMD is disorienting and unpleasant, but conventional game design has many such cases.

A binary flag could be provided to disable the bypass calculation, but it is useful to generalize such that the game provides an object or function with embedded state that produces rendering parameters from sensor input data, instead of having the game provide the view parameters themselves. In addition to handling the trivial case of ignoring sensor input, the generator function can incorporate additional information such as a head/neck positioning model that modifies position based on orientation, or lists of other models to be positioned relative to the updated view.

If the game and rendering code are running in parallel, it is important that the parameter generation function does not reference any game state to avoid race conditions.
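
A minimal sketch of the delta form of view bypass, assuming column-vector matrices, a view matrix that transforms world to eye space, and head-to-world sensor quaternions; Quat, Mat4, and the helper functions are hypothetical stand-ins for whatever math library the engine uses.

// Hypothetical math types and helpers (any quaternion/matrix library has equivalents).
struct Quat { float x, y, z, w; };
struct Mat4 { float m[16]; };
Quat Inverse(const Quat &q);
Quat operator*(const Quat &a, const Quat &b);   // composition: apply b, then a
Mat4 ToMat4(const Quat &q);                     // rotation matrix for q
Mat4 operator*(const Mat4 &a, const Mat4 &b);
Quat SampleHeadSensorOrientation();             // freshest head-to-world orientation

// Apply only the sensor delta accumulated since the game's own sample, so
// whatever the game did to its view (smoothing, clamping, or deliberately
// ignoring input) is preserved. Orientation only; the eye position is assumed
// unchanged between the two samples.
Mat4 BypassView(const Mat4 &gameView, const Quat &gameSampleOrientation) {
    Quat latest = SampleHeadSensorOrientation();
    Quat delta  = Inverse(gameSampleOrientation) * latest;  // rotation since the game sampled
    return ToMat4(Inverse(delta)) * gameView;               // fold the delta into the view
}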

View bypass:
CPU1:ISSSSSSSSS------|
CPU2:                |IRRRRRRRRR------|
GPU :                |--GGGGGGGGGG----|
VID :                |                |VVVVVVVVVVVVVVVV|
                      .................. latency 16 – 32 milliseconds

The input is only sampled once per frame, but it is simultaneously used by both the simulation task and the rendering task. Some input processing work is now duplicated by the simulation task and the render task, but it is generally minimal.

The latency for parameters produced by the generator function is now reduced, but other interactions with the world, like muzzle flashes and physics responses, remain at the same latency as the standard model.

A modified form of view bypass could allow tile based GPUs to achieve view latencies similar to those of non-tiled GPUs, or allow non-tiled GPUs to achieve 100% utilization without pipeline bubbles, by the following steps (a code sketch follows the steps):

Inhibit the execution of GPU commands, forcing them to be buffered. OpenGL has only the deprecated display list functionality to approximate this, but a control extension could be formulated.

All calculations that depend on the view matrix must reference it independently from a buffer object, rather than from inline parameters or as a composite model-view-projection (MVP) matrix.

After all commands have been issued and the next frame has started, sample the user input, run it through the parameter generator, and put the resulting view matrix into the buffer object for referencing by the draw commands.

Kick off the draw command execution.
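
A sketch of those steps, with the caveat the text already raises: standard OpenGL has no way to hold back execution of issued commands, so HoldCommandExecution() and ReleaseCommandExecution() below are hypothetical stand-ins for the control extension that would need to be formulated, and IssueAllDrawCommands()/GenerateViewFromSensor() are placeholder application functions. The uniform-buffer handling is ordinary GL.

#include <GL/glew.h>   // any GL function loader; assumes a current context

void HoldCommandExecution();               // hypothetical: buffer commands without running them
void ReleaseCommandExecution();            // hypothetical: start running the buffered commands
void IssueAllDrawCommands();               // every shader reads the view from uniform binding 0
void GenerateViewFromSensor(float *view);  // the parameter generator described earlier

GLuint viewUbo;   // uniform buffer holding one 4x4 view matrix, bound to binding point 0

void RenderFrameTilerBypass() {
    HoldCommandExecution();                // step 1: record, don't execute

    IssueAllDrawCommands();                // step 2: nothing bakes in an inline MVP

    // Step 3: the next frame has started; sample input as late as possible and
    // write the resulting view matrix into the buffer the draw commands reference.
    float view[16];
    GenerateViewFromSensor(view);
    glBindBuffer(GL_UNIFORM_BUFFER, viewUbo);
    glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(view), view);

    ReleaseCommandExecution();             // step 4: kick off the draw command execution
    glFlush();
}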

Tiler optimized view bypass:
CPU1:ISSSSSSSSS------|
CPU2:                |IRRRRRRRRRR-----|I
GPU :                |                |-GGGGGGGGGG-----|
VID :                |                |                |VVVVVVVVVVVVVVVV|
                                       .................. latency 16 – 32 milliseconds

Any view frustum culling that was performed to avoid drawing some models may be invalid if the new view matrix has changed substantially enough from what was used during the rendering task. This can be mitigated at some performance cost by using a larger frustum field of view for culling, and hardware clip planes based on the culling frustum limits can be used to guarantee a clean edge if necessary. Occlusion errors from culling, where a bright object is seen that should have been occluded by an object that was incorrectly culled, are very distracting, but a temporary clean encroaching of black at a screen edge during rapid rotation is almost unnoticeable.

Time warping

If you had perfect knowledge of how long the rendering of a frame would take, some additional amount of latency could be saved by late frame scheduling the entire rendering task, but this is not practical due to the wide variability in frame rendering times.

Late frame input sampled view bypass:
CPU1:ISSSSSSSSS------|
CPU2:                |----IRRRRRRRRR--|
GPU :                |------GGGGGGGGGG|
VID :                |                |VVVVVVVVVVVVVVVV|
                          .............. latency 12 – 28 milliseconds

However, a post processing task on the rendered image can be counted on to complete in a fairly predictable amount of time, and can be late scheduled more easily. Any pixel on the screen, along with the associated depth buffer value, can be converted back to a world space position, which can be re-transformed to a different screen space pixel location for a modified set of view parameters.

After drawing a frame with the best information at your disposal, possibly with bypassed view parameters, instead of displaying it directly, fetch the latest user input, generate updated view parameters, and calculate a transformation that warps the rendered image into a position that approximates where it would be with the updated parameters. Using that transform, warp the rendered image into an updated form on screen that reflects the new input. If there are two dimensional overlays present on the screen that need to remain fixed, they must be drawn or composited in after the warp operation, to prevent them from incorrectly moving as the view parameters change.
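
The warp itself reduces to a single matrix, sketched below; Mat4, Vec4, and Inverse() are hypothetical stand-ins for the engine's math library, and the depth remapping assumes the default GL depth range.

// Hypothetical math types (any column-major matrix library works).
struct Vec2 { float x, y; };
struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[16]; };
Mat4 Inverse(const Mat4 &m);
Mat4 operator*(const Mat4 &a, const Mat4 &b);
Vec4 operator*(const Mat4 &m, const Vec4 &v);

// One matrix reprojects any rendered pixel: back out through the
// view-projection the frame was drawn with, then forward through the
// view-projection built from the freshest sensor sample.
Mat4 BuildTimeWarpMatrix(const Mat4 &proj,
                         const Mat4 &renderedView,  // view the frame was drawn with
                         const Mat4 &latestView) {  // view from the newest sample
    return (proj * latestView) * Inverse(proj * renderedView);
}

// Warping one sample: x and y are NDC in [-1,1], z is the depth buffer value
// remapped to [-1,1]. In practice this runs per vertex of a warp grid or per
// fragment in a shader rather than per pixel on the CPU.
Vec2 WarpNdc(const Mat4 &warp, float x, float y, float z) {
    Vec4 p = warp * Vec4{x, y, z, 1.0f};
    return Vec2{p.x / p.w, p.y / p.w};   // new screen position of the old pixel
}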

Late frame scheduled time warp:
CPU1:ISSSSSSSSS------|
CPU2:                |RRRRRRRRRR----IR|
GPU :                |-GGGGGGGGGG----G|
VID :                |                |VVVVVVVVVVVVVVVV|
                                    .... latency 2 – 18 milliseconds

If the difference between the view parameters at the time of the scene rendering and the time of the final warp is only a change in direction, the warped image can be almost exactly correct within the limits of the image filtering. Effects that are calculated relative to the screen, like depth based fog (versus distance based fog) and billboard sprites will be slightly different, but not in a manner that is objectionable.

If the warp involves translation as well as direction changes, geometric silhouette edges begin to introduce artifacts where internal parallax would have revealed surfaces not visible in the original rendering. A scene with no silhouette edges, like the inside of a box, can be warped significant amounts and display only changes in texture density, but translation warping realistic scenes will result in smears or gaps along edges. In many cases these are difficult to notice, and they always disappear when motion stops, but first person view hands and weapons are a prominent case. This can be mitigated by limiting the amount of translation warp, compressing or making constant the depth range of the scene being warped to limit the dynamic separation, or rendering the disconnected near field objects as a separate plane, to be composited in after the warp.

If an image is being warped to a destination with the same field of view, most warps will leave some corners or edges of the new image undefined, because none of the source pixels are warped to their locations. This can be mitigated by rendering a larger field of view than the destination requires; but simply leaving unrendered pixels black is surprisingly unobtrusive, especially in a wide field of view HMD.

A forward warp, where source pixels are deposited in their new positions, offers the best accuracy for arbitrary transformations. At the limit, the frame buffer and depth buffer could be treated as a height field, but millions of half pixel sized triangles would have a severe performance cost. Using a grid of triangles at some fraction of the depth buffer resolution can bring the cost down to a very low level, and the trivial case of treating the rendered image as a single quad avoids all silhouette artifacts at the expense of incorrect pixel positions under translation.

Reverse warping, where the pixel in the source rendering is estimated based on the position in the warped image, can be more convenient because it is implemented completely in a fragment shader. It can produce identical results for simple direction changes, but additional artifacts near geometric boundaries are introduced if per-pixel depth information is considered, unless considerable effort is expended to search a neighborhood for the best source pixel.

If desired, it is straightforward to incorporate motion blur in a reverse mapping by taking several samples along the line from the pixel being warped to the transformed position in the source image.

Reverse mapping also allows the possibility of modifying the warp through the video scanout. The view parameters can be predicted ahead in time to when the scanout will read the bottom row of pixels, which can be used to generate a second warp matrix. The warp to be applied can be interpolated between the two of them based on the pixel row being processed. This can correct for the “waggle” effect on a progressively scanned head mounted display, where the 16 millisecond difference in time between the display showing the top line and bottom line results in a perceived shearing of the world under rapid rotation on fast switching displays.
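
As an illustration of that interpolation, the sketch below builds on the Mat4 and BuildTimeWarpMatrix() from the previous example; PredictView(), Lerp(), and WarpRow() are hypothetical, and a real implementation would do the blend per fragment in the warp shader rather than in a CPU loop.

Mat4 PredictView(double scanoutTime);              // view predicted for a given scanout time
Mat4 Lerp(const Mat4 &a, const Mat4 &b, float t);  // element-wise blend of two matrices
void WarpRow(int row, const Mat4 &warp);           // warp one row of source pixels

void WarpWithScanoutCorrection(const Mat4 &proj, const Mat4 &renderedView,
                               double topRowTime, double bottomRowTime, int rows) {
    Mat4 warpTop    = BuildTimeWarpMatrix(proj, renderedView, PredictView(topRowTime));
    Mat4 warpBottom = BuildTimeWarpMatrix(proj, renderedView, PredictView(bottomRowTime));

    for (int row = 0; row < rows; ++row) {
        float t = float(row) / float(rows - 1);       // 0 at the top row, 1 at the bottom
        WarpRow(row, Lerp(warpTop, warpBottom, t));   // adequate for small angular deltas
    }
}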

Continuously updated time warping

If the necessary feedback and scheduling mechanisms are available, instead of predicting what the warp transformation should be at the bottom of the frame and warping the entire screen at once, the warp to screen can be done incrementally while continuously updating the warp matrix as new input arrives.

Continuous time warp:
CPU1:ISSSSSSSSS------|
CPU2:                |RRRRRRRRRRR-----|
GPU :                |-GGGGGGGGGGGG---|
WARP:                |               W| W W W W W W W W|
VID :                |                |VVVVVVVVVVVVVVVV|
                                     ... latency 2 – 3 milliseconds for 500hz sensor updates

The ideal interface for doing this would be some form of “scanout shader” that would be called “just in time” for the video display. Several video game systems like the Atari 2600, Jaguar, and Nintendo DS have had buffers ranging from half a scan line to several scan lines that were filled up in this manner.

Without new hardware support, it is still possible to incrementally perform the warping directly to the front buffer being scanned for video, and not perform a swap buffers operation at all.

A CPU core could be dedicated to the task of warping scan lines at roughly the speed they are consumed by the video output, updating the time warp matrix each scan line to blend in the most recently arrived sensor information.

GPUs can perform the time warping operation much more efficiently than a conventional CPU can, but the GPU will be busy drawing the next frame during video scanout, and GPU drawing operations cannot currently be scheduled with high precision due to the difficulty of task switching the deep pipelines and extensive context state. However, modern GPUs are beginning to allow compute tasks to run in parallel with graphics operations, which may allow a fraction of a GPU to be dedicated to performing the warp operations as a shared parameter buffer is updated by the CPU.

Discussion

View bypass and time warping are complementary techniques that can be applied independently or together. Time warping can warp from a source image at an arbitrary view time / location to any other one, but artifacts from internal parallax and screen edge clamping are reduced by using the most recent source image possible, which view bypass rendering helps provide.

Actions that require simulation state changes, like flipping a switch or firing a weapon, still need to go through the full pipeline for 32 – 48 milliseconds of latency based on what scan line the result winds up displaying on the screen, and translational information may not be completely faithfully represented below the 16 – 32 milliseconds of the view bypass rendering, but the critical head orientation feedback can be provided in 2 – 18 milliseconds on a 60 hz display. In conjunction with low latency sensors and displays, this will generally be perceived as immediate. Continuous time warping opens up the possibility of latencies below 3 milliseconds, which may cross largely unexplored thresholds in human / computer interactivity.

Conventional computer interfaces are generally not as latency demanding as virtual reality, but sensitive users can tell the difference in mouse response down to the same 20 milliseconds or so, making it worthwhile to apply these techniques even in applications without a VR focus.

A particularly interesting application is in "cloud gaming", where a simple client appliance or application forwards control information to a remote server, which streams back real time video of the game. This offers significant convenience benefits for users, but the inherent network and compression latencies make it a lower quality experience for action oriented titles. View bypass and time warping can both be performed on the server, regaining a substantial fraction of the latency imposed by the network. If the cloud gaming client were made more sophisticated, time warping could be performed locally, which could theoretically reduce the latency to the same levels as local applications, but it would probably be prudent to restrict the total amount of time warping to perhaps 30 or 40 milliseconds to limit the distance from the source images.

Acknowledgements

Zenimax for allowing me to publish this openly.

Hillcrest Labs for inertial sensors and experimental firmware.

Emagin for access to OLED displays.

Oculus for a prototype Rift HMD.

Nvidia for an experimental driver with access to the current scan line number.

Kara Swisher interview of Jack Dorsey

2013-02-12 08:00:00

This is a transcript of the Kara Swisher / Jack Dorsey interview from 2/12/2019, made by parsing the original Tweets because I wanted to be able to read this linearly. There's a "moment" that tries to track this, but since it doesn't distinguish between sub-threads in any way, you can't tell the difference between the end of a thread and a normal reply. This linearization of the interview marks each thread break with a page break and provides some context from upthread where relevant (in grey text).

Kara: Here in my sweatiest @soulcycle outfit for my Twitterview with @jack with @Laur_Katz at the ready @voxmediainc HQ. Also @cheezit acquired. #karajack

Kara: Oh hai @jack. Let’s set me set the table. First, I am uninterested in beard amulets or weird food Mark Zuckerberg served you (though WTF with both for my personal self). Second, I would appreciate really specific answers.

Jack: Got you. Here’s my setup. I work from home Tuesdays. In my kitchen. Tweetdeck. No one here with me, and no one connected to my tweetdeck. Just me focused on your questions!

Kara: Great, let's go

Jack: Ready


Kara: As @ashleyfeinberg wrote: “press him for a clear, unambiguous example of nearly anything, and Dorsey shuts down.” That is not unfair characterization IMHO. Third, I will thread in questions from audience, but to keep this non chaotic, let’s stay in one reply thread.

Jack: Deal


Kara: To be clear with audience, there is not a new event product, a glass house, if you will, where people can see us but not comment. I will ask questions and then respond to @jack answers. So it could be CHAOS.

Jack: To be clear, we’re interested in an experience like this. Nothing built yet. This gives us a sense of what it would be like, and what we’d need to focus on. If there’s something here at all!

Kara: Well an event product WOULD BE NICE. See my why aren't you moving faster trope.


Kara: Overall here is my mood and I think a lot of people when it comes to fixing what is broke about social media and tech: Why aren’t you moving faster? Why aren’t you moving faster? Why aren’t you moving faster?

Jack: A question we ask ourselves all the time. In the past I think we were trying to do too much. We’re better at prioritizing by impact now. Believe the #1 thing we should focus on is someone’s physical safety first. That one statement leads to a lot of ramifications.

Kara: It seems twitter has been stuck in a stagnant phase of considering/thinking about the health of the conversation, which plays into safety, for about 18-24 months. How have you made actual progress? Can you point me to it SPECIFICALLY?


Kara: You know my jam these days is tech responsibility. What grade do you gave Silicon Valley? Yourself?

Jack: Myself? C. We’ve made progress, but it has been scattered and not felt enough. Changing the experience hasn’t been meaningful enough. And we’ve put most of the burden on the victims of abuse (that’s a huge fail).

Kara: Well that is like telling me I am sick and am responsible for fixing it. YOU made the product, YOU run the platform. Saying it is a huge fail is a cop out to many. It is to me

Jack: Putting the burden on victims? Yes. It’s recognizing that we have to be proactive in enforcement and promotion of healthy conversation. This is our first priority in #health. We have to change a lot of the fundamentals of product to fix.

Kara: please be specific. I see a lot of beard-stroking on this (no insult to your Lincoln jam, but it works). WHAT are you changing? SPECIFICALLY.

Jack: First and foremost we’re looking at ways to proactively enforce and promote health. So that reporting/blocking is a last resort. Problem we’re trying to solve is taking that work away.

Kara: Ok name three initiatives.


Jack: Myself? C. We’ve made progress, but it has been scattered and not felt enough. Changing the experience hasn’t been meaningful enough. And we’ve put most of the burden on the victims of abuse (that’s a huge fail).

Kara: Also my son gets a C in coding and that is NO tragedy. You getting one matters a lot.

Jack: Agree it matters a lot. And it’s the most important thing we need to address and fix. I’m stating that it’s a fail of ours to put the majority of burden on victims. That’s how the service works today.

Kara: Ok but I really want to drill down on HOW. How much downside are you willing to tolerate to balance the good that Twitter can provide? Be specific

Jack: This is exactly the balance we have to think deeply about. But in doing so, we have to look at how the product works. And where abuse happens the most: replies, mentions, search, and trends. Those are the shared spaces people take advantage of

Kara: Well, WHERE does abuse happen most

Jack: Within the service? Likely within replies. That’s why we’ve been more aggressive about proactively downranking behind interstitials, for example.

Kara: Why not just be more stringent on kicking off offenders? It seems like you tolerate a lot. If Twitter ran my house, my kids would be eating ramen, playing Red Dead Redemption 2 and wearing filthy socks

Jack: We action all we can against our policies. Most of our system today works reactively to someone reporting it. If they don’t report, we don’t see it. Doesn’t scale. Hence the need to focus on proactive

Kara: But why did you NOT see it? It seems pretty basic to run your platform with some semblance of paying mind to what people are doing on it? Can you give me some insight into why that was not done?

Jack: I think we tried to do too much in the past, and that leads to diluted answers and nothing impactful. There’s a lot we need to address globally. We have to prioritize our resources according to impact. Otherwise we won’t make much progress.

Kara: Got it. But do you think the fact that you all could not conceive of what it is to feel unsafe (women, POC, LGBTQ, other marginalized people) could be one of the issues? (new topic soon)

Jack: I think it’s fair and real. No question. Our org has to be reflective of the people we’re trying to serve. One of the reason we established the Trust and Safety council years ago, to get feedback and check ourselves.

Kara: Yes but i want THREE concrete examples.


Jack: First and foremost we’re looking at ways to proactively enforce and promote health. So that reporting/blocking is a last resort. Problem we’re trying to solve is taking that work away.

Kara: Or maybe, tell me what you think the illness is you are treating? I think you cannot solve a disease without knowing that. Or did you create the virus?

Jack: Good question. This is why we’re focused on understanding what conversational health means. We see a ton of threats to health in digital conversation. We’re focuse first on off-platform ramifications (physical safety). That clarifies priorities of policy and enforcement.

Kara: I am still confused. What the heck is "off-platform ramifications"? You are not going to have a police force, right? Are you 911?

Jack: No, not a police force. I mean we have to consider first and foremost what online activity does to impact physical safety, as a way to prioritize our efforts. I don’t think companies like ours have admitted or focused on that enough.

Kara: So you do see the link between what you do and real life danger to people? Can you say that explicitly? I could not be @finkd to even address the fact that he made something that resulted in real tragedy.

Jack: I see the link, and that’s why we need to put physical safety above all else. That’s what we’re figuring out how to do now. We don’t have all the answers just yet. But that’s the focus. I think it clarifies a lot of the work we need to do. Not all of it of course.

Kara: I grade you all an F on this and that's being kind. I'm not trying to be a jackass, but it's been a very slow roll by all of you in tech to pay attention to this. Why do you think that is? I think it is because many of the people who made Twitter never ever felt unsafe.

Jack: Likely a reason. I’m certain lack of diversity didn’t help with empathy of what people experience on Twitter every day, especially women.

Kara: And so to end this topic, I will try again. Please give me three concrete things you have done to fix this. SPECIFIC.

Jack: 1. We have evolved our polices. 2. We have prioritized proactive enforcement to remove burden from victims 3. We have given more control in product (like mute of accounts without profile pics or associated phone/emails) 4. Much more aggressive on coordinated behavior/gaming

Kara: 1. WHICH? 2. HOW? 3. OK, MUTE BUT THAT WAS A WHILE AGO 4. WHAT MORE? I think people are dying for specifics.

Jack: 1. Misgendering policy as example. 2. Using ML to downrank bad actors behind interstitials 3. Not too long ago, but most of our work going forward will have to be product features. 4. Not sure the question. We put an entire model in place to minimize gaming of system.

Kara: thx. I meant even more specifics on 4. But see the Twitter purge one.

Jack: Just resonded to that. Don’t see the twitter purge one

Kara: I wanted to get off thread with Mark added! Like he needs more of me.

Jack: Does he check this much?

Kara: No, he is busy fixing Facebook. NOT! (he makes you look good)

Kara: I am going to start a NEW thread to make it easy for people to follow (@waltmossberg just texted me that it is a "chaotic hellpit"). Stay in that one. OK?

Jack: Ok. Definitely not easy to follow the conversation. Exactly why we are doing this. Fixing stuff like this will help I believe.

Kara: Yeah, it's Chinatown, Jake.


Jack: First and foremost we’re looking at ways to proactively enforce and promote health. So that reporting/blocking is a last resort. Problem we’re trying to solve is taking that work away.

Jack: Second, we’re constantly evolving our policies to address the issues we see today. We’re rooting them in fundamental human rights (UN) and putting physical safety as our top priority. Privacy next.

Kara: When you say physical safety, I am confused. What do you mean specifically? You are not a police force. In fact, social media companies have built cities without police, fire departments, garbage pickup or street signs. IMHO What do you think of that metaphor?

Jack: I mean off platform, offline ramifications. What people do offline with what they see online. Doxxing is a good example which threatens physical safety. So does coordinate harassment campaigns.

Kara: So how do you stop THAT? I mean regular police forces cannot stop that. It seems your job is not to let it get that far in the first place.

Jack: Exactly. What can we do within the product and policy to lower probability. Again, don’t think we or others have worked against that enough.


Kara: Ok, new one @jack

What do you think about twitter breaks and purges. Why do you think that is? I can’t say I’ve heard many people say they feel “good” after not being on twitter for a while: https://twitter.com/TaylorLorenz/status/1095039347596898305

Jack: Feels terrible. I want people to walk away from Twitter feeling like they learned something and feeling empowered to some degree. It depresses me when that’s not the general vibe, and inspires me to figure it out. That’s my desire

Kara: But why do they feel that way? You made it.

Jack: We made something with one intent. The world showed us how it wanted to use it. A lot has been great. A lot has been unexpected. A lot has been negative. We weren’t fast enough to observe, learn, and improve


Kara: Ok, new one @jack

Kara: In that vein, how does it affect YOU?

Jack: I also don’t feel good about how Twitter tends to incentivize outrage, fast takes, short term thinking, echo chambers, and fragmented conversation and consideration. Are they fixable? I believe we can do a lot to address. And likely have to change more fundamentals to do so.

Kara: But you invented it. You can control it. Slowness is not really a good excuse.

Jack: It’s the reality. We tried to do too much at once and were not focused on what matters most. That contributes to slowness. As does our technology stack and how quickly we can ship things. That’s improved a lot recently


Kara: Ok trying AGAIN @jack in another new thread! This one about @realDonaldTrump:

We know a lot more about what Donald Trump thinks because of Twitter, and we all have mixed feelings about that.

Kara: Have you ever considered suspending Donald Trump? His tweets are somewhat protected because he’s a public figure, but would he have been suspended in the past if he were a “regular” user?

Jack: We hold all accounts to the same terms of service. The most controversial aspect of our TOS is the newsworthy/public interest clause, the “protection” you mention. That doesn’t extend to all public figures by default, but does speak to global leaders and seeing how they think.

Kara: That seems questionable to a lot of people. Let me try it a different way: What historic newsworthy figure would you ban? Is someone bad enough to ban. Be specific. A name.

Jack: We have to enforce based on our policy and what people do on our service. And evolve it with the current times. No way I can answer that based on people. Has to be focused on patterns of how people use the technology.

Kara: Not one name? Ok, but it is a copout imho. I have a long list.

Jack: I think it’s more durable to focus on use cases because that allows us to act broader. Likely that these aren’t isolated cases but things that spread

Kara: it would be really great to get specific examples as a lot of what you are doing appears incomprehensible to many.


Kara: Ok trying AGAIN @jack in another new thread! This one about @realDonaldTrump:

Kara: And will Twitter’s business/engagement suffer when @realDonaldTrump is no longer President?

Jack: I don’t believe our service or business is dependent on any one account or person. I will say the number of politics conversations has significantly increased because of it, but that’s just one experience on Twitter. There are multiple Twitters, all based on who you follow.

Kara: Ok new question (answer the newsworthy historical figure you MIGHT ban pls): Single biggest improvement at Twitter since 2016 that signals you’re ready for the 2020 elections?

Jack: Our work against automations and coordinated campaigns. Partnering with government agencies to improve communication around threats

Kara: Can you give a more detailed example of that that worked?

Jack: We shared a retro on 2018 within this country, and tested a lot with the Mexican elections too. Indian elections coming up. In mid-terms we were able to monitor efforts to disrupt both online and offline and able to stop those actions on Twitter.


Kara: Ok new question (answer the newsworthy historical figure you MIGHT ban pls): Single biggest improvement at Twitter since 2016 that signals you’re ready for the 2020 elections?

Kara: What confidence should we have that Russia or other state-sponsored actors won’t be able to wreak havoc on next year’s elections?

Jack: We should expect a lot more coordination between governments and platforms to address. That would give me confidence. And have some skepticism too. That’s healthy. The more we can do this work in public and share what we find, the better

Kara: I still am dying for specifics here. [meme image: Give me some specifics. I love specifics, the specifics were the best part!]


Jack: I think it’s more durable to focus on use cases because that allows us to act broader. Likely that these aren’t isolated cases but things that spread

Kara: going to shift to biz questions since it is not a lot of time and this system is CHAOTIC (as I thought it would be): What about the move to DAU instead of MAU. Why the move? And how are we to interpret the much smaller numbers?

Jack: We want to be valuable to people daily. Not monthly. It’s a higher bar for ourselves. Sure, it looks like a smaller absolute number, but the folks we have using Twitter are some of the most influential in the world. They drive conversation. We belevie we can best grow this.

Kara: Ok, then WHO is the most exciting influential on Twitter right now? BE SPECIFIC

Jack: To me personally? I like how @elonmusk uses Twitter. He’s focused on solving existential problems and sharing his thinking openly. I respect that a lot, and all the ups and downs that come with it

Kara: What about @AOC

Jack: Totally. She’s mastering the medium

Kara: She is well beyond mastering it. She speaks fluent Twitter.

Jack: True

Kara: Also are you ever going to hire someone to effectively be your number 2?

Jack: I think it’s better to spread that responsibility across multiple people. It creates less dependencies and the company gets more options around future leadership


Kara: going to shift to biz questions since it is not a lot of time and this system is CHAOTIC (as I thought it would be): What about the move to DAU instead of MAU. Why the move? And how are we to interpret the much smaller numbers?

Kara: Also: How close were you to selling Twitter in 2016? What happened?

What about giving the company to a public trust per your NYT discussion.

Jack: We ultimately decided we were better off independent. And I’m happy we did. We’ve made a lot of progress since that point. And we got a lot more focused. Definitely love the idea of opening more to 3rd parties. Not sure what that looks like yet. Twitter is close to a protocol.

Kara: Chop chop on the other answers! I have more questions! If you want to use this method, quicker!

Jack: I’m moving as fast as I can Kara

Kara: Clip clop!


Kara: going to shift to biz questions since it is not a lot of time and this system is CHAOTIC (as I thought it would be): What about the move to DAU instead of MAU. Why the move? And how are we to interpret the much smaller numbers?

Kara: also: Is twitter still considering a subscription service? Like “Twitter Premium” or something?

Jack: Always going to experiment with new models. Periscope has super hearts, which allows us to learn about direct contribution. We’d need to figure out the value exchange on subscription. Has to be really high for us to charge directly


Jack: Totally. She’s mastering the medium

Kara: Ok, last ones are about you and we need to go long because your system here it confusing says the people of Twitter:

  1. What has been Twitter’s biggest missed opportunity since you came back as CEO?

Jack: Focus on conversation earlier. We took too long to get there. Too distracted.

Kara: By what? What is the #1 thing that distracted you and others and made this obvious mess via social media?

Jack: Tried to do too much at once. Wasn’t focused on what our one core strength was: conversation. That lead to really diluted strategy and approach. And a ton of reactiveness.

Kara: Speaking of that (CONVERSATION), let's do one with sounds soon, like this

https://www.youtube.com/watch?v=oiJkANps0Qw


Kara: She is well beyond mastering it. She speaks fluent Twitter.

Jack: True

Kara: Why are you still saying you’re the CEO of two publicly traded companies? What’s the point in insisting you can do two jobs that both require maximum effort at the same time?

Jack: I’m focused on building leadership in both. Not my desire or ambition to be CEO of multiple companies just for the sake of that. I’m doing everything I can to help both. Effort doesn’t come down to one person. It’s a team


Kara: LAST Q: For the love of God, please do Recode Decode podcast with me soon, because analog talking seems to be a better way of asking questions and giving answers. I think Twitter agrees and this has shown how hard this thread is to do. That said, thx for trying. Really.

Jack: This thread was hard. But we got to learn a ton to fix it. Need to make this feel a lot more cohesive and easier to follow. Was extremely challenging. Thank you for trying it with me. Know it wasn’t easy. Will consider different formats!

Kara: Make a glass house for events and people can watch and not throw stones. Pro tip: Twitter convos are wack

Jack: Yep. And they don’t have to be wack. Need to figure this out. This whole experience is a problem statement for what we need to fix


Jack: This thread was hard. But we got to learn a ton to fix it. Need to make this feel a lot more cohesive and easier to follow. Was extremely challenging. Thank you for trying it with me. Know it wasn’t easy. Will consider different formats!

Kara: My kid is hungry and says that you should do a real interview with me even if I am mean. Just saying.

Jack: I don’t think you’re mean. Always good to experiment.

Kara: Neither does my kid. He just wants to go get dinner

Jack: Go eat! Thanks, Kara

Are closed social networks inevitable?

2010-01-01 08:00:00

This is an archive of an old Google Buzz conversation (circa 2010?) on a variety of topics, including whether or not it's inevitable that a closed platform will dominate social.

Piaw: Social networks will be dominated primarily by network effects. That means in the long run, Facebook will dominate all the others.

Rebecca: ... which is also why no one company should dominate it. "The Social Graph" and its associated apps should be like the internet, distributed and not confined to one company's servers. I wish the narrative surrounding this battle was centered around this idea, and not around the whole Silicon Valley "who is the most genius innovator" self-aggrandizing unreality field. Thank god Tim Berners Lee wasn't from Silicon Valley, or we wouldn't have the Internet the way we know it in the first place.

I suppose I shouldn't be being so snarky, revealing how much I hate your narratives sometimes. But I think for once this isn't, as it usually is, merely harmlessly cute and endearing - you all collectively screwing up something actually important, and I'm annoyed.

Piaw: The way network effects work, one company will control it. It's inevitable.

Rebecca: No it is not inevitable! What is inevitable is either that one company controls it or that no company controls it. If you guys had been writing the narrative of the invention of the internet you would have been arguing that it was inevitable that the entire internet live on one companies servers, brokered by hidden proprietary protocols. And obviously that's just nuts.

Piaw: I see, the social graph would be collectively owned. That makes sense, but I don't see why Facebook would have an incentive to make that happen.

Rebecca: Of course not! That's why I'm biting my fingernails hoping for some other company to be the white knight and ride up and save the day, liberating the social graph (or more precisely, the APIs of the apps that live on top of them) from any hope of control by a single company. Of course, there isn't a huge incentive for any other company to do it either --- the other companies are likely to just gaze enviously at Facebook and wish they got there first. Tim Berners Lee may have done great stuff for the world, but he didn't get rich or return massive value to shareholders, so the narrative of the value he created isn't included in the standard corporate hype machine or incentives.

Google is the only company with the right position, somewhat appropriate incentives, and possibly the right internal culture to be the "Tim Berners Lee" of the new social internet. That's what I was hoping for, and I'm am more than a bit bummed they don't seem to be stepping up to the plate in an effective way in this case, especially since they are doing such a fabulous job in an analogous role with Android.

Rebecca: There is a worldview lurking behind these comments, which perhaps I should try to explain. I'm been nervous about this because it contains some strange ideas, but I'm wondering what you think.

Here's a very strange assertion: Mark Zuckerberg is not a capitalist, and therefore should not be judged by capitalist logic. Before you dismiss me as nuts, stop and think for a minute. What is the essential property that makes someone a capitalist?

For instance, when Nike goes to Indonesia and sets up sweatshops there, and if communists, unhappy with low pay & terrible conditions, threaten to rebel, they are told "this is capitalism, and however noxious it is, capitalism will make us rich, so shut up and hold your peace!" What are they really saying? Nike brings poor people sewing machines and distribution networks so they can make and sell things they could not make and sell otherwise, so they become more productive and therefore richer. The productive capacity is scarce, and Nike is bringing to Indonesia a piece of this scarce resource (and taking it away from other people, like American workers.) So Indonesia gets richer, even if sweatshop workers suffer for a while.

So is Mark Zuckerberg bringing to American workers a piece of scarce productive capacity, and therefore making American workers richer? It is true he is creating for people productive capacity they did not have before --- the possibility of writing social apps, like social games. This is "innovation" and it does make us richer.

But it is not wealth that follows the rules of capitalist logic. In particular, this kind of wealth of productive capacity, unlike the wealth created by setting up sewing machines, does not have the kind of inherent scarcity that fundamentally drives capitalist logic. Nike can set up its sewing machines for Americans, or Indonesians, but not for everyone at once. But Tim Berners Lee is not forced to make such choices -- he can design protocols that allow everyone everywhere to produce new things, and he need not restrict how they choose to do it.

But -- here's the key point -- though there is no natural scarcity, there may well be "artificial" scarcity. Microsoft can obfuscate Windows API's, and bind IE to Windows. Facebook can close the social graph, and force all apps to live on its servers. "Capitalists" like these can then extract rents from this artificial scarcity. They can use the emotional appeal of capitalist rhetoric to justify their rent-seeking. People are very conditioned to believe that when local companies get rich American workers in general will also get rich -- it works for Indonesia so why won't it work for us? And Facebook and Microsoft employees are getting richer. QED.

But they aren't getting richer in the same way that sweatshop employees are getting richer. The sweatshop employees are becoming more productive than they would otherwise be, in spite of the noxious behavior of the capitalists. But if Zuckerberg or Gates behaves noxiously, by creating a walled garden, this may make his employees richer, in the sense of giving them more money, but "more money" is not the same as "more wealth." More wealth means more productive capacity for workers, not more payout to individual employees. In a manufacturing economy those are linked, so people forget they are not the same.

And in fact, shenanigans like these reduce rather than increase the productive capacity available to everyone, by creating an artificial scarcity of a kind of productive tool that need not be scarce at all, just for the purpose of extracting rents from them. No real wealth comes from this extraction. In aggregate it makes us poorer rather than richer.

Here's where the kind of stunt that Google pulled with Android, that broke the iPhone's lock, even if it made Google no money, should be seen as the real generator of wealth, even if it is unclear whether it made any money for Google's shareholders. Wealth means I can build something I couldn't build before -- if I want I can install a Scheme or Haskell interpreter on an Android phone, which I am forbidden to put on the iPhone. It means a lot to me! Google's support of Firefox & Chrome, which sped the adoption of open web standards and HTML5, also meant a huge amount to me. I'm an American worker, and I am made richer by Google, in the sense of having more productive capacity available to me, even if Google wasn't that great for my personal wealth in the sense of directly increasing my salary.

Rebecca: (That idea turned out to be sortof shockingly long comment by itself, and on review the last two paragraphs of the original comment were a slightly different thought, so I'm breaking them into a different section.)

I'm upset that Google is getting a lot of anti-trust type flak, when I think the framework of anti-trust is just the wrong way to think. This battle isn't analogous to Roosevelt's big trust busting battles; it is much more like the much earlier battles at the beginning of the industrial revolution of the Yankee merchants against the old agricultural, aristocratic interests, which would have squelched industrialization. And Google is the company that has been most consistently on the side of really creating wealth, by not artificially limiting the productivity they make available for developers everywhere. Other companies, like Microsoft or Facebook, though they are "new economy," though they are "innovative," though they seem to generate a lot of "wealth" in the form of lots of money, really are much more like the old aristocrats rather than the scrappy new Yankees. In many ways they are slowing down the real revolution, not speeding it up.

I've been reluctant to talk too much about these ideas, because I'm anxious about being called a raving commie. But I'm upset that Google is the target of misguided anti-trust logic, and it might be sufficiently weakened that it can't continue to be the bulwark of defense against the real "new economy" abuses that it has been for the last half-decade. That defense has meant a lot to independent developers, and I would hate to see it go away.

Phil: +100, Rebecca. It is striking how little traction the rhetoric of software freedom has here in Silicon Valley relative to pretty much everywhere else in the world.

Rebecca: Thanks - I worry whether my ultra-long comments are spam and its good to hear if someone appreciates them. I have difficulty making my ideas short, but I'm using this Buzz conversation to practice.

I'm am not entirely happy with the way the "software freedom" crowd is pitching their message. I had an office down the hall from Richard Stallman for a while, and I was often harangued by him. However, I thought his message was too narrow and radicalized. But on the other hand, when I thought about it hard, I also realized that in many ways it was not radical enough...

Why are we talking about freedom? To motivate this, I sometimes tell a little story about the past. When I was young my father read to me "20,000 Leagues Under the Sea," advertising it as a great classic of futuristic science fiction. Unfortunately, I was unimpressed. It didn't seem "futuristic" at all: it seemed like an archaic fantasy. Why? Certainly it was impressive that an author in 1869 correctly predicted that people would ride in submarines under the sea. But it didn't seem like an image of the future, or even the past, at all. Why not? Because the person riding around on the submarine under the sea was a Victorian gentleman surrounded by appropriately deferential Victorian servants.

Futurists consistently get their stories wrong in a particular way: when they say that technology changes the world, they tell stories of fabulous gadgets that will enable people to do new and exciting things. They completely miss that this is not really what "change" -- serious, massive, wrenching, social change - really is. When technology truly enables dreams of change, it doesn't mean it enables aristocrats to dream about riding around under the sea. What it means is that enables the aristocrat's butler to dream of not being a butler any more --- a dream of freedom not through violence or revolution, but through economic independence. A dream of technological change -- really significant technological change -- is not a dream of spiffy gadgets, it is a dream of freedom, of social & economic liberation enabled by technology.

Lets go back to our Indonesian sweatshop worker. Even though in many ways the sweatshop job liberates her --- from backbreaking work on a farm, a garbage dump, or in brothels -- she is also quite enslaved. Why? She can sew, let us say, high-end basketball sneakers, which Nike sells for $150 apiece -- many, many times her monthly or even yearly wage. Why is she getting a small cut of the profit from her labors? Because she is dependent on the productive capacity that Nike is providing to her, so bad as the deal is, it is the best she can get.

This is where new technology comes in. People talk about the information revolution as if it is about computers, or software, but I would really say it is about society figuring out (slowly) how to automate organization. We have learned to effectively automate manufacturing, but not all of the work of a modern economy is manufacturing. What is the service Nike provides that makes this woman dependent on such a painful deal? Some part of this service is the manufacturing capacity they provide -- the sewing machine -- but sewing machines are hardly expensive or unobtainable, even for poor people. The much bigger deal is the organizational services Nike offers: all the branding, logistics, supply-chain management and retail services that go into getting a sneaker sewn in Indonesia into the hands of an eager consumer in America. One might argue that Nike is performing these services inefficiently, so even if our seamstress is effective and efficient, Nike must take an unreasonably large cut of the profits from the sale of the sneaker to support the rest of this inefficient, expensive, completely un-automated effort.

That's where technological change comes in. Slowly, it is making it possible for all these organizational services to be made more automated, streamlined and efficient. This is really the business Google is in. It is said that Google is an "advertising" business, but to call what Google does "advertising" is to paper over the true story of the profound economic shift of which they are merely providing the opening volley.

Consider the maker of custom conference tables who recently blogged in the New York Times about Adwords (http://boss.blogs.nytimes.com/2010/12/03/adwords-and-me-exploring-the-mystery/). He said he paid Google $75,124.77 last year. What does that money represent -- what need is Google filling which is worth more than seventy thousand a year to this guy? You might say that they are capturing an advertising budget of a company, until you consider that without Google this company wouldn't exist at all. Before Google, did you regularly stumble across small businesses making custom conference tables? This is a new phenomenon! The right way to see it is that this seventy thousand isn't really a normal advertising budget -- instead, think of it as a chunk of the revenue of the generic conference table manufacturer that this person no longer has to work for. Because Google is providing for him the branding, customer management services, etc, etc that this old company used to be doing much less efficiently and creatively, this blogger has a chance to go into business for himself. He is paying Google seventy thousand a year for this privilege, but this is probably much less than the cut that was skimmed off the profits of his labors by his old management (not to mention issues of control and stifled creativity he is escaping). Google isn't selling "advertising": Google is selling freedom. Google is selling to the workers of the world the chance to rid themselves of their chains -- nonviolently and without any revolutionary rhetoric -- but even without the rhetoric this service is still about economic liberation and social change.

I feel strange when I hear Eric Schmidt talk about Google's plans for the future of their advertising business, because he seems to be telling Wall Street of a grand future where Google will capture a significant portion of Nike's advertising budget (with display ads and such). This seems like both an overambitious fantasy; and also strangely not nearly ambitious enough. For I think the real development of Google's business -- not today, not tomorrow, not next year, not even next decade, but eventually and inexorably (assuming Google survives the vicissitudes of fate and cultural decay) -- isn't that Google captures Nike's advertising budget. It is that Google captures a significant portion of Nike's entire revenue, paid to them by the workers who no longer have to work for Nike anymore, because Google or a successor in Google's tradition provides them with a much more efficient and flexible alternative vendor for the services Nike's management currently provides.

Rebecca: (Once again I looked at my comment and realized it was even more horrifyingly long. My thoughts seem short in my head, but even when I try to write them down as fast and effectively as I can, they aren't short anymore! Again, I saw the comment has two parts: first, explaining the basic idea of the "freedom" we are talking about, and second, tying it back into the context of our original discussion. So to try to be vaguely reasonable I am cutting it in two.)

I suppose Eric Schmidt will never stand in front of Wall Street and say that. When it is really true that "We will bury you!" nobody ever stands up and bangs a shoe on the table while saying it. The architects of the old "new economy" didn't say such things either: the Yankee merchants never said to their aristocratic political rivals that they intended to eventually completely dismantle their social order. In 1780 there was no discussion that foretold the destructive violence of Sherman's march to the sea. I'm not sure they knew it themselves, and if they had been told that that was a possible result of their labors they might not have wanted to hear it. The new class wanted change, they wanted opportunity, they wanted freedom, but they did not want blood! That they would be cornered into seeking it anyway would have been as horrifying to them as to anyone else. Real change is not something anyone wants to crow about --- it is too terrifying.

But it is nonetheless important to face, because in the short term this transformation is hardly inevitable or necessarily smooth. If our equivalent of Sherman's march to the sea could be in our future, we might want to think about how to manage or avoid it before it is too late.

One major difficulty, as I explained in the last comment, is that while the "automation of information," if developed properly, has the potential to break fundamental laws of the scarcity of productive capacity, and thereby free "the workers of the world", nonetheless that potential can be captured, and turned into "artificial" scarcity, which doesn't set workers free but further enslaves them. There is also a big incentive to do this, because it is the easiest way to make massive amounts of money quickly for a person in the right place at the right time.

I see Microsoft as a company that has made a definite choice of corporate strategy to make money on "artificial scarcity." I see Google as a company that has made a similar definite choice to make money "selling freedom", specifically avoiding the tricks that create artificial scarcity, even when it doesn't help or even hurts their immediate business prospects.

And Facebook? Where is Sheryl Sandberg (apparently the architect of business development at Facebook) on this crucial question? A hundred years from now, when all your "genius" and "innovation," all the gadgets you build that so delight aristocrats and are so celebrated by "futurists", will be all but forgotten, the choices you make on this question will be remembered. This matters.

Ms. Sandberg seems to be similarly clear on her philosophy: she simply wants as much diversity of revenue streams for Facebook as she can possibly get. It is hard to imagine a more un-ideological antithesis of Richard Stallman. Freedom or scarcity, she doesn't care: if it's a way to make money, she wants it. As many different ones as possible! She wants it all! It's hard for me to get too worked up about this, especially since for other reasons I am rooting for Ms. Sandberg's success. Even so, I would prefer it if Google were in control of this technological advance, because Google's preference on this question is so much more clear and unequivocal.

I don't care who is the "genius innovator" and who is the "big loser", whether this or that company has taken up the mantle of progress or not, who is going to get rich, which company will attract all the superstars, or any of the other questions that seem to you such critical matters, but I do care that your work makes progress towards realizing the potential of your technology to empower the workers of the world, rather than slowing it down or blocking it. Since Google has made clear the most unequivocal preference in the right direction on this question, I want Google to win. This is too important to be trusted to the control of someone ambivalent about their values, no matter how much basic sympathy I have for the pragmatic attitude.

Baris Baser: +100! Liberate the social graph! I wish I could share the narrative taking place here on my buzz post, but I'll just plug it in.

Rob: Google SO dropped the ball with Orkut - they let Facebook run off with the Crown Jewels.

Helder Suzuki: I believe that Facebook's dominance will eventually be challenged just like credit card companies are being today. But I think it's gonna come much quicker for Facebook.

There are lots of differences, but I like this comparison because credit card companies used powerful network effects to dominate and shield the market from competition. If you look at them (Visa, Amex, Mastercard), all they have today is basically brand. Today we can already tell that "credit card" payments (and the margins) will look very different in the near future.

Likewise, I don't think the social graph will protect Facebook's "market" in the long run. Just as it's far easier to set up a POS network today than it was a few years ago, the social graph is going to become something trivial in the years to come.

Rebecca: Yay! People are reading my obscenely long and intellectual comments. Thanks guys!

Piaw: I disagree with Helder, even though I agree with Rebecca that it's better for Google to own the social graph. The magic trick that Facebook pulled off was getting the typical user to provide and upload all his/her personal information. It's incredibly hard to do that: Amazon couldn't do it, and neither could Google. I don't think it's one of those things that's technically difficult, but the social engineering required to do that demands critical mass. That's why I think that Facebook is (still) under-valued.

Rob: @Piaw - it was an accident of history I think. When Facebook started, they required a student ID to join. This made a culture of "real names" that stuck, and that no one else has been able to replicate.

Piaw: @Rob: The accident of history that's difficult to replicate is what makes Facebook such a good authentication mechanism. I would be willing to not moderate my blog, for instance, if I could make all commenters disclose their true identity. The lowest quality arguments I've seen on Quora, for instance, were those in which one party was anonymous.

Elliotte Rusty Harold: This is annoying: I want to reshare Rebecca's comments, not the original post, but I can't seem to do that. :-)

Rebecca: In another conversation, someone linked a particular point in a Buzz commentary to Hacker News (http://news.ycombinator.com/item?id=1416348). I'm not sure how they did it. It was a little strange, though, because then people saw it out of context. These comments were tailored for a context.

Where do you want to share it? I'm not sure I'm ready to deal with too big an audience; there is a purpose to practicing writing and getting reactions in an obscure corner of the internet. After all, I am saying things that might be offensive or objectionable in Silicon Valley, and are, in any case, awfully forward -- it is useful to me to talk to a select group of my friends to get feedback from them on how well it does or doesn't fly. It's not like I mind being public, but I also don't mind obscurity for now.

Rebecca: Speaking of which, Piaw, I was biting my fingernails a little wondering how you would react to my way of talking about "software freedom." I've sort of thought of becoming a software freedom advocate in the tradition of Stallman or ESR, but more intellectual, with more historical perspective, and (hopefully) with less of an edge of polemical insanity. However, adding in an intellectual and historical perspective also added in the difficulty of colliding with real intellectuals and historians, which makes the whole thing fraught, so for that reason among others I've been dragging my feet.

This discussion made me dredge up this whole project, partly because I really wanted to know your reactions to it. However, you only reacted to the Facebook comments, not the more general software freedom polemic. What did you think about that?

Piaw: I mostly didn't react to the free software polemic because I agree with what you're saying. I agree that something like Adwords and Google makes possible businesses that didn't exist before. Facebook, for instance, recently showed me an ad for a Suzanne Vega concert that I definitely would not have known about, but would have wanted to go to if not for a schedule conflict. I want to be able to "like" that ad so that I can get Facebook to show me more ads like those!

Do I agree that the world would be a better place for Facebook's social graph to be an open system? Yes and No. In the sense of Facebook having less control, I think it would be a good thing. But do I think I want anybody to have access to it? No. People are already trained to click "OK" to whatever data access any applet wants in Facebook, and I don't need to be inundated with spam in Facebook --- one of the big reasons Facebook has so much less spam is because my friends are more embarrassed about spamming me than the average marketing person, and when they do spam me it's usually with something that I'm interested in, which makes it not spam.

But yes, I do wish my Buzz comments (and yours too) all propagated to Facebook/Friendfeed/etc. and the world was one big open community with trusted/authenticated users and it would be all spam free (or at least, I get to block anonymous commenters who are unauthenticated). Am I holding my breath for that? No.

I am grateful that Facebook has made a part of the internet (albeit a walled garden part) fully authenticated and therefore much more useful. I think most people don't understand how important that is, and how powerful that is, and that this bit is what makes Facebook worth whatever valuation Wall Street puts on it.

Baris: Piaw, a more fundamental question lurks within this discussion. Ultimately, will people gravitate toward others with similar interests and wait for resources to grow there (Facebook), or go where the resources are mature, healthy, and growing fast, and wait for everyone else to arrive (Google)?

Will people ultimately go to Google where amazing technology currently exists and will probably magnify, given the current trend (self driving cars, facial recognition, voice recognition, realtime language translation, impeccable geographic data, mobile OS with a bright future, unparalleled parallel computing, etc..) or join their friends first at the current largest social network, Facebook, and wait for the technology to arrive there?

A hypothetical way of looking at this: Will people move to a very big city and wait for it to be an amazing city, or move to an already amazing city and wait for everyone else to follow suit? Or are people ok with a bunch of amazing small cities?

Piaw: Baris, I don't think you've got the analogy fully correct. The proper analogy is this: Would you prefer to live in a small neighborhood where you sometimes have to go a long way to get what you want/are interested in but is relatively crime free, or would you like to live in a big city where it's easy to get what you want but you get tons of spam and occasionally someone comes in and breaks into your house?

The world obviously has both types of people, which is why suburbs and big cities both exist.

Baris: "tons of spam and occasionally someone comes in and breaks into your house?" I think this is a bit too draconian/general though... going with this analogy, I think becomes a bit more subjective, i.e. really depends on who you are in that city, where you live, what you own, how carefree you live your life, and so forth.

Piaw: Right. And Facebook has successfully designed a web-site around this ego-centricity. You can be the star of your tiny town by selectively picking your friends, or you can be the hub of a giant city and accept everyone as a friend. If the latter, then you give up your privacy when your "friend" posts compromising pictures of you that get you in trouble with your employer.

Nick: Google is the only company with the right position, somewhat appropriate incentives, and possibly the right internal culture to be the "Tim Berners Lee" of the new social internet.

I'd agree that Google hasn't done well at social, but surely they're better than that!

Rebecca: Oh, you aren't impressed with Tim Berners-Lee's work? Was it the original HTML standard you didn't like, or the more recent W3C stuff? I would admit there is stuff to complain about in both of them.

Nick: It seems to me that TBL got lucky. His original work on the WWW was good, but I think it is difficult to argue he was responsible for its success - certainly no more than someone like Marc Andreessen, who has a pattern of success that repeated after his initial success with Mosaic.

Rebecca: @Piaw (a little ways back) So you found my free software polemic so unobjectionable as to be barely worth comment? Wasn't it a little intellectually radical, with all that "not a capitalist" and "change in the nature of scarcity" stuff? When I told Lessig parts of the basic story (not in the Google context, because it was many years ago), and asked him for advice about how to talk to economists, he warned me that the words I was using contained so many warning bells of crackpot intellectual radicalism that economists would immediately write me off for using them, without any further consideration.

It never ceases to amaze me how engineers will swallow shockingly strange ideas without a peep. I suppose in the company of Stallman and ESR, I am a raging intellectual conservative and pragmatist, and since engineers have accepted their style as at least a valid way to have a discussion (even if they don't agree with their politics), I seem tame by comparison. Of course talking to historians or economists is a different matter, because they don't already accept that this is a valid way to have a discussion.

Actually, it is immensely useful to me to have this discussion thread to show people who might think I'm a crackpot, because it is evidence for the claim that in my own world nobody bats an eyelash at this way of talking.

Incidentally, I started thinking about this subject because of Krugman. In the late nineties I was a rabid Krugman fan in a style that is now popular -- "Krugman is always right" -- but was a bit strange back then when he was just another MIT economics professor hardly anyone had ever heard of. However, when he talked about technology (http://pkarchive.org/column/61100.html), I thought he was wrong, which upset me terribly because I also was almost religiously convinced he was always right. In another essay (http://pkarchive.org/personal/howiwork.html) he said it was very important to him to "Listen to the Gentiles", i.e. "Pay attention to what intelligent people are saying, even if they do not have your customs or speak your analytical language." But he also said "I have no sympathy for those people who criticize the unrealistic simplifications of model-builders, and imagine that they achieve greater sophistication by avoiding stating their assumptions clearly." So it seemed clear to me that he would be willing to hear me explain why he was wrong, as long as I would be willing to state my assumptions clearly.

Before I knew exactly what I was intending to say, my plan had been to figure out my assumptions well enough to meet his standards, and then ask him to help me do the rest of the work to cast it all into a real economic model. Back then he was just an MIT professor I'd taken a class from, not a famous NYTimes columnist and Nobel Prize-winning celebrity, so this plan seemed natural. Profs at MIT don't object if their students point out mistakes, as long as the students are responsible about it. It took me a while to struggle through the process of figuring out what my assumptions were (assumptions? I have assumptions?). When I did, I was somewhat horrified to realize that following through with my plan meant accosting him to demand he write a new Wealth of Nations for me! (He'd also left for Princeton by then and started to become famous, so my plan was logistically more difficult than I'd anticipated.) I had not originally realized what it was that I would be asking for, or that the whole thing would be so daunting.

I asked Lessig for advice about what to do (Lessig being the only person I knew who lived in both worlds) and Lessig read me the riot act about the rules of intellectual respectability. So it seemed it would be up to me to write the new Wealth of Nations, or at least enough of it to prove the respectability of the ideas contained therein. I was trying to be a computer science student, not an economist, so that degree of effort hardly fit into my plans. I tried to ask for help at the Lab for Computer Science (now CSAIL) by giving a talk in a Dangerous Ideas seminar series, but of the professors I talked to, only David Clark was sympathetic about the need for such a thing. However, he also said very clearly that resources to support grad students to work with economists were limited and really confined to only the kind of very specific net-neutrality stuff he was pushing in concert with his protocol work, not the kind of general worldview I was thinking about. So I was amazed to find that this kind of thing falls through the cracks between different parts of academic culture.

I'm still not sure what to do, but I am more and more inclined to ignore Lessig's (implicit) advice to be apologetic and defensive about my lack of intellectual respectability. That would entail a degree of effort I can't afford, since I am still focused on proving myself as a computer scientist, not an intellectual in the humanities. (Having this discussion thread to point to is quite useful on that score.) I could just drop it (I did for a while), but I'm getting more and more upset that technology is moving much faster than the intellectual and social progress that is required to handle it. People seem to think that powerful technology is a good thing in itself, but that is not true: it is only technology in the presence of strong institutions to control its power that provides net benefits to society -- without such controls it can be fantastically destructive. From that point of view a "new economy" is not good news -- what "new" means is that all the old institutions are now out of date and their controls are no longer working. And academic culture is culturally dislocated in ways that ensure that no one anywhere is really dealing with this problem. Pretty picture, isn't it?

Nick: @Rebecca: I don't understand your argument. Why is Google selling advertising anymore about freedom than Facebook selling advertising?

It's true that Facebook doesn't make their social graph and/or demographic data available to third parties, but Google doesn't make a person's search history available to third parties either. Why is one so much worse than the other?

Piaw: Rebecca, I think that having more data be more open is ideal. However, I view it as a purely academic discussion for the same reason I view writing "Independent Cycle Touring" in TeX to be an academic discussion. Sure, it could happen, but the likelihood of it happening is so slim that I don't find the discussion to be of interest.

Now, I do agree that technology and its adoption do grow faster than our wisdom and controls for them. However, I don't think that information technology is the big offender. Humanity's big long-term problems have more to do with fossil fuels as an energy source, and that's pretty darn old technology. You can fix all the privacy problems in the world, but if we get a runaway greenhouse planet by 2100 it is all moot. Because of that you don't find me getting worked up about privacy or the open-ness of Facebook's social graph. If Facebook does become personally objectionable to me, then I will close my account. Otherwise, I will keep leveraging the work their engineers do.

Elliotte: Rebecca, going back and rereading your comments I'm not sure your analysis is right, but I'm not sure it's wrong either. Of course, I am not an economist. From my non-economist perspective it seems worth further thought, but I also suspect that economists have already thought about much of this. The first thing I'd do is chat up a few economists and see if they say something like, "Oh, that's Devereaux's Theory of Productive Capacity" or some such thing.

I guess I didn't see anything particularly radical and certainly nothing objectionable in what you wrote. You're certainly not the first to notice that software economics has a different relationship to scarcity than physical goods do. Nor would I see that as incompatible with capitalism. It's only really incompatible with a particular religious view of capitalism that economists connected to the real world don't believe in anyway. The theological ideologues of the Austrian School and the nattering nabobs of television news will call you a commie (or more likely these days a socialist) but you can ignore them. Their claimed convictions are really just a bad parody of economics that bears only the slightest resemblance to the real world.

You hear a lot from these fantasy world theorists because they have been well funded over the last 40 years or so by corporations and the extremely wealthy with the explicit goal of justifying wealth. Academically this is most notable at the University of Chicago, and it's even more obvious in the pseudo-economics spouted on television news. At the extreme, these paid hucksters espouse the laissez-faire theological conviction that markets are perfectly efficient and rational and that therefore whatever the markets do must be correct; but the latest economic crises have caused most folks to realize that this emperor has no clothes. Economists doing science and not theology pay no attention to this priesthood. I wish the same could be said for the popular media.

Helder: I don't think I agree with the scarcity point that Rebecca made.

Generally, if a company is making money from something, it's because they are producing some kind of wealth; otherwise they won't be economically sustainable. It doesn't have to be productive wealth like in factories; it could be cultural (e.g. a TV show), or something else.

Even if you think of artificial scarcity, that's only possible for a company to create when it already has big momentum (e.g. Windows or Facebook dominance). Artificial scarcity sucks when you look just at it, but it's more like a "local" optimization based on an already dominant market position.

Perhaps Facebook, Microsoft and other companies wouldn't have thrived in the first place if they weren't "allowed" to make the most of their closed systems. The world is a better place with a closed Facebook and a proprietary Windows API than with no Facebook or Windows at all.

TV producers try to do their best to create the right scarcity when releasing their shows and movies to maximize profit. If they were to adopt some kind of free and open philosophy and release their content for download on day 1, they would simply go broke and destroy wealth in the long run.

Rebecca: Thanks guys, for the great comments! I appreciate the opportunity to answer these objections, because this is a subtle issue and I can certainly see that the reasoning behind my position is far from obvious. I won't be able to do it today because I need to be out all day, and it's probably just as well that I have a little time to think of how to make the reply as clear and short as possible.

Rebecca: OK, I have about four different kinds of objections to answer, and I do want to keep this as short as I can, so I think I will arrange it carefully so I can use the ideas in one answer to help me explain the next one. That means I'll answer in the order: Elliotte, Piaw, then Nick & Helder.

It actually took me much of a week to write and edit an answer I liked and believed was as condensed as I could make it. And despite my efforts it is still quite long. However, your reaction to my first version has impressed on me that there are some key points I need to take the space to clarify:

  1. I shouldn't have tried to talk about a system that "isn't capitalism" in too short an essay, because that is just too easily misunderstood. I take a good bit of space in the arguments below matching Elliotte's disavowal of the religious view of capitalism with an explicit disavowal of the religious view of the end of capitalism.

  2. Piaw also asked a good question: "why is this important?" It isn't obvious; it's only something you can see once it sinks in how dramatically decades of exponential technological growth can change the world. Since this subject is pretty crazy-making and hard to see with any perspective, I try to use an image from the past to help us predict how people from the future will see us differently than we see ourselves. I want to impress on you why future generations are likely to make very different judgements about what is and isn't important.

  3. Finally, I said rather casually that I wanted to talk about software freedom in the standard tradition, only with more intellectual and historical perspective. As I write this, though, I'm realizing the historical perspective actually changes the substance of the position, in a way I need to make clear.

And last of all I wanted to step back again and put this all in the context of what I am trying to accomplish in general, with some commentary on your reactions to the assertion that I am being intellectually radical.

These replies are split into sections so you can choose the one you like if the whole thing is too long. But the long answer to Piaw contains the idea which is key to the rest of it.

Rebecca: so, first, @Elliotte -- "I'm not the first to notice that software has a different relationship with scarcity than physical goods." But my take on the difference is not the usual: I am not repeating the infinitely-copyable thing everyone goes on about, but instead focusing on the scarcity (or increasing lack thereof) of productive capacity. That way of talking challenges more directly the fundamental assumptions of economic theory, and is therefore more intellectually radical: in a formal way, it challenges the justification for capitalism. But you didn't buy my "incompatible with capitalism" argument either, which I'm glad of, because it gives me the chance to mention that just as much as you want to disown the religious view of what capitalism is, I'd like to specifically disown the religious view of the end of capitalism.

Marx talked about an "end of capitalism" as some magic system where it becomes possible for workers to seize the means of production (the factories) and make the economy work without ownership of capital. He also predicted that capitalism must eventually end, because after all, feudalism had ended. But if you put those two assertions together and followed the syllogism to its conclusion, you would get the assertion that feudalism ended because the serfs seized the means of production (the farms) and made an economy work without the ownership of land. That isn't true! I grew up in Iowa. There are landowners there who own more acreage than most fabled medieval kings. Nobody challenges their ownership, and yet nobody would call that system feudalism. Why not? Because their fields are harvested by machines, not serfs. Feudalism ended not because the landowning class changed their landowning ways, but because the land-working class, the serfs, left for better jobs in factories; and the landowners don't care anymore, because they eventually replaced the serfs with machines. The end of feudalism was not the end of the ownership of land; it was the end of a social position and set of prerogatives that went along with that ownership. If your vassals are machines, you can't lord over them.

Similarly, in a non-religious view of the end of capitalism, it will come about not because the capitalist class, the class that owns factories, will ever disappear or change their ways, but because the proletariat will go away -- they will leave for better jobs doing something else, and the factory owners will replace them with machines. And in fact you can see that that is already happening. Are you proletariat? Am I? If I create an STL model and have it printed by Shapeways, I am manufacturing something, but I am not proletariat. Shapeways is certainly raising capital to buy their printers, which strictly speaking makes them "capitalists," but in a social sense they are not capitalists, because their relationship with me has a different power structure from the one Marx objected to so violently. I am not a "prole" being lorded over by them. It isn't the big dramatic revolution Marx envisioned; it is almost so subtle you can miss it entirely. What if capitalism ended and nobody noticed?

Rebecca: Next @Piaw -- Piaw said he didn't think information technology was the biggest offender in the realm of technology that grows faster than our controls of it; for instance he thought global warming was a more pressing immediate problem.

I definitely agree that the immediate problems created by information technology and the associated social change are, right now, small by comparison to global warming. It would be nice if we could tackle the most immediate and pressing problems first, and leave the others until they get big enough to worry about. But the problems of a new economy have the unique feature of being pressing not because they are necessarily immediate or large (right now), but because if they are left undealt-with they can destroy the political system's ability to effectively handle these or any other problems.

I'm a believer in understanding the present through the lens of the past: since we have so much more perspective about things that happened many, many years ago, we can interpret the present and predict our future by understanding how things that are happening to us now are analogous to things that happened long ago. Towards that end, I'd like to point out an analogy with a fictional image of people who, very early on in the previous "new economy," tried to push new ideas of freedom and met with the objection that they were making too big a deal over problems that were too unimportant. (That this image is fictional is part of my point -- bear with me.) My image comes from a dramatic scene in the musical 1776 (whose synopsis can be found at http://en.wikipedia.org/wiki/1776_%28musical%29, scene seven), in which an "obnoxious and disliked" John Adams almost throws away the vote of Edward Rutledge and the rest of the southern delegation over the insistence that a condemnation of slavery be included in the Declaration of Independence. He drops this insistence only when he is persuaded to change his mind by Franklin's arguments that the fight with the British is more important than any argument on the subject -- "we must hang together or we will hang separately."

In fact, nothing like that ever happened: as the historical notes on the Wikipedia page say, everyone at the time was so totally in agreement that the issue was too unimportant to be bothered to fight about it, let alone have the big showdown depicted in the musical, with Rutledge dramatically but improbably singing a spookily beautiful song in defense of the triangle trade: "Molasses to Rum to Slaves." The scene was inserted to satisfy the sensibilities of modern audiences that whether or not such a showdown happened, it should have happened.

Why are our sensibilities so different from reality? Why are we imposing on the past the idea that the fight ought to have been important to them, even though it wasn't, that John Adams ought to have made himself obnoxious and disliked in his intransigent insistence on America's founding values of freedom, even though he didn't and he wasn't, that Franklin ought to have argued with great reluctance that the fight with the British was more important, even though he never made that argument (because it went without saying), and that Edward Rutledge ought to have been a spooky, equally intransigent apologist for slavery, even though he wasn't either (later he freed his own slaves)? We are imposing this false narrative because we are looking backwards through a lens where we know something about the future the real actors had no idea about. This is important to understand because we may be in a similar position with respect to future generations -- they will think we should have had a fight we in fact have no inclination to have, because they will know something we don't know about our own future. The central argument I want to make to Piaw hinges on an understanding of this thing that later generations are likely to know about our future that we currently have difficulty imagining.

So forgive me if I belabor this point: it is key to my answer both to Piaw's question and also to Nick & Helder's objection. It's going to take a little bit of space to set up the scenery, because it is non-trivial for me to pull my audience back into a historical mentality very different from our own. But I want to go through this exercise in order to pull out of it a general understanding of how and why political ways of thinking shift in the face of dramatic technological change -- which we can use to predict our own future and the changing shape of our politics.

What is it that the real people behind this story didn't know that we know now? Start with John Adams: to understand why the real John Adams wouldn't have been very obnoxious about pushing his idea of freedom on slaveowners in 1776, realize that his idea of freedom, if restated in economic rather than moral terms, would have been the assertion that "it should be an absolute right of all citizens of the United States to leave the farm where they were born and seek a better job in a factory." But making a big deal about such a right in 1776 would have been absurd. There weren't very many factories, and they were sufficiently inefficient that the jobs they provided were unappealing at best. For example, at the time Jefferson wrote in total seriousness about the moral superiority of agrarian over industrial life: such a sentiment seemed reasonable in 1776, because, not to put too fine a point on it, factory life was horrible. Because of this, the politicians in 1776, like Adams or Hamilton, who were deeply enamored of industrialization, pushed their obsession with an apologetic air, as if they were only talking about their own personal predilections, which they took great pains to make clear they were not going to impose on anyone else. The real John Adams was not nearly as obnoxious as our imaginary version of him: we imagine him differently only because we wish he had been different.

We wish him different from how he really was because there was one important fact that the people of 1776 may have understood intellectually, but whose full social significance they did not even begin to wrap their minds around: the factories were getting better exponentially, while the farms would always stay the same. Moore's Law-like growth rates in technology are not a new phenomenon. Improvements in the production of cotton textiles in the early nineteenth century stunned observers much as improvements in chips or memory impress us today -- and after cotton-spinning had its run, other advances stepped into the limelight each in turn, as the article at www.theatlantic.com/magazine/archive/1999/10/beyond-the-information-revolution/4658/ tries to impress on us. We forget that if exponential growth runs for decades, it changes things... and it changes things more than anybody at the beginning of such a run dares to imagine.
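(As a minimal illustration of how that compounding works -- with purely hypothetical growth rates, not historical estimates -- a few lines of Python:)

```python
# Illustrative arithmetic only: how steady exponential improvement compounds
# over decades. The 5%/yr and 40%/yr rates are hypothetical, chosen to show
# the gap between "ordinary" growth and Moore's Law-like growth.
def compound(rate, years):
    """Total improvement factor after `years` of `rate` annual growth."""
    return (1 + rate) ** years

for rate in (0.05, 0.40):
    for years in (10, 45, 75):
        print(f"{rate:.0%}/yr for {years} years -> {compound(rate, years):,.0f}x")

# A 5%/yr process improves roughly 39x over 75 years; a 40%/yr process
# improves by a factor of roughly 10^11 over the same span -- the kind of
# gap that makes early extrapolation feel absurd and hindsight feel obvious.
```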

This brings us to the other characters in our story who made choices we now wish they had made differently (and they also later regretted). Edward Rutledge and Thomas Jefferson didn't exactly defend slavery; they were quite open about being uncomfortable with it, but they didn't consider this discomfort important enough to do much about. That position would also have made sense in 1776: landowners had owned slaves since antiquity, but slavery in ancient times was not fantastically onerous compared to the other options available to the lower classes at the time -- there are famous stories of enterprising Greek and Roman slaves who earned their freedom and rose to high positions in society. Rutledge and Jefferson probably thought they were offering their slaves a similar deal, and that all in all, it wasn't half bad.

They were wrong. American slavery turned out to be something unique, entirely different from the slavery of antiquity. My American history teacher presented this as a "paradox," that the country that was founded on an ideal of freedom was also home to the most brutal system of slavery the world has ever seen. But I think this "paradox" is quite understandable: it is two parts of the same phenomenon. Ask the question: why could ancient slaveowners afford to be relatively benign? Because they were also relatively secure in their position -- their slaves knew as well as they did that the lower classes didn't have many other better options. Sally Hemings, Jefferson's lover, considered running away when she was in France with him, but Jefferson successfully convinced her that she would get a better deal staying with him. He didn't have to take her home in chains: she left the possibility of freedom in France and came back of her own free will (if slightly wistfully).

But as time passed and the factory jobs in the North proceeded along their Moore's Law trajectory, eventually the alternatives available to the lower classes began to look better than at any time before in human history. The slaves Harriet Tubman smuggled to Canada arrived to find options exponentially better than those Hemings could have hoped for if she had left Jefferson. As a result, for the first time in human history, slaves had to be kept in chains.

In the more abstract terms I was using before, slavery was relatively benign when the scarcity of opportunity that bound slaves to their masters was real, but as other opportunities became available, this "real scarcity" became "artificial," something that had to be enforced with chains -- and laws. That is where the slaveowners transformed into something uniquely brutal: to preserve their way of life they needed not only to put their slaves in chains, they also needed to take over the political and legal apparatus of society to keep those chains legal. There came into existence the one-issue politician -- the politician whose motive to enter political life was not to understand or solve the problems facing the nation, to listen to other points of view or forge compromises, or any of the other natural things that a normal politician does, but merely to fight for one issue only: to write into law the "artificial scarcity" that was necessary to preserve the way of life of his constituents, and play whatever brutal political tricks were necessary to keep those laws on the books. Political violence was not off the table -- a recent editorial "When Congress Was Armed And Dangerous" (www.nytimes.com/2011/01/12/opinion/12freeman.htm) reminds us that the incitements to violence of today's politics are tame compared to the violence of the politics of the 1830's, 40's and 50's. The early 1860's were the culmination of the decades-long disaster we wish the Founding Fathers had foreseen and averted. We wish they had had the argument about slavery while there was still time for it to be a mere argument -- before the elite it supported poisoned the political system to provide for its defense.

They, in their old age, wished it too: forty-five years after Jefferson declined to make slavery an important issue in the debate over the Declaration of Independence, he was awakened by the "firebell in the night" in the form of the Missouri compromise. News of this fight caused him to wake up to the real situation, and he wrote to a friend "we have the wolf by the ears, and we can neither hold him, nor safely let him go. Justice is in one scale, and self-preservation in the other.... I regret that I am now to die in the belief that the useless sacrifice of themselves by the generation of '76, to acquire self government and happiness to their country, is to be thrown away by the unwise and unworthy passions of their sons, and that my only consolation is to be that I live not to weep over it."

So, forty-five years after he declined to engage with an "unimportant," "academic" question, he said of the consequences of that decision that his "only consolation is to be that I live not to weep over it." He had not counted on the "unwise and unworthy passions" of his sons -- for his own part, he would have been happy to let slavery lapse when economic conditions no longer gave it moral justification. However, the next generation had different ideas -- they wanted to do anything it took to preserve their prerogatives. By that point the choices he had were defined by the company he kept: since he was a Virginian, he would have had to go to war for Virginia, and fight against everything he believed in. He would have wanted to go back to the time when he could have made a choice that was his own, but that time was past and gone, and no matter how "unwise and unworthy" were the passions which were now controlling him, he had no choice but to be swept along by them.

This is my argument about why we should pay attention to "unimportant" and "academic" questions. In 1776 it was just as "academic" to consider looking ahead through seventy-five years of exponential growth to project the economic conditions of 1860, and use that projection to motivate a serious consideration of abstract principles that were faintly absurd in the conditions of the time, and would only become burning issues decades and decades later. Yet we wish they had done just that, and in their old age they also fervently wished that they had too. This seems strange: why plan for 1860 in 1776? Why plan for 2085 in 2010? Why not just cross that bridge when we come to it? Let the next generation worry about their own problems; why should we think for them? We have our own burning issues to worry about! The projected problems of 2085 are abstract, academic, and unimportant to us. Why not leave them alone and worry about our present burning concerns?

The difficulty is that if we do leave them alone, if we don't project the battle over our values absurdly into the future and start dealing with the shape of our conflict as it will look when transformed by many decades of time and technological change, we may well lose the political freedom of action to solve these problems non-violently -- or to handle any others either. We will have "a wolf by the ears." We wish the leaders of 1776 had envisioned and taken seriously the problems of 1860, because in 1776 they were still reasonable people who could talk to each other and effectively work out a compromise. By 1860 that option was no longer available. The problem is that when these kinds of problems eventually stop being "academic," when they stop being the dreams of intellectuals and become burning issues for millions of real people, the fire burns too hot. Too many powerful people choose to "take a wolf by the ears". This wolf may well consume the entire political and legal system and make it impossible to handle that problem or any other, until the only option left to restore the body politic is civil war. Once that happens everyone will fervently wish they could go back to the time when the battles were "merely academic".

I worked out this story around 2003, because starting in 1998 I had wanted to have a name to give to a nameless anxiety (in between, I thrashed around for quite a while figuring out which historical image I believed in the most). When I was sure, I considered going to Krugman to use this story to fuel a temper tantrum about how he absolutely had to stop ignoring the geeks who tried to talk to him about "freedom." But I was inhibited: I was afraid the whole argument would come across as intellectually suspect and emotionally manipulative. Besides, the immediate danger this story predicted -- that politics would devolve into 1830's style one-issue paralysis -- seemed a bit preposterous in 2003. Krugman wasn't happy about the 2002 election, but it wasn't that bad. But now I feel some remorse in the other direction: it has gotten worse faster than I ever dreamed it would. I didn't predict what has been happening exactly. I was very focussed on tech, so I didn't expect the politicians in the service of the powerful people with "a wolf by the ears" to be funded by the oldest old economy powers imaginable -- banking and oil. That result isn't incompatible with this argument: that very traditional capitalism should gain an unprecedented brutality just when the new economy is promising new freedoms, is, this line of reasoning predicts, exactly what you should expect. I'm afraid now that Krugman will be mad at me for not bothering him in 2003, because he would have wanted the extra political freedom of action more than he would have resented the very improper intellectual argument.

Rebecca: Now that I've laid the groundwork, it is much easier for me to answer Nick and Helder. Both of you are essentially telling me that I'm being unreasonable and obnoxious. I will break dramatically with Stallman by completely conceding this objection across the board. I am being unreasonably obnoxious. However, there is a general method to this madness: as I explained in the image above, I am essentially pushing values that will make sense in a few decades, and pulling them back to the current time, admittedly somewhat inappropriately. The main reason I think it is important to do this is not because I think the values I am promoting should necessarily apply in an absolute way right now (as Stallman would say) but instead because it is a lot easier to start this fight now than to deal with it later. The reason to fight now is exactly because the opponents are still reasonable, whereas they might not be later. Unlike Stallman, I want to emphasize my respect (and gratitude) for reasonable objections to my position. My opponents are unlikely to shoot me, which is not a privilege to be taken for granted, and one I want to take advantage of while I have it.

To address the specifics of your objections: Helder complained that companies needed the tactics I called "exploitation of artificial scarcity" to recoup their original investment -- if that wasn't allowed, the service wouldn't exist at all, which would be worse. Nick objected that 80 or 90% of Facebook's planned revenue was from essentially similar sources as Google's, so why should I complain just because of the other 10 or 20%? That was what I was complaining about -- that a portion of their revenue comes from closing their platform and taxing developers -- but that is only a small part of Ms. Sandberg's diversified revenue plans, and I admit that the rest is fairly logically indistinguishable from Google's strategy. In both cases it can easily be argued I am taking an extremely unreasonable hard line.

Let's delve into a dissection of how unreasonable I'm being. In both cases the unreasonableness comes from a problem with my general argument: I said that Mark Zuckerberg is not a capitalist, that is to say, he is not raising capital to buy physical objects that make his workers more productive -- but that is not entirely true. Facebook's data centers are expensive, and they are necessary to allow his employees to do their work.

The best story on this subject might also be the exception to prove the rule. The most "capitalist" story about a tech mogul's start is the account of how Larry & Sergey began by maxing out their credit cards to buy a terabyte of disk (http://books.google.com/books?id=UVz06fnwJvUC&pg=PA6#v=onepage&q&f=false) This story could have been written by Horatio Alger -- it so exactly follows the script of a standard capitalist's start. But for all that, L&S did not make all the standard capitalist noise. I was a fan of Google very early, and may even have pulled data from their original terabyte, and I never heard about how they needed to put restrictions on me to recoup their investment. A year or so later when I talked to Googlers at their recruiting events, I thought they were almost bizarrely chipper about their total lack of revenue strategy. Yet they got rich anyway. And now that same terabyte costs less than $100.

That last is the key point: it is not that the investments aren't significant and need to be recouped. It is that their size is shrinking exponentially. In the previous section, I emphasized the enormous transformative effect of decades of exponential improvement in technology, and the importance of extrapolating one's values forward through those decades, even if that means making assertions that currently seem absurd. The investments that need to be recouped are real and significant, but they are also shrinking, at an exponential rate. So the economic basis for the assertion of a right to restrict people's freedom of opportunity in order to recoup investment is temporary at best. And, as I described in the last part, asserting prerogatives on the basis of scarcity which is now real but will soon be artificial is ... dangerous. Even if you honestly think that you will change with changing times, you may find to your sincere horror that when the time comes to make a new choice, you no longer have the option. Your choices will be dictated by the company you keep. I didn't say it earlier, but one of the things that worries me the most about Facebook is that they seem to have gotten in bed with Goldman Sachs. The idea of soon-to-be multibillionaire tech moguls taking lessons in political tactics from Lloyd Blankfein doesn't make me happy.
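(The same arithmetic, run in reverse, shows how quickly a fixed capital investment stops being a meaningful barrier. The starting price and the 40%/yr decline rate below are hypothetical placeholders, not figures from this thread:)

```python
# Illustrative only: the price of a fixed amount of hardware capacity under a
# steady exponential decline. Both numbers below are hypothetical assumptions.
def price_after(initial_price, annual_decline, years):
    """Price of the same capacity after `years` of `annual_decline` per year."""
    return initial_price * (1 - annual_decline) ** years

TERABYTE_1998 = 20_000.0  # hypothetical late-1990s price for 1 TB of disk
for year in (1998, 2003, 2008, 2011):
    print(f"{year}: ~${price_after(TERABYTE_1998, 0.40, year - 1998):,.0f}")

# At a 40%/yr decline, a $20,000 outlay drops below $100 in roughly 11 years,
# which is the sense in which an investment that "needs to be recouped" is only
# a temporary basis for restricting anyone's freedom.
```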

I am glad that you objected, and gave me licence to take the space to explain this more carefully, because actually my point is more subtle and nuanced than my original account -- which I was oversimplifying to save space -- suggested. (Lessig told me I had to write an account that fit in five pages in order to hope to be heard, to which I reacted with some despair. I can't! There is more than five pages worth of complexity to this problem! If I try to reduce beyond a certain point I burn the sauce. You are watching me struggle in public with this problem.)

There is another part of the mythology of "the end of capitalism" that I should take the time to disavow. The mythology talks as if there is one clear historical moment when an angel of annunciation appears and declares that a new social order has arrived. In reality it isn't like that. It may seem like that in history books when a few pages cover the forty-five years between the time Jefferson scratched out his denunciation of slavery in the Declaration of Independence, and when he wrote the "firebell in the night" letter. But in real, lived, life, forty-five years are most of an adult lifetime. When Jefferson wrote "justice is in one scale, self-preservation in the other," could he point to a particular moment in the previous forty-five years when the hand of justice had moved the weight to the other side of the scale? There was no one moment: just undramatic but steady exponential growth in the productivity of the industrial life whose moral inferiority seemed so obvious four decades earlier. He hadn't been paying attention, no clarion call was blown at the moment of transition (if such a moment even existed), so when he heard the alarm that woke him up to how much the world had changed, it was too late.

In a similar way, I think the only clear judgement we can make now is that we are in the middle of a transition. There was some point in time when disk drives were so expensive and the data stored on them so trivial that the right of their owners to recoup their investment clearly outweighed all other concerns. There will be some other point, decades from now, when the disk drives are so cheap and the data on them so crucial to the livelihood of their users and the health of the economy, that the right to "software freedom" will clearly outweigh the rights of the owner of the hardware. There will be some point clearly "too early" to fight for new freedoms, and some point also clearly "too late." In between these two points in time? Merely a long, slow, steady pace of change. In that time the hand of justice may refuse to put the weight on either side of the scale, no matter how much we plead with her for clarity. We may live our whole lives between two eras, where no judgement can be made black or white, where everything is grey.

But people want solid judgement: they want to know what is right and wrong. This greyness is dangerous, for it opens a vacuum of power eagerly filled by the worst sorts, causes a nameless anxiety, and induces political panic. So what can you do? I think it is impossible to make black-and-white judgements about which era's values should apply. But one can say with certainty that it is desirable to preserve the freedom of political action that will make it possible to defend the values of a new era at the point when it becomes unequivocally clear that that fight is appropriate. I'm really not very upset if companies do whatever it takes to recoup their initial investment -- as long as it's temporary. But what guarantee do I have that if Facebook realizes its 50 billion market capitalization, they won't use that money to buy politicians to justify continuing their practices indefinitely? Their association with Goldman doesn't reassure me on that score. I trust Google more to allow, and even participate in, honest political discussion. That is the issue which I'm really worried about. The speed and effectiveness with which companies are already buying the political discourse has taken me by surprise, even though I had reason to predict it. When powerful people "have a wolf by the ears" they can become truly terrifying.

Rebecca: You could have a point that this kind of argument isn't as unorthodox as it once was. After all, plenty of real economists have been talking about the "new economy" -- Peter Drucker in "Beyond the Information Revolution" (linked above), Hal Varian in "Information Rules", Larry Summers in a speech "The New Wealth of Nations", even Krugman in his short essay "The Dynamo and the Microchip", and a bunch of younger economists surveyed in David Brooks's recent editorial "The Protocol Society." But I don't share Brooks's satisfaction in the observation "it is striking how [these "new economy" theorists] are moving away from mathematical modeling and toward fields like sociology and anthropology." There is a sense that my attitude is now more orthodox than the orthodoxy -- though the argument I sketched here is not a mathematical model, it was very much designed and intended to be turned into one, and I am vehemently in agreement with Krugman's attitude that nothing less than math is good enough. Economics is supposed to be a science where mathematical discipline forces intellectual honesty and provides a bulwark of defense against corruption.

I'm in shock that this crop of "new economy talk" is so loose, sloppy and journalistic ... and because it is so intellectually sloppy, it is hard even to tell whether it is corrupt or not. For instance, though I liked the historical observations and the conclusions drawn from them in Drucker's 1999 essay as much as anything I've read, his paean to the revolutionary effects of e-commerce reads so much like dot.com advertising it is almost embarrassing. Though, to be fair, there are some hints at the essay's conclusion of a consciousness that an industrial revolution isn't just about being sprinkled with technological fairy dust magic, but also involves some aspect of painful social upheaval -- even so, his story is so strangely upbeat, especially since, given his clearly deep understanding of historical thinking, he should have known better ... one wonders whom he was trying not to offend. Similarly, can we trust Summers's purported genius or do his millions in pay from various banks, um, influence his thinking? And it goes on... Krugman is about the only one left I'm sure I can trust. People who do this for a living should be doing better than this! Even though I understand Lessig's point about my intellectual radicalism, it's hard for me to want to follow it, because some part of me just wants to challenge these guys to show enough intellectual rigor to prove that I can trust them.

To be fair, I admit part of Brooks's point that there is something "anthropological" about a new economy: it tends to drive everyone mad. Think about the new industrial rich of the nineteenth century -- dressed to the nines like pseudo-aristocrats, top hat, cane, affected accent, and maybe a bought marriage to get themselves a title too -- they were cuckoo like clocks! Deep, wrenching, technologically-driven change does that to people. But just because it is madness doesn't mean it doesn't have a method to it amenable to mathematical modeling. Krugman wrote (http://pkarchive.org/personal/incidents.html) that when he was young, Asimov's "Foundation Trilogy" inspired him to dream of growing up to be a "psychohistorian who use[s his] understanding of the mathematics of society to save civilization as the Galactic Empire collapses" but he said "Unfortunately, there's no such thing (yet)." Do you think he could be tempted with the possibility of a real opportunity to be a "psychohistorian?"

I never meant for this to be a fight I pursue on my own. The whole reason I translated it into an economic and historical language is that I wanted to convince the people who do this for a living to take it up for me. I can't afford to fight alone: I don't have the time to spend caught up in political arguments, nor can I afford to make enemies of people I might want to work for. I'm making these arguments here mostly because having a record of a debate with other tech people will help convince intellectuals of my seriousness. I'm having some difficulty getting you to understand, but I think I would have a terribly uphill battle trying to convince intellectuals that I am not "crying wolf" -- they have just heard this kind of argument misused too many times before. I have to admit that I am crying wolf, but the reason I'm doing it is because this time there really is a wolf!

Piaw: Here's the thing, Rebecca: it wasn't possible to have that argument about freedom/slavery in 1776. The changes brought about later made it possible to have that argument much later. The civil war was horrifying, but I really am not sure if it was possible to change the system earlier.

Ruchira: Hi Rebecca,

I haven't yet read this long conversation. But if you're not already familiar with the concepts of rivalrous vs nonrivalrous and excludable vs nonexcludable

http://en.wikipedia.org/wiki/Rivalry_(economics)

these terms might help connect you with what others have thought about the issues you're talking about. See in particular the "Possible solutions" under Public goods:

http://en.wikipedia.org/wiki/Public_good

Daniel Stoddart: I've said it before and I'll say it again: I wouldn't be so quick to count Google out of social. Oh, I know it's cool to diss Buzz like Scoble has been doing for a while now, saying that he has more followers on Quora. But that's kind of an apples and oranges comparison.

Ruchira: Rebecca: Okay, now I have read the long conversation. I do think you have an important point but I haven't digested it enough to form an opinion (which would require judging how it interconnects with other important issues). Just a couple of tangential thoughts:

1) If you fear the loss of freedom, watch out for the military-industrial complex. You've elsewhere described some of the benefits from it, but this is precisely why you shouldn't be lulled into a false sense of comfort by these benefits, just as you're thinking others should not be lulled into a false sense of comfort about the issues you're describing. Think about the long-term consequences of untouchable and unaccountable defense spending, and about the interlocking attributes of the status quo that keep it untouchable and unaccountable. They are fundamentally interconnected with information hiding and lack of transparency.

2) There exists a kind of psychohistory: cliodynamics. http://cliodynamics.info/ As far as I know it's not yet sufficiently developed to apply to the future, though.

Ruchira: Rebecca: On that note, I wonder what you think of Noam Scheiber's article "Why Wikileaks Will Kill Big Business and Big Government" http://www.tnr.com/article/politics/80481/game-changer He's certainly thinking about how technology will cause massive changes in how society is organized.

Helder: (note: I didn't read the whole thing with full attention) In the case of some closed systems, the cost of making them open (and the lack of business justification for doing so), together with a general need to protect the business, usually far outweighs the potential return on investment. So it's not all about ROI.

Also, society's technological development and the shrinking capital needs for new businesses (e.g. the cost of a terabyte) don't usually favor closed-system businesses in the long run; they probably only weaken them. You can have a walled garden, but as the outside ground level goes up, the wall gets shorter and shorter. Just look at how the operating system is increasingly less relevant as most action gravitates towards the browser. Another example (perhaps yet to be seen?) is the credit card industry, as I mentioned in my first comment.

Rebecca: Thanks for reading this long, long post and giving me feedback!

Ruchira: Helder: Facebook makes the wall shorter for its developers (I'm sure Zynga thinks it has grown wealthy thanks to Facebook). This directly caused an outcry over privacy (the walled garden is not walled any more).

Rebecca: Hope you find it food for thought! You might also be interested in David Singh Grewal's Network Power http://amzn.to/h72nNJ It discusses a lot of relevant issues and doesn't assume a lot of background (since it's targeted at multiple disciplines), so I, as an outsider like you, found it very helpful. After that, you might (or might not) become interested in the coordination problem--if you do, Richard Tuck's Free Riding http://amzn.to/f9goyT may be of interest.

Rebecca: Thanks, Ruchira, for the links.

How does Boston compare to SV and what do MIT and Stanford have to do with it?

2010-01-01 08:00:00

This is an archive of an old Google Buzz conversation on MIT vs. Stanford and Silicon Valley vs. Boston

There's no reason why the Boston area shouldn't be as much a hotbed of startups as Silicon Valley is. By contrast, there are lots of reasons why NYC is no good for startups. Nevertheless, Paul Graham gave up on the Boston area, so there must be something that hinders startup formation in the area.

Kevin: This has nothing to do with money, or talent, or whatnot. All that matters is "entrepreneur density".

Boston may have the money, the talent, the intelligence, but does it have an entrepreneurial spirit and enough of a density?

Marya: From http://www.xconomy.com/boston/2009/01/22/paul-graham-and-y-combinator-to-leave-cambridge-stay-in-silicon-valley-year-round/ "Graham says the reasons are mostly personal, having to do with the impending birth of his child and the desire not to try and be a bi-coastal parent" But then immediately after, we see he says: "Boston just doesn’t have the startup culture that the Valley does. It has more startup culture than anywhere else, but the gap between number 1 and number 2 is huge; nothing makes that clearer than alternating between them." Here's an interview: http://www.xconomy.com/boston/2009/03/10/paul-graham-on-why-boston-should-worry-about-its-future-as-a-tech-hub-says-region-focuses-on-ideas-not-startups-while-investors-lack-confidence/ Funny, because Graham seemed partial to the Boston area, earlier: http://www.paulgraham.com/cities.html http://www.paulgraham.com/siliconvalley.html

Rebecca: I think he's partial because he likes the intellectual side of Boston, enough to make him sad that it doesn't match SV for startup culture. I know the feeling. I guess I have seen things picking up here recently, enough to make me a little wistful that I have given my intellectual side priority over any entrepreneurial urges I might have, for the time being.

Scoble: I disagree that Boston is #2. Seattle and Tel Aviv are better and even Boulder is better, in my view.

Piaw: Seattle does have a large number of Amazon and Microsoft millionaires funding startups. They just don't get much press. I wasn't aware that Boulder is a hot-bed of startup activity.

Rebecca: On the comment "there is no reason Boston shouldn't be a hotbed of startups..." Culture matters. MIT's culture is more intellectual than entrepreneurial, and Harvard even more so. I'll tell you a story: I was hanging out in the MIT computer club in the early nineties, when the web was just starting, and someone suggested that one could claim domain names to make money reselling them. Everyone in the room agreed that was the dumbest idea they had ever heard. It was crazy. Everything was available back then, you know. And everyone in that room kind of knew they were leaving money on the ground. And yet we were part of this club that culturally needed to feel ourselves above wanting to make money that way. Or later, in the late nineties I was hanging around Philip Greenspun, who was writing a book on database-backed web development. He was really getting picked on by professors for doing stuff that wasn't academic enough, that wasn't generating new ideas. He only barely graduated because he was seen as too entrepreneurial, too commercial, not original enough. Would that have happened at Stanford? I read an interview with Rajeev Motwani where he said he dug up extra disk drives whenever the Google founders asked for them, while they were still grad students. I don't think that would happen at MIT: a professor wouldn't give a grad student lots of stuff just to build something on their own that they were going to commercialize eventually. They probably would encounter complaints they weren't doing enough "real science". There was much resentment of Greenspun for the bandwidth he "stole" from MIT while starting his venture, for instance, and people weren't shy about telling him. I'm not sure I like this about MIT.

Piaw: One of my friends once turned down a full-time offer at Netscape (after his internship) to return to graduate school. He said at that time, "I didn't go to graduate school to get rich." Years later he said, "I succeeded... at not getting rich."

Dan: As the friend in question (I interned at Netscape in '96 and '97), I'm reasonably sure I wouldn't have gotten very rich by dropping out of grad school. Instead, by sticking with academia, I've managed to do reasonably well for myself with consulting on the side, and it's not like academics are paid peanuts, either.

Now, if I'd blown off academia altogether and joined Netscape in '93, which I have to say was a strong temptation, things would have worked out very differently.

Piaw: Well, there's always going to be another hot startup. :-) That's what Reed Hastings told me in 1995.

Rebecca: A venture capitalist with Silicon Valley habits (a very singular and strange beast around here) recently set up camp at MIT, and I tried to give him a little "Toto, you're not in Kansas anymore" speech. That is to say, I was trying to tell him that the habits one got from making money from Stanford students wouldn't work at MIT. It isn't that one couldn't make money investing in MIT students -- if one was patient enough, maybe one could make more, maybe a lot more. But it would only work if one understood how utterly different MIT culture is, and did something different out of an understanding of what one was buying. I didn't do a very good job talking to him, though; maybe I should try again by stepping back and talking more generally about the essential difference of MIT culture. You know, if I did that, maybe the Boston mayor's office might want to hear this too. Hmmm... you've given me an idea.

Marya: Apropos, Philip G just posted about his experience attending a conference on angel investing in Boston: http://blogs.law.harvard.edu/philg/2010/06/01/boston-angel-investors/ He's in cranky old man mode, as usual. I imagine him shaking his cane at the conference presenters from the rocking chair on his front porch. Fun quotes: 'Asked if it wouldn’t make more sense to apply capital in rapidly developing countries such as Brazil and China, the speakers responded that being an angel was more about having fun than getting a good return on investment. (Not sure whose idea of “fun” included sitting in board meetings with frustrated entrepreneurs, but personally I would rather be flying a helicopter or going to the beach.)... 'Nobody had thought about the question of whether Boston in fact needs more angel investors or venture capital. Nobody could point to an example of a good startup that had been unable to obtain funding. However, there were examples of startups, notably Facebook, that had moved to California because of superior access to capital and other resources out there... 'Nobody at the conference could answer a macro question: With the US private GDP shrinking, why do we need capital at all?'

Piaw: The GDP question is easily answered. Not all sectors are shrinking. For instance, Silicon Valley is growing dramatically right now. I wouldn't be able to help people negotiate 30% increases in compensation otherwise (well, more like 50% increases, depending on how you compute). The number of pre-IPO companies that are extremely profitable is also surprisingly high.

And personally, I think that investing in places like China and Brazil is asking for trouble unless you are well attuned to the local culture, so whoever answered the question with "it's fun" is being an idiot.

The fact that Facebook was asked by Accel to move to Palo Alto is definitely something Boston-area VCs should berate themselves about. But that "forced move" was very good for Facebook. By being in Palo Alto, they acquired Jeff Rothschild, Marc Kwiatkowski, Steve Grimm, Paul Buchheit, Sanjeev Singh, and many others who would not have moved to Boston for Facebook no matter what. It's not clear to me that staying in Boston would have been an optimal move for Facebook. At least, not before things got dramatically better in Boston for startups.

Marya: "The GDP question is easily answered. Not all sectors are shrinking. For instance, Silicon Valley is growing dramatically right now."

I'm guessing medical technology and biotech are still growing. What else?

Someone pointed this out in the comments, and Philip addressed it; he argues that angel investors are unlikely to get a good return on their investment (partial quote): "...we definitely need some sources of capital... But every part of the U.S. financial system, from venture capital right up through investment banks, is sized for an expanding private economy. That means it is oversized for the economy that we have. Which means that the returns to additional capital should be very small...."

He doesn't provide any supporting evidence, though.

Piaw: Social networks and social gaming are growing dramatically and fast.

Rebecca: Thanks, Marya, for pointing out Philip's blog post. I think the telling quote from it is this: "What evidence is there that the Boston area has ever been a sustainable place for startups to flourish? When the skills necessary to build a computer were extremely rare, minicomputer makers were successful. As soon as the skills ... became more widespread, nearly all of the new companies started up in California, Texas, Seattle, etc. When building a functional Internet application required working at the state of the art, the Boston area was home to a lot of pioneering Internet companies, e.g., Lycos. As soon as it became possible for an average programmer to ... work effectively, Boston faded to insignificance." Philip is saying Boston can only compete when it can leverage skills that only it has. That's because its ability to handle business and commercialization is so comparatively terrible that when the technological skill becomes commoditized, other cities will do much better.

But it does often get cutting-edge technical insight and skills first -- and then completely drops the ball on developing them. I find this frustrating. Now that I think about it, it seems like Boston's leaders are frustrated by this too. But I think they're making a mistake trying to remake Boston in Silicon Valley's image. If we tried to be you, at best we would be a pathetic shadow of you. We could only be successful by being ourselves, but getting better at it.

There is a fundamental problem: the people at the cutting edge aren't interested in practical things, or they wouldn't be bothering with the cutting edge. Though it might seem strange to say now, the guy who set up the hundredth web server was quite an impractical intellectual. Who needs a web server when there are only 99 others (and no browsers yet, remember)? We were laughing at him, and he was protesting the worth of this endeavor merely out of a deep intellectual faith that this was the future, no matter how silly it seemed. Over and over I have seen the lonely obsessions of impractical intellectuals become practical in two or three years, become lucrative in five or eight, and become massive industries in seven to twelve years.

So if the nascent idea that will become a huge industry in a dozen years shows up first in Boston, why can't we take advantage of it? The problem is that the people who hone their skill at nascent ideas that won't be truly lucrative for half a decade at least, are by definition impractical, too impractical to know how to take advantage of being first. But maybe Boston could become a winner if it could figure out how to pair these people up with practical types who could take advantage of the early warning about the shape of the future, and leverage the competitive advantage of access to skills no-one else has. It would take a very particular kind of practicality, different from the standard SV thing. Maybe I'm wrong, though; maybe the market just doesn't reward being first, especially if it means being on the bleeding edge of practicality. What do you think?

Piaw: Being 5 or 10 years ahead of your time is terrible. What you want to be is just 18 months or even 12 months ahead of your time, so you have just enough time to build product before the market explodes. My book covers this part as well. :-)

Marya: Rebecca, I don't know the Boston area well enough to form an opinion. I've been here two years, but I'm certainly not in the thick of things (if there is a "thick" to speak of, I haven't seen it). My guess would be that Boston doesn't have the population to be a huge center of anything, but that's a stab in the dark.

Even so, this old survey (2004) says that Boston is #2 in biotech, close behind San Diego: http://www.forbes.com/2004/06/07/cz_kd_0607biotechclusters.html So why is Boston so successful in biotech if the people here broadly lack an interest in business, or are "impractical"? (Here's a snippet from the article: "...When the most successful San Diego biotech company, IDEC Pharmaceuticals, merged with Biogen last year to become Biogen Idec (nasdaq: BIIB), it officially moved its headquarters to Biogen's hometown of Cambridge, Mass." Take that, San Diego!)

When you talk about a certain type of person being "impractical", I don't think that's really the issue. Such people can be very practical when it comes to pursuing their own particular kind of ambition. But their interests may not lie in the commercialization of an idea. Some extremely intelligent, highly skilled people just don't care about money and commerce, and may even despise them.

Even with all that, I find it hard to believe that the intelligentsia of New England are so much more cerebral than their cousins in Silicon Valley. There's certainly a puritan ethic in New England, but I don't think that drives the business culture.

Rebecca: Marya, thanks for pointing out to me I wasn't being clear (I'm kind of practicing explaining something on you, that I might try to say more formally later, hence the spam of your comment field. I hope you don't mind.) Your question "why is Boston so successful in biotech if the people here broadly lack an interest in business?" made me realize I'm not talking about people broadly -- there are plenty of business people in Boston, as everywhere. I'm talking about a particular kind of person, or even more specifically, a particular kind of relationship. Remember I contrasted the reports of Rajeev Motwani's treatment of the Google guys with the MIT CS lab's treatment of Philip? In general, I am saying that a university town like Palo Alto or Cambridge will be a magnet for ultra-ambitious young people who look for help realizing their ambitions, and a group of adults who are looking to attract such young people and enable those ambitions, and there is a characteristic relationship between them with (perhaps unspoken) terms and expectations. The idea I'm really dancing around is that these terms & expectations are very different at MIT than (I've heard) they are at Stanford. Though there may not be very many people total directly involved in this relationship, it will still determine a great deal of what the city can and can't accomplish, because it is a combination of the energy of very ambitious young people and the mentorship of experienced adults that makes big things possible.

My impression is that the most ambitious people at Stanford dream of starting the next big internet company, and if they show enough energy and talent, they will impress professors who will then open their Rolodex and tell their network of VC's "this kid will make you tons of money if you support his work." The VC's who know that this professor has been right many times before will trust this judgement. So kids with this kind of dream go to Stanford and work to impress their professors in a particular kind of way, because it puts them on a fast track to a particular kind of success.

The ambitious students most cultivated by professors in Boston have a different kind of dream: they might dream of cracking strong AI, or discovering the essential properties of programming languages that will enable fault-tolerant or parallel programming, or really understanding the calculus of lambda calculus, or revolutionizing personal genomics, or building the foundations of Bladerunner-style synthetic biology. If professors are sufficiently impressed with their student's energy and talent, they will open their Rolodex of program managers at DARPA (and NSF and NIH), and tell them "what this kid is doing isn't practical or lucrative now, nor will it be for many years to come, but nonetheless it is critical for the future economic and military competitiveness of the US that this work is supported." The program managers who know that this professor has been right many times before will trust this judgment. In this way, the kid is put on a fast track to success -- but it is a very different kind of success than the Stanford kid was looking for, and a different kind of kid who will fight to get onto this track. The meaning of success is very different, much more intellectual and much less practical, at least in the short term.

That's what I mean when I say "Boston" is less interested in business, more impractical, less entrepreneurial. It isn't that there aren't plenty of people here who have these qualities. But the "ecosystem" that gives ultra-ambitious young people the chance to do something singular which could be done nowhere else -- an ecosystem which it does have, but in a very different kind of way -- doesn't foster skill at commercialization or an interest in the immediate practical application of technology.

Maybe there is nothing wrong with that: Boston's ecosystem just fosters a different kind of achievement. However, I can see it is frustrating to the mayor of Boston, because the young people whose ambitions are enabled by Boston's ecosystem may be doing work crucial to the economic and military competitiveness of the US in the long term, but they might not help the economy of Boston very much! What often happens in the "long term" is that the work supported by grants in Boston develops to the point it becomes practical and lucrative, and then it gets commercialized in California, Seattle, New York, etc... The program managers at DARPA who funded the work are perfectly happy with this outcome, but I can imagine that the mayor of Boston is not! The kid also might not be 100% happy with this deal, because the success which he is offered isn't much like SV success -- it's a fantastic amount of work, rather hermit-like and self-abnegating, which mostly ends up making it possible for other people far away to get very, very rich using the results of his labors. At best he sees only a minuscule slice of the wealth he enabled.

What one might want instead is that the professors in Boston have two sections in their Rolodex. The first section has the names of all the relevant program managers at DARPA, and the professor flips to this section first. The second section has the names of suitable cofounders, and friendly investors, and after the student has slaved away for five to seven years making a technology practical, the professor flips to the second section and sets the student up a second time to be the chief scientist or something like that at an appropriate startup.

And it's not like this doesn't happen. It does happen. But it doesn't happen as much as it could, and I think the reason why it doesn't may be that it just takes a lot of work to maintain a really good Rolodex. These professors are busy and they just don't have enough energy to be the linchpin of a really top-quality ecosystem in two different ways at the same time.

If the mayor of Boston is upset that Boston is economically getting the short end of the stick in this whole deal (which I think it is), a practical thing he could do is give these professors some help in beefing up the second section of their Rolodex, or perhaps try to build another network of mentors which was Rolodex-enabled in the appropriate way. If he took the latter route, he should understand that this second network shouldn't try to be a clone of the similar thing at Stanford (because at best it would only be a pale shadow) but instead be particularly tailored to incorporating the DARPA-project graduates that are unique to Boston's ecosystem. That way he could make Boston a center of entrepreneurship in a way that was uniquely its own and not merely a wannabe version of something else -- which it would inevitably do badly. That's what I meant when I said Boston should be itself better, rather than trying to be a poor pale copy of Silicon Valley.

Piaw: I like that line of thought, Rebecca. Here's the counter-example: Facebook. Facebook clearly was interested in monetizing something that was very developed, and in fact, had been tried and had failed many times because the timing wasn't right. Yet Facebook had to go to Palo Alto to get funding. So the business culture has to change sufficiently that the people with money are willing to risk it on very high-risk ventures like the Facebook of 4 years ago.

Having invested my own money in startups, I find that it's definitely something very challenging. It takes a lot to convince yourself that this risk is worth taking, even if it's a relatively small portion of your portfolio. To get enough people to build critical mass, you have to have enough success in prior ventures to gain the kind of confidence that lets you fund Facebook where it was 4 years ago. I don't think I would have been able to fund Google or Facebook at the seed stage, and I've lived in the valley and worked at startups my entire career, so if anyone would be comfortable with risk, it should be me.

Dan: Rebecca: a side note on "opening a Rolodex for DARPA". It doesn't really work quite like that. It's more like "hey, kid, you should go to grad school" and you write letters of recommendation to get the kid into a top school. You, of course, steer the kid to a research group where you feel he or she will do awesome work, by whatever biased idea of awesomeness you have.

My own professorial take: if one of my undergrads says "I want to go to grad school", then I do as above. If he or she says "I want to go work for a cool startup", then I bust out the VC contacts in my rolodex.

Rebecca: Dan: I know. I was oversimplifying for dramatic effect, just because qualifying it would have made my story longer, and it was already pushing the limits of the reasonable length for a comment. Of course the SV version of the story isn't that simple either.

I have seen it happen that sufficiently brilliant undergraduates (and even high school students -- some amazing prodigies show up at MIT) can get direct support. But realize also I'm really talking about grad students -- after all, my comparison is with the relationship between the Google guys and Rajeev Motwani, which happened when they were graduate students. The exercise was to compare the opportunities they encountered with the opportunities similarly brilliant, energetic and ultra-ambitious students at MIT would have access to, and talk about how it would be similar and different. Maybe I shouldn't have called such people "kids," but it simplified and shortened my story, which was pushing its length limit anyway. Thanks for the feedback; I'm testing out this story on you, and it's useful to know which ways of saying things work and which don't.

Rebecca: Piaw: I understand that investing in startups as an individual is very scary. I know some Boston angels (personally more than professionally) and I hear stories about how cautious their angel groups are. I should explain some context: the Boston city government recently announced a big initiative to support startups in Boston, and to renovate some land opened up by the Big Dig next to some decaying seaport buildings to create a new Innovation District. I was thinking about what they could do to make that kind of initiative a success rather than a painful embarrassment (which it could easily become). So I was thinking about the investment priorities of city governments, more than individual investors like you.

Cities invest in all sorts of crazy things, like Olympic stadiums, for instance, that lose money horrifyingly ... but when you remember that the city collects 6% hotel tax on every extra visitor, and benefits from extra publicity, and collects extra property tax when new people move to the city, it suddenly doesn't look so bad anymore. Boston is losing out because there is a gap in the funding of technology between when DARPA stops funding something, because it is developed to the point where it is commercializable, and when the cautious Boston angels will start funding something -- and other states step into the gap and get rich off of the product of Massachusetts' tax dollars. That can't make the local government too happy.

Maybe the Boston city or state government might have an incentive to do something to plug that hole. They might be more tolerant of losing money directly because even a modestly lucrative venture, or one very, very slow to generate big returns, which nonetheless successfully drew talent to the city, would make them money in hotel & property tax, publicity etc. etc. -- or just not losing the huge investment they have already made in their universities! I briefly worked for someone who was funded by Boston Community Capital, an organization which, I think, divided its energies between developing low-income housing and funding selected startups that were deemed socially redeeming for Boston. When half your portfolio is low-income housing, you might have a different outlook on risk and return! I was hugely impressed by what great investors they were -- generous, helpful & patient. Patience is necessary for us because the young prodigies in Boston go into fields whose time horizon is so long -- my friends are working on synthetic biology, but it will be a long, long time before you can buy a Bladerunner-style snake!

Again, thanks for the feedback. You are helping me understand what I am not making clear.

Marya: Rebecca, you said "The idea I'm really dancing around is that these terms & expectations are very different at MIT than (I've heard) they are at Stanford."

I read your initial comments as being about general business conditions for startups in Boston. But now I think you're mainly talking about internet startups or at least startups that are based around work in computer science. You're saying MIT's computer science department in particular does a poor job of pointing students in an entrepreneurial direction, because they are too oriented towards academic topics.

Both MIT and Stanford have top computer science and business school rankings. Maybe the problem is that Stanford's business school is more inclined to "mine" the computer science department than MIT's?

Doug: Rebecca, your description of MIT vs. Stanford sounds right to me (though I don't know Stanford well). What's interesting is that I remember UC Berkeley as being very similar to how you describe MIT: the brightest/most ambitious students at Cal ended up working on BSD or Postgres or The Gimp or Gnutella, rather than going commercial. Well, I haven't kept up with Berkeley since the mid-90s, but have there been any significant startups there since Berkeley Softworks?

Piaw: Doug: Inktomi. It was very significant for its time.

Dan: John Ousterhout built a company around Tcl. Eric Allman built a company around sendmail. Mike Stonebraker did Ingres, but that was old news by the time the Internet boom started. Margo Seltzer built a company around Berkeley DB. None of them were Berkeley undergrads, though Seltzer was a grad student. Insik Rhee did a bunch of Internet-ish startup companies, but none of them had the visibility of something like Google or Yahoo.

Rebecca: Dan: I was thinking more about what you said about not involving undergraduates, but instead telling them to go to grad school. Sometimes MIT is in the nice sedate academic mode which steers undergrads to the appropriate research group when they are ready to work on their PhD. But sometimes it isn't. Let me tell you more about the story of the scene in the computer club concerning installation of the first web server. It was about the 100th web server anywhere, and its maintainer accosted me with an absurd chart "proving" the exponential growth of the web -- i.e. a graph going exponentially from 0 to 100ish, which he extrapolated forward in time to over a million -- you know the standard completely bogus argument -- except this one was exceptionally audacious in its absurdity. Yet he argued for it with such intensity and conviction, as if he was saying that this graph should convince me to drop everything and work on nothing but building the Internet, because it was the only thing that mattered!

I fended him off with the biggest stick I could find: I was determined to get my money's worth for my education, do my psets, get good grades (I cared back then), and there was no way I would let that be hurt by this insane Internet obsession. But it continued like that. The Internet crowd only grew with time, and they got more insistent that they were working on the only thing that mattered and I should drop everything and join them. That I was an undergraduate did not matter a bit to anyone. Undergrads were involved, grad students were involved, everyone was involved. It wasn't just a research project; eventually so many different research projects blended together that it became a mass obsession of an entire community, a total "Be Involved or Be Square" kind of thing. I'd love to say that I did get involved. But I didn't; I simply sat in the office on the couch and did psets, proving theorems and solving the Schrödinger equation, and fended them off with the biggest stick I could find. I was determined to get a Real Education, to get my money's worth at MIT, you know.

My point is that when the MIT ecosystem really does its thing, it is capable of tackling projects that are much bigger than ordinary research projects, because it can get a critical mass of research projects working together, involving enough grad students and also sucking in undergrads and everyone else, so that the community ends up with an emotional energy and cohesion that goes way, way beyond the normal energy of a grad student trying to finish a PhD.

There's something else too, though I cannot report on this with that much certainty, because I was too young to see it all at the time. You might ask: if MIT had this kind of emotional energy focused on something in the 90's, then what is it doing in a similar way now? And the answer I'd have to say, painfully, is that it is frustrated and miserable about being an empty shell of what it once was.

Why? Because in 2000 Bush got elected and he killed the version of DARPA with which so many professors had had such a long relationship. I didn't understand this in the 90's -- like a kid I took the things that were happening around me for granted without seeing the funding that made them possible -- but now I see that the kind of emotional energy expended by the Internet crowd at MIT in the 90's costs a lot of money, and needs an intelligent force behind it, and that scale of money and planning can only come from the military, not from NSF.

More recently I've watched professors who clearly feel it is their birthright to be able to mobilize lots of students to do really large-scale projects, but then they try to find money for it out of NSF, and they spend all their time killing themselves writing grant proposals, never getting enough money to make themselves happy, and complaining about the cowardice of academia, and wishing they could still work with their old friends at DARPA. They aren't happy because they are merely doing big successful research projects, but a mere research project isn't enough... when MIT is really MIT it can do more. It is an empty shell of itself when it is merely a collection of merely successful but not cohesive NSF-funded research projects. As I was saying, the Boston "ecosystem" has in itself the ability to do something singular, but it is singular in an entirely different way than SV's thing.

This may seem obscure, a tale of funding woes at a distant university, but perhaps it is something you should be aware of, because maybe it affects your life. The reason you should care is that when MIT was fully funded and really itself, it was building the foundations of the things that are now making you rich.

One might think of the relationship between technology and wealth like a story about potential energy: when you talk about finding a "product/market" fit, it's like pushing a big stone up a hill, until you get the "fit" at the top of the hill, and then the stone rolls down and the energy you put into it spins out and generates lots of money. In SV you focus on pushing stones up short hills -- like Piaw said, no more than 12-18 months of pushing before the "fit" happens.

But MIT in its golden age could tackle much, much bigger hills -- the whole community could focus itself on ten years of nothing but pushing a really big stone up a really big hill. The potential energy that the obsessed Internet Crowd in the 90's was pushing into the system has been playing out in your life ever since. They got a really big stone over a really big hill and sent it down onto you, and then you pushed it over little bumps on the way down, and made lots of money doing it, and you thought the potential energy you were profiting from came entirely from yourselves. Some of it was, certainly, but not all. Some of it was from us. If we aren't working on pushing up another such stone, if we can't send something else over a huge hill to crash into you, then the future might not be like the past for you. Be worried.

So you might ask, how did this story end? If I'm claiming that there was intense emotional energy being poured into developing the Internet at MIT in the 90's, why didn't those same people fan out and create the Internet industry in Boston? If we were once such winners, how did we turn into such losers? What happened to this energetic, cohesive group?

I can tell you about this, because after years of fending off the emotional gravitational pull of this obsession, towards the end I began to relent. First I said "No way!" and then I said "No!" and then I said "Maybe Later," and then I said "OK, Definitely Later"... and then, when I finally got around to Later (perhaps the standard story of my life), Later turned out to be Too Late. By 2000 I was ready to join the crowd and remake myself as an Internet Person in the MIT style. So I ended up becoming seriously involved just at the time it fell apart. Because 2000ish, almost the beginning of the Internet Era for you, was the end for us.

This weekend I was thinking of how to tell this story, and I was composing it in my head in a comic style, thinking to tell a story of myself as a "Parable of the Boston Loser" to talk about all my absurd mistakes as a microcosm of the difficulties of a whole city. I can pick on myself, can't I; no one will get upset at that? The short story is that in 2000ish the Internet crowd had achieved their product/market fit, DARPA popped the champagne -- you won, guys! Congratulations! Now go forth and commercialize! -- and pushed us out of the nest into the big world to tackle the standard tasks of commercializing a technology -- the tasks that you guys can do in your sleep. I was there, right in the middle of things, during that transition. I thought to tell you a comic story about the absurdity of my efforts in that direction, and make you laugh at me.

But when I was trying to figure out how to explain what was making it so terribly hard for me, to my great surprise I was suddenly crying really hard. All Saturday night I was thinking about it and crying. I had repressed the memory, decided I didn't care that much -- but really it was too terrible to face. All the things you can do without thinking, for us hurt terribly. The declaration of victory, the "achievement of product/market fit", the thing you long for more than anything, I -- and I think many of the people I knew -- experienced as a massive trauma. This is maybe why I've reacted so vehemently and spammed your comment field, because I have big repressed personal trauma about all this. I realized I had a much more earnest story to tell than I had previously planned.

For instance, I was reflecting on my previous comment about what cities spend money on, and thinking that I sounded like the biggest jerk ever. Was I seriously suggesting that the city take money that they would have spent on housing for poor black babies and instead spend it on overeducated white kids with plenty of other prodigiously lucrative economic opportunities? Where do I get off suggesting something like that? If I really mean it I have a big, big burden of proof.

So I'll try to combine my more earnest story with at least a sketch of how I'd tackle this burden of proof (and try to keep it short, to keep the spam factor to a minimum. The javascript is getting slow, so I'll cut this here and continue.)

Ruchira: Interlude (hope Rebecca continues soon!): Rebecca says "that scale of money and planning can only come from the military, not from NSF." Indeed, it may be useful to check out this NY Times infographic of the federal budget: http://www.nytimes.com/interactive/2010/02/01/us/budget.html

I'll cite below some of the 2011 figures from this graphic that were proposed at that time; although these may have changed, the relative magnitudes of one sector versus another are not very different. I've mostly listed sectors in decreasing order of budget size for research, except I listed the "General science & technology" sector (which includes NSF) before the "Health" sector (which includes NIH) since Rebecca had contrasted the military with NSF.

The "Research, development, test, and evaluation" segment of the "National Defense" sector is $76.77B. I guess DARPA, ONR, etc. fit there.

The "General science & technology" sector is down near the lower right. The "National Science Foundation programs" segment gets $7.36B. There's also another $0.1B for "National Science Foundation and other". The "Science, exploration, and NASA supporting activities" segment gets $12.78B. (I don't know to what extent satellite technology that is relevant to the national defense is also involved here, or in the $4.89B "Space operations" segment, or in the $0.18B "NASA Inspector General, education, and other" segment.) The "Department of Energy science programs" segment gets $5.12B. The "Department of Homeland Security science and technology programs" segment gets $1.02B.

In the "Health" sector, the "National Institutes of Health" segment gets $32.09B. The "Disease control, research, and training" segment gets $6.13B (presumably this includes the CDC). There's also "Other health research and training" at $0.14B and "Diabetes research and other" at $0.095B.

In the "Natural resources and environment sector", the "National Oceanic and Atmospheric Administration" gets $5.66B. "Regulatory, enforcement, and research programs" gets $3.86B (is this the entire EPA?).

In the "Community and regional development" sector, the "National Infrastructure Innovation and Finance fund" (new this year) gets $4B.

In the "Agriculture" sector, which presumably includes USDA-funded research, "Research and education programs" gets $1.97B, "Research and statistical analysis" gets $0.25B, and "Integrated research, education, and extension programs" gets $0.025B.

In the "Transportation" sector, "Aeronautical research and technology" gets $1.15B, which by the way would be a large (130%) relative increase. (Didn't MIT find a way of increasing jet fuel efficiency by 75% recently?)

In the "Commerce and housing credit" sector, "Science and technology" gets $0.94B. I find this rather mysterious.

In the "Education, training, employment" sector, "Research and general education aids: Other" gets $1.14B. The "Institute for Education Sciences" gets $0.74B.

In the "Energy" sector, "Nuclear energy R&D" gets $0.82B and "Research and development" gets $0.024B (presumably this is the portion outside the DoE).

In the "Veterans' benefits and services" sector, "Medical and prosthetic research" gets $0.59B.

In the "Income Security" sector there's a tiny segment "Children's research and technical assistance" $0.052B. Not sure what that means.

Rebecca: I'll start with a non-sequitur which I hope to use to get at the heart of the difference between MIT and Stanford: recently I was at a Marine publicity event and I asked the recruiter what differentiates the Army from the Marines. Since they both train soldiers to fight, why don't they do it together? He answered vehemently that they must be separate because of one simple attribute in which they are utterly opposed: how they think about the effect they want to have on the life their recruits have after they retire from the service. He characterized the Army as an organization which had two goals: first, to train good soldiers, and second, to give them skills that would get them a good start in the life they would have after they left. If you want to be a Senator, you might get your start in the Army, get connections, get job skills, have "honorable service" on your resume, and generally use it to start your climb up the ladder. The Army aspires to create a legacy of winners who began their career in the Army.

By contrast the Marines, he said, have only one goal: they want to create the very best soldiers, the elite, the soldiers they can trust in the most difficult and dangerous situations to keep the Army guys behind them alive. This elite training, he said, comes with a price. The price you pay is that the training you get does not prepare you for anything at all in the civilian world. You can be the best of the best in the Marines, and then come home and discover that you have no salable civilian job skills, that you are nearly unemployable, that you have to start all over again at the bottom of the ladder. And starting over is a lot harder than starting the first time. It can be a huge trauma. It is legendary that Marines do not come back to civilian life and turn into winners: instead they often self-destruct -- the "transition to civilian life" can be violently hard for them.

He said this calmly and without apology. Did I say he was a recruiter? He said vehemently: "I will not try to recruit you! I want you to understand everything about how painful a price you will pay to be a Marine. I will tell you straight out it probably isn't for you! The only reason you could possibly want it is because you want more than anything to be a soldier, and not just to be a soldier, but to be in the elite, the best of the best." He was saying: we don't help our alumni get started, we set them up to self-destruct, and we will not apologize for it -- it is merely the price you pay for training the elite!

This story gets to the heart of what I am trying to say is the essential difference between Stanford and MIT. Stanford is like the Army: for its best students, it has two goals -- to make them engineers, and to make them winners after they leave. And MIT is like the Marines: it has only one goal -- to make its very best students into the engineering elite, the people about whom it can truthfully tell program managers at DARPA: you can utterly trust these engineers with the future of America's economic and military competitiveness. There is a strange property to the training you get to enter into that elite, much like the strange property the non-recruiter attributed to the training of the Marines: even though it is extremely rigorous training, once you leave you can find yourself utterly without any salable skills whatever.

The skills you need to acquire to build the infrastructure ten years ahead of the market's demand for it may have zero intersection with the skills in demand in the commercial world. Not only are you not prepared to be a winner, you may not even be prepared to be basically employable. You leave and start again at the bottom. Worse than the bottom: you may have been trained with habits commercial entities find objectionable (like a visceral unwillingness to push pointers quickly, or a regrettable tendency to fight with the boss before the interview process is even over.) This can be fantastically traumatic. Much as ex-Marines suffer a difficult "transition to civilian life," the chosen children of MIT suffer a traumatic "transition to commercial life." And the leaders at MIT do not apologize for this: as the Marine said, it is just the price you pay for training the elite.

These are the general grounds on which I might appeal to the city officials in Boston. There's more to explain, but the shape of the idea would be roughly this: much as cities often pay for programs to help ex-Marines transition to civilian life, on the principle that they represent valuable human capital that ought not to be allowed to self-destruct, it might pay off for the city to understand the peculiar predicament of graduates of MIT's intense DARPA projects, and provide them with help with the "transition to commercial life." There's something in it for them! Even though people who know nothing but how to think about the infrastructure of the next decade aren't generically commercially valuable, if you put them in proximity to normal business people, their perspective would rub off in a useful way. That's the way that Boston could have catalyzed an Internet industry of its own -- not by expecting MIT students to commercialize their work, which (with the possible exception of Philip) they were constitutionally incapable of, but by giving people who wanted to commercialize something but didn't know what a chance to learn from the accumulated (nearly ten years of!) experience and expertise of the Internet Crowd.

On that note, I wanted to say -- funny you should mention Facebook. You think of Mark Zuckerberg as the social networking visionary in Boston, and Boston could have won if they had paid to keep him. I think that strange -- Zuckerberg is fundamentally one of you, not one of us. It was right he should leave. But I'll ask you a question you've probably never thought about. Suppose the Internet had not broken into the public consciousness at the time it did; suppose the world had willfully ignored it for a few more years, so the transition from a DARPA-funded research project to a commercial proposition would have happened a few years later. There was an Internet Crowd at MIT constantly asking DARPA to let them build the "next thing," where "next" is defined as "what the market will discover it wants ten years from now." So if this crowd had gotten a few more years of government support, what would they have built?

I'm pretty sure it would have been a social networking infrastructure, not like Facebook, really, but more like the Diaspora proposal. I'm not sure, but I remember in '98/'99 that's what all the emotional energy was pointing toward. It wasn't technically possible to build yet, but the instant it was that's what people wanted. I think it strange that everyone is talking about social networking and how it should be designed now; it feels to me like deja vu all over again, an echo from a decade ago. If the city or state had picked up these people after DARPA dropped them, and given them just a little more time, a bit more government support -- say by a Mass ARPA -- they could have made Boston the home, not of the big social networking company, but of the open social networking infrastructure and all the expertise and little industries such a thing would have thrown off. And it would have started years and years ago! That's how Boston could have become a leader by being itself better, rather than trying to be you badly.

Dan: I think you're perhaps overstating the impact of DARPA. DARPA, by and large, funds two kinds of university activities. First, it funds professors, which pays for post-docs, grad students, and sometimes full-time research staff. Second, DARPA also funds groups that have relatively little to do with academia, such as the BSD effort at Berkeley (although I don't know for a fact that they had DARPA money, they didn't do "publish or perish" academic research; they produced Berkeley Unix).

Undergrads at a place like MIT got an impressive immersion in computer science, with a rigor and verve that wasn't available most other places (although Berkeley basically cloned 6.001, and others did as well). They call it "drinking from a firehose" for a reason. MIT, Berkeley, and other big schools of the late 80's and early 90's had more CS students than they knew what to do with, so they cranked up the difficulty of the major and produced very strong students, while others left for easier pursuits.

The key inflection point is how popular culture at the university, and how the faculty, treat their "rock star" students. What are the expectations? At MIT, it's that you go to grad school, get a PhD, become a researcher. At Stanford, it's that you run off and get rich.

The decline in DARPA funding (or, more precisely, the micromanagement and short-term thinking) in recent years can perhaps be attributed to the leadership of Tony Tether. He's now gone, and the "new DARPA" is very much planning to come back in a big way. We'll see how it goes.

One last point: I don't buy the Army vs. Marines analogy. MIT and Stanford train students similarly, in terms of their preparation to go out and make money, and large numbers of MIT people are quite successfully out there making money. MIT has had no lack of companies spinning out of research there, notably including Akamai. The differences we're talking about here are not night vs. day, they're not Army vs. Marines. They're more subtle but still significant.

Rebecca: Yes, I've been hearing about the "unTethered DARPA." I should have mentioned that, but left it out to stay (vaguely) short. And yes, I am overstating in order to make a simple statement of what I might be asking for, couched in terms a city or state government official might be able to relate to. Maybe that's irresponsible; that's why I'm testing it on you first, to give you a chance to yell at me and tell me if you think that's so.

They are casting about for a narrative of why Boston ceded its role as a leader of the Internet industry to SV, one that would point them to something to do about it. So I was talking specifically about the sense in which Boston was once a leader in internet technology and the weaknesses that might have caused it to lose its lead. Paul Graham says that Boston's weakness in developing industries is that it is "too good" at other things, so I wanted to tell a dramatized story specifically about what the other things were and why that would lead to fatal weakness -- how being "too strong" in a particular way can also make you weak.

I certainly am overstating, but perhaps I am because I am trying to exert force against another predilection I find pernicious: the tendency to be eternally vague about the internal emotional logic that makes things happen in the world. If people build a competent, cohesive, energetic community, and then it suddenly fizzles, fails to achieve its potential, and disbands, it might be important to know what weakness caused this surprising outcome so you know how to ask for the help that would keep it from happening the next time.

And to tell the truth, I'm not sure I entirely trust your objection. I've wondered why so often I hear such weak, vague narratives about the internal emotional logic that causes things to happen in the world. Vague narratives make you helpless to solve problems! I don't cling to the right to overstate things, but I do cling to the right to sleuth out the emotional logic of cause and effect that drives the world around me. I feel sometimes that I am fighting some force that wants to thwart me in that goal -- and I suspect that that force sometimes originates, not always in rationality, but in a male tendency to not want to admit to weakness just for the sake of "seeming strong." A facade of strength can exact a high price in the currency of the real competence of the world, since often the most important action that actually makes the world better is the action of asking for help. I was really impressed with that Marine for being willing to admit to the price he paid, to the trauma he faced. That guy didn't need to fake strength! So maybe I am holding out the image of him as an example. We have government officials who are actively going out of their way to offer to help us; we have a community that accomplishes many of its greatest achievements because of government support; we shouldn't squander an opportunity to ask for what might help us. And this narrative might be wrong; that's why I'm testing it first. I'm open to criticism. But I don't want to pass by an opportunity, an opening to ask for help from someone who is offering it, merely because I'm too timid to say anything for the fear of overstatement.

Dan: Certainly, Boston's biggest strength is the huge number of universities in and around the area. Nowhere else in the country comes close. And, unsurprisingly, there are a large number of high-tech companies in and around Boston. Another MIT spin-out I forgot to mention above is iRobot, the Roomba people, which also does a variety of military robots.

To the extent that Boston "lost" the Internet revolution to Silicon Valley, consider the founding of Netscape. A few guys from Illinois and one from Kansas. They could well have gone anywhere. (Simplifying the story, but) they hooked up with an angel investor (Jim Clark) and he dragged them out to the valley, where they promptly hired a bunch of ex-SGI talent and hit the ground running. Could they have gone to Boston? Sure. But they didn't.

What seems to be happening is that different cities are developing their own specialties and that's where people go. Dallas, for example, has carved out a niche in telecom, and all the big players (Nortel, Alcatel, Cisco, etc.) do telecom work there. In Houston, needless to say, it's all about oilfield engineering. It's not that there's any particular Houston tax advantage or city/state funding that brings these companies here. Rather, the whole industry (or, at least the white collar part of it) is in Houston, and many of the big refineries are close nearby (but far enough away that you don't smell them).

Greater Boston, historically, was where the minicomputer companies were, notably DEC and Data General. Their whole world got nuked by workstations and PCs. DEC is now a vanishing part of HP and DG is now a vanishing part of EMC. The question is what sort of thing the greater Boston area will become a magnet for, in the future, and how you can use whatever leverage you've got to help make it happen. Certainly, there's no lack of smart talent graduating from Boston-area universities. The question is whether you can incentivize them to stay put.

I'd suggest that you could make headway on that front by getting cheap office space in and around Cambridge (an "incubator") plus building a local pot of VC money. I don't think you can decide, in advance, what you want the city's specialty to be. You pretty much just have to hope that it evolves organically. And, once you see a trend emerging, you might want to take financial steps to reinforce it.

Thomas: BBN (which does DARPA funded research) has long been considered a halfway house between MIT and the real world.

Piaw: It looks like there's another conversation about this thread over at Hacker News: http://news.ycombinator.com/item?id=1416348 I love conversation fragmentation.

Doug: Conversation fragmentation can be annoying, but do you really want all those Hacker News readers posting on this thread?

Piaw: Why not? Then I don't have to track things in two places.

Ruchira: hga over at Hacker News says: "Self-selection by applicants is so strong (MIT survived for a dozen years without a professional as the Director), whatever gloss the Office is now putting on the Institute, it's able to change things only so much. E.g. MIT remains a place where you don't graduate without taking (or placing out of) a year of the calculus and classical physics (taught at MIT speed), for all majors."

Well, the requirements for all majors at Caltech are: two years of calculus, two years of physics (including quantum physics), a year of chemistry, and a year of biology (the biology requirement was added after I went there); freshman chemistry lab and another introductory lab; and a total of four years of humanities and social sciences classes. The main incubator I know of near Caltech is the Idealab. Certainly JPL (the Jet Propulsion Laboratory) as well as Hollywood CGI and animation have drawn from the ranks of Caltech grads. The size of the Caltech freshman class is also much smaller than those at Stanford or MIT.

I don't know enough to gauge the relative success of Caltech grads at transitioning to local industry, versus Stanford or MIT. Does anyone else?

Rebecca: The comments are teaching me what I didn't make clear, and this is one of the worst ones. When I talked about the "transition to the commercial world" I didn't mainly mean grads transitioning to industry. I was thinking more about the transition that a project goes through when it achieves product/market fit.

This might not be something that you think of as such a big deal, because when companies embark on projects, they usually start with a fairly specific plan of the market they mean to tackle and what they mean to do if and when the market does adopt their product. There is no difficult transition because they were planning for it all along. After all, that's the whole point of a company! But a ten year research project has no such plan. The web server enthusiast did not know when the market would adopt his "product" -- remember, browsers were still primitive then -- nor did he really know what it would look like when they did. Some projects are even longer term than that: a programming language professor said that the expected time from the conception of a new programming language idea to its widespread adoption is thirty years. That's a good chunk of a lifetime.

When you've spent a good bit of your life involved with something as a research project that no-one besides your small crowd cares about, when people do notice, when commercial opportunities show up, when money starts pouring out of the sky, it's a huge shock! You haven't planned for it at all. Have you heard Philip's story of how he got his first contract for what became ArsDigita? I couldn't find the story exactly, but it was something like this: he had posted some of the code for his forum software online, and HP called him up and asked him to install and configure it for them. He said "No! I'm busy! Go away!" They said "we'll pay you $100,000." He's in shock: “You'll give me $100,000 for 2 weeks of work?”

He wasn't exactly planning for money to start raining down out of the sky. When he started doing internet applications, he said, people had told him he was crazy, there was no future in it. I remember when I first started seeing URL's in ads on the side of buses, and I was just bowled over -- all the time my friends had been doing web stuff, I had never really believed they would ever be adopted. URL's are just so geeky, after all! I mean, seriously, if some wild-eyed nerd told you that in five years people would print "http://" on the side of a bus, what would you think? I paid attention to what they were doing because they thought it was cool, I thought it was cool, and the fact that I had no real faith anyone else ever would made no difference. So when the world actually did, we were entering a new world that none of us were prepared for, that nobody had planned for, that we had not given any thought to developing skills to be able to deal with. I guess this is a little hard to convey, because it wouldn't happen in a company. You wouldn't ever do something just because you thought it was cool, without any faith that anyone would ever agree with you, and then get completely caught by surprise, completely bowled over, when the rest of the world goes crazy about what you thought was your esoteric geeky obsession.

Piaw: I think we were all bowled over by how quickly people started exchanging e-mail addresses, and then web-sites, etc. I was stunned. But it took a really long time for real profits to show up! It took 20 or so search engine companies to start up and fail before someone succeeded!

Rebecca: Of course; you are bringing up what was in fact the big problem. The question was: in what mode is it reasonable to ask the local government for help? And if you are in the situation where $100,000 checks are raining on you out of the sky without you seeming to make the slightest effort to even solicit them, then it seems like only the biggest jerk on the planet would claim to the government that they were Needy and Deserving. Black babies without roofs over their heads are needy and deserving; rich white obnoxious nerds with money raining down on them are not. But remember: though Philip doesn't seem to be expending much effort in his story, he also said in the late 90's that he had been building web apps for ten years. Who else on the planet in 1999 could show someone a ten-year-long resume of web app development?

As Piaw said, it isn't like picking up the potential wealth really was just a matter of holding out your hand as money rained from the sky. Quite the contrary. It wasn't easy; in fact it was singularly difficult. Sure, Philip talked like it was easy, until you think about how hard it would have been to amass the resume he had in 1999.

When the local government talks about how it wants to attract innovators to Boston, to turn the city into a Hub of Innovation, my knee-jerk reaction is -- and what are we, chopped liver? But then I realize that when they say they want to attract innovators, what they really mean is not that they want innovators, but that they want people who can innovate for a reasonable, manageable amount of time, preferably short, and then turn around, quick as quicksilver, and scoop up all the return on investment in that innovation before anyone else can get at it -- and give a big cut in taxes to the city and state! Those are the kind of innovators who are attractive! Those are the kind who properly make your Boston the kind of Hub of Innovation the Mayor of Boston wants it to be. Innovators like those in Tech Square or Stata, not so much. We definitely qualify for the Chopped Liver department.

And this hurts. It hurts to think that the Mayor of Boston might be treating us with more respect now if we had been better in ~2000 at turning around, quick as quicksilver, and remaking ourselves into people who could scoop up all, or some, or even a tiny fraction of the return on investment of the innovation at which we were then, in a technical sense, well ahead of anyone else. But remaking yourself is not easy! Especially when you realize that the state from which we were remaking ourselves was sort of like the Marines -- a somewhat ascetic state, one that gave you the nerd equivalent of military rations, a tent, maybe a shower every two weeks, and no training in any immediately salable skills whatsoever -- but also one that gave you a community, an identity, a purpose, a sense of who you were that you never expected to change. But all of a sudden we "won," and all of a sudden there was a tremendous pressure to change. It was like being thrown in the deep end of the pool without swim lessons, and yes we sank, we sank like a stone with barely a dog paddle before making a beeline for the bottom. So we get no respect now. But was this a reasonable thing to expect? What does the mayor of Boston really want? Yes, the sense in which Boston is a Hub of Innovation (for it already is one, it is silly for it to try to become what it already is!) is problematic and not exactly what a Mayor would wish for. I understand his frustration. But I think he would do better to work with his city for what it is, in all its problematic incompetence and glory, than to try to remake it in the image of something else it is not.

Rebecca: On the subject of Problematic Innovators, I was thinking back to the scene in the computer lab where everyone agreed that hoarding domain names was the dumbest idea they had ever heard of. I'm arguing that scooping up the return on the investment in innovation was hard, but registering a domain name is the easiest thing in the world. I think they were free back then, even. If I remember right, they started out free, and then Procter & Gamble registered en masse every name that had even the vaguest etymological relation to the idea of "soap," at which point the administrators of the system said "Oops!" and instituted registration fees to discourage that kind of behavior -- which, of course, would have done little to deter P&G. They really do want to utterly own the concept of soap. (I find it amusing that P&G was the first at bat in the domain name scramble -- they are not exactly the world's image of a cutting-edge tech-savvy company -- but when it comes to the problem of marketing soap, they quietly dominate.)

How can I explain that we were not able to expend even the utterly minimal effort at capturing the return on investment in innovation of registering a free domain name, so as to keep the resulting tax revenues in Massachusetts?

Thinking back on it, I don't think it was either incapacity, or lack of foresight, or a will to fail in our duty as Boston and Massachusetts taxpayers. It was something else: it was almost a "semper fidelis"-like group spirit that made it seem dishonorable to hoard a domain name that someone else might want, just to profit from it later. Now one might ask, why should you refrain from hoarding it sooner just so that someone else could grab it and hoard it later? That kind of honor doesn't accomplish anything for anyone!

But you have to realize, this was right at the beginning, when the domain name system was brand new and it wasn't at all clear it would be adopted. These were the people who were trying to convince the world to accept this system they had designed and whose adoption they fervently desired. In that situation, honor did make a difference. It wouldn't look good to ask the world to accept a naming system with all the good names already taken. You actually noticed back then when something (like "soap") got taken -- the question wasn't what was available, the question was what was taken, and by whom. You'd think it wouldn't hurt too much to take one cool name: recently I heard that someone got a $38 million offer for "cool.com." That's a lot of money! -- would it have hurt that much to offer the world a system with all the names available except, you know, one cool one? But there was a group spirit that was quite worried that once you started down that slope, who knew where it would lead?

There were other aspects of infrastructure, deeper down, harder to talk about, where this group ethos was even more critical. You can game an infrastructure to make it easier to personally profit from it -- but it hurts the infrastructure itself to do that. So there was a vehement group discipline that maintained a will to fight any such urge to diminish the value of the infrastructure for individual profit.

This partly explains why we were not able, when the time came, to turn around, quick as quicksilver, and scoop up the big profits. To do that would have meant changing, not only what we were good at, but what we thought was right.

When I think back, I wonder why people weren't more scared. When we chose not to register "cool.com" or similar names, why didn't we think: life is hard, the future is uncertain, and money really does make a difference in what you can do? I think this group ethic was only possible because there was a certain confidence -- the group felt itself party to a deal: in return for being who we are, the government would take care of us, forever. Not until the time when the product achieved sufficient product/market fit that it became appropriate to expect return on investment. Forever.

This story might give a different perspective on why it hurts when the Mayor of Boston announces that he wants to make the city a Hub of Innovation. The innovators he already has are chopped liver? Well, it's understandable that he isn't too pleased with the innovators in this story, because they aren't exactly a tax base. But that is the diametric opposite of the deal with the government we thought we had.

History of Symbolics lisp machines

2007-11-16 08:00:00

This is an archive of Dan Weinreb's comments on Symbolics and Lisp machines.

Rebuttal to Stallman’s Story About The Formation of Symbolics and LMI

Richard Stallman has been telling a story about the origins of the Lisp machine companies, and the effects on the M.I.T. Artificial Intelligence Lab, for many years. He has published it in a book, and in a widely-referenced paper, which you can find at http://www.gnu.org/gnu/rms-lisp.html.

His account is highly biased, and in many places just plain wrong. Here’s my own perspective on what really happened.

Richard Greenblatt’s proposal for a Lisp machine company had two premises. First, there should be no outside investment. This would have been totally unrealistic: a company manufacturing computer hardware needs capital. Second, Greenblatt himself would be the CEO. The other members of the Lisp machine project were extremely dubious of Greenblatt’s ability to run a company. So Greenblatt and the others went their separate ways and set up two companies.

Stallman’s characterization of this as “backstabbing”, and his claim that Symbolics decided to “not have scruples”, is pure hogwash. There was no backstabbing whatsoever. Symbolics was extremely scrupulous. Stallman’s characterization of Symbolics as “looking for ways to destroy” LMI is pure fantasy.

Stallman claims that Symbolics “hired away all the hackers” and that “the AI lab was now helpless” and “nobody had envisioned that the AI lab’s hacker group would be wiped out, but it was” and that Symbolics “wiped out MIT”. First of all, had there been only one Lisp machine company as Stallman would have preferred, exactly the same people would have left the AI lab. Secondly, Symbolics only hired four full-time and one part-time person from the AI lab (see below).

Stallman goes on to say: “So Symbolics came up with a plan. They said to the lab, ‘We will continue making our changes to the system available for you to use, but you can’t put it into the MIT Lisp machine system. Instead, we’ll give you access to Symbolics’ Lisp machine system, and you can run it, but that’s all you can do.’” In other words, software that was developed at Symbolics was not given away for free to LMI. Is that so surprising? Anyway, that wasn’t Symbolics’s “plan”; it was part of the MIT licensing agreement, the very same one that LMI signed. LMI’s changes were all proprietary to LMI, too.

Next, he says: “After a while, I came to the conclusion that it would be best if I didn’t even look at their code. When they made a beta announcement that gave the release notes, I would see what the features were and then implement them. By the time they had a real release, I did too.” First of all, he really was looking at the Symbolics code; we caught him doing it several times. But secondly, even if he hadn’t, it’s a whole lot easier to copy what someone else has already designed than to design it yourself. What he copied were incremental improvements: a new editor command here, a new Lisp utility there. This was a very small fraction of the software development being done at Symbolics.

His characterization of this as “punishing” Symbolics is silly. What he did never made any difference to Symbolics. In real life, Symbolics was rarely competing with LMI for sales. LMI’s existence had very little to do with Symbolics’s bottom line.

And while I’m setting the record straight, the original (TECO-based) Emacs was created and designed by Guy L. Steele Jr. and David Moon. After they had it working, and it had become established as the standard text editor at the AI lab, Stallman took over its maintenance.

Here is the list of Symbolics founders. Note that Bruce Edwards and I had worked at the MIT AI Lab previously, but had already left to go to other jobs before Symbolics started. Henry Baker was not one of the “hackers” of which Stallman speaks.

  • Robert Adams (original CEO, California)
  • Russell Noftsker (CEO thereafter)
  • Minoru Tonai (CFO, California)
  • John Kulp (from MIT Plasma Physics Lab)
  • Tom Knight (from MIT AI Lab)
  • Jack Holloway (from MIT AI Lab)
  • David Moon (half-time at MIT AI Lab)
  • Dan Weinreb (from Lawrence Livermore Labs)
  • Howard Cannon (from MIT AI Lab)
  • Mike McMahon (from MIT AI Lab)
  • Jim Kulp (from IIASA, Vienna)
  • Bruce Edwards (from IIASA, Vienna)
  • Bernie Greenberg (from Honeywell CISL)
  • Clark Baker (from MIT LCS)
  • Chris Terman (from MIT LCS)
  • John Blankenbaker (hardware engineer, California)
  • Bob Williams (hardware engineer, California)
  • Bob South (hardware engineer, California)
  • Henry Baker (from MIT)
  • Dave Dyer (from USC ISI)

Why Did Symbolics Fail?

In a comment on a previous blog entry, I was asked why Symbolics failed. The following is oversimplified but should be good enough. My old friends are very welcome to post comments with corrections or additions, and of course everyone is invited to post comments.

First, remember that at the time Symbolics started around 1980, serious computer users used timesharing systems. The very idea of a whole computer for one person was audacious, almost heretical. Every computer company (think Prime, Data General, DEC) did their own hardware and their own software suite. There were no PCs, no Macs, no workstations. At the MIT Artificial Intelligence Lab, fifteen researchers shared a computer with a .001 GHz CPU and .002 GB of main memory.

Symbolics sold to two kinds of customers, which I’ll call primary and secondary. The primary customers used Lisp machines as software development environments. The original target market was the MIT AI Lab itself, followed by similar institutions: universities, corporate research labs, and so on. The secondary customers used Lisp machines to run applications that had been written by some other party.

We had great success amongst primary customers. I think we could have found a lot more of them if our marketing had been better. For example, did you know that Symbolics had a world-class software development environment for Fortran, C, Ada, and other popular languages, with amazing semantics-understanding in the editor, a powerful debugger, the ability for the languages to call each other, and so on? We put a lot of work into those, but they were never publicized or advertised.

But we knew that the only way to really succeed was to develop the secondary market. ICAD made an advanced constraint-based computer-aided design system that ran only on Symbolics machines. Sadly, they were the only company that ever did. Why?

The world changed out from under us very quickly. The new “workstation” category of computer appeared: the Suns and Apollos and so on. New technology for implementing Lisp was invented that allowed good Lisp implementations to run on conventional hardware; not quite as good as ours, but good enough for most purposes. So the real value-added of our special Lisp architecture was suddenly diminished. A large body of useful Unix software came to exist and was portable amongst the Unix workstations: no longer did each vendor have to develop a whole software suite. And the workstation vendors got to piggyback on the ever-faster, ever-cheaper CPU’s being made by Intel and Motorola and IBM, with whom it was hard for Symbolics to keep up. We at Symbolics were slow to acknowledge this. We believed our own “dogma” even as it became less true. It was embedded in our corporate culture. If you disputed it, your co-workers felt that you “just didn’t get it” and weren’t a member of the clan, so to speak. This stifled objective analysis. (This is a very easy problem to fall into — don’t let it happen to you!)

The secondary market often had reasons that they needed to use workstation (and, later, PC) hardware. Often they needed to interact with other software that didn’t run under Symbolics. Or they wanted to share the cost of the hardware with other applications that didn’t run on Symbolics. Symbolics machines came to be seen as “special-purpose hardware” as compared to “general-purpose” Unix workstations (and later Windows PCs). They cost a lot, but could not be used for the wider and wider range of available Unix software. Very few vendors wanted to make a product that could only run on “special-purpose hardware”. (Thanks, ICAD; we love you!)

Also, a lot of Symbolics sales were based on the promise of rule-based expert systems, of which the early examples were written in Lisp. Rule-based expert systems are a fine thing, and are widely used today (but often not in Lisp). But they were tremendously over-hyped by certain academics and by their industry, resulting in a huge backlash around 1988. “Artificial Intelligence” fell out of favor; the “AI Winter” had arrived.

(Symbolics did launch its own effort to produce a Lisp for the PC, called CLOE, and also partnered with other Lisp companies, particularly Gold Hill, so that customers could develop on a Symbolics and deploy on a conventional machine. We were not totally stupid. The bottom line is that interest in Lisp just declined too much.)

Meanwhile, back at Symbolics, there were huge internal management conflicts, leading to the resignation of much of top management, who were replaced by the board of directors with new CEO’s who did not do a good job, and did not have the vision to see what was happening. Symbolics signed long-term leases on big new offices and a new factory, anticipating growth that did not come, and were unable to sublease the properties due to office-space gluts, which drained a great deal of money. There were rounds of layoffs. More and more of us realized what was going on, and that Symbolics was not reacting. Having created an object-oriented database system for Lisp called Statice, I left in 1988 with several co-workers to form Object Design, Inc., to make an object-oriented database system for the brand-new mainstream object-oriented language, C++. (The company was very successful and currently exists as the ObjectStore division of Progress Software (www.objectstore.com). I’m looking forward to the 20th-year reunion party next summer.)

Symbolics did try to deal with the situation, first by making Lisp machines that were plug-in boards that could be connected to conventional computers. One problem is that they kept betting on the wrong horses. The MacIvory was a Symbolics Ivory chip (yes, we made our own CPU chips) that plugged into the NuBus (oops, long-since gone) on a Macintosh (oops, not the leading platform). Later, they finally gave up on competing with the big chip makers, and made a plug-in board using a fast chip from a major manufacturer: the DEC Alpha architecture (oops, killed by HP/Compaq, should have used the Intel). By this time it was all too little, too late.

The person who commented on the previous blog entry referred to an MIT Master’s thesis by one Eve Philips (see http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/ai-business.pdf) called “If It Works, It’s Not AI: A Commercial Look at Artificial Intelligence Startups”. This is the first I’ve heard of it, but evidently she got help from Tom Knight, who is one of the other Symbolics co-founders and knows as much or more about Symbolics history than I. Let’s see what she says.

Hey, this looks great. Well worth reading! She definitely knows what she’s talking about, and it’s fun to read. It brings back a lot of old memories for me. If you ever want to start a company, you can learn a lot from reading “war stories” like the ones herein.

Here are some comments, as I read along. Much of the paper is about the AI software vendors, but their fate had a strong effect on Symbolics.

Oh, of course, the fact that DARPA cut funding in the late 80’s is very important. Many of the Symbolics primary-market customers had been ultimately funded by DARPA research grants.

Yes, there were some exciting successes with rule-based expert systems. Inference’s “Authorizer’s Assistant” for American Express, to help the people who talk to you on the phone to make sure you’re not using an AmEx card fraudulently, ran on Symbolics machines. I learn here that it was credited with a 45-67% internal rate of return on investment, which is very impressive.

The paper has an anachronism: “Few large software firms providing languages (namely Microsoft) provide any kind of Lisp support.” Microsoft’s dominance was years away when these events happened. For example, remember that the first viable Windows O/S, release 3.1, came out in 1990. But her overall point is valid.

She says “There was a large amount of hubris, not completely unwarranted, by the AI community that Lisp would change the way computer systems everywhere ran.” That is absolutely true. It’s not as wrong as it sounds: many ideas from Lisp have become mainstream, particularly managed (garbage-collected) storage, and Lisp gets some of the credit for the acceptance of object-oriented programming. I have no question that Lisp was a huge influence on Java, and thence on C#. Note that the Microsoft Common Language Runtime technology is currently under the direction of the awesome Patrick Dussud, who was the major Lisp wizard from the third MIT-Lisp-machine company, Texas Instruments.

But back then we really believed in Lisp. We felt only scorn for anyone trying to write an expert system in C; that was part of our corporate culture. We really did think Lisp would “change the world” analogously to the way “sixties-era” people thought the world could be changed by “peace, love, and joy”. Sorry, it’s not that easy.

Which reminds me, I cannot recommend highly enough the book “Patterns of Software: Tales from the Software Community” by Richard Gabriel (http://www.dreamsongs.com/Files/PatternsOfSoftware.pdf) regarding the process by which technology moves from the lab to the market. Gabriel is one of the five main Common Lisp designers (along with Guy Steele, Scott Fahlman, David Moon, and myself), but the key points here go way beyond Lisp. This is the culmination of the series of papers by Gabriel starting with his original “Worse is Better”. Here the ideas are far more developed. His insights are unique and extremely persuasive.

OK, back to Eve Philips: at chapter 5 she describes “The AI Hardware Industry”, starting with the MIT Lisp machine. Does she get it right? Well, she says “14 AI lab hackers joined them”; see my previous post about this figure, but in context this is a very minor issue. The rest of the story is right on. (She even mentions the real-estate problems I pointed out above!) She amply demonstrates the weaknesses of Symbolics management and marketing, too. This is an excellent piece of work.

Symbolics was tremendously fun. We had a lot of success for a while, and went public. My colleagues were some of the most skilled and likable technical people you could ever hope to work with. I learned a lot from them. I wouldn’t have missed it for the world.

After I left, I thought I’d never see Lisp again. But now I find myself at ITA Software, where we’re writing a huge, complex transaction-processing system (a new airline reservation system, initially for Air Canada), whose core is in Common Lisp. We almost certainly have the largest team of Common Lisp programmers in the world. Our development environment is OK, but I really wish I had a Lisp machine again.

More about Why Symbolics Failed

I just came across “Symbolics, Inc: A failure of heterogeneous engineering” by Alvin Graylin, Kari Anne Hoir Kjolaas, Jonathan Loflin, and Jimmie D. Walker III (it doesn’t say with whom they are affiliated, and there is no date), at http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/Symbolics.pdf

This is an excellent paper, and if you are interested in what happened to Symbolics, it’s a must-read.

The paper’s thesis is based on a concept called “heterogeneous engineering”, but it’s hard to see what they mean by that other than “running a company well”. They have fancy ways of saying that you can’t just do technology, you have to do marketing and sales and finance and so on, which is rather obvious. They are quite right about the wide diversity of feelings about the long-term vision of Symbolics, and I should have mentioned that in my essay as being one of the biggest problems with Symbolics. The random directions of R&D, often not co-ordinated with the rest of the company, are well-described here (they had good sources, including lots of characteristically, harshly honest email from Dave Moon). The separation between the software part of the company in Cambridge, MA and the hardware part of the company in Woodland Hills (later Chatsworth) CA was also a real problem. They say “Once funds were available, Symbolics was spending money like a lottery winner with new-found riches” and that’s absolutely correct. Feature creep was indeed extremely rampant. The paper also has financial figures for Symbolics, which are quite interesting and revealing, showing a steady rise through 1986, followed by falling revenues and negative earnings from 1987 to 1989.

Here are some points I dispute. They say “During the years of growth Symbolics had been searching for a CEO”, leading up to the hiring of Brian Sear. I am pretty sure that only happened when the trouble started. I disagree with the statement by Brian Sear that we didn’t take care of our current customers; we really did work hard at that, and I think that’s one of the reasons so many former Symbolics customers are so nostalgic. I don’t think Russell is right that “many of the Symbolics machines were purchased by researchers funded through the Star Wars program”, a point which they repeat many times. However, many were funded through DARPA, and if you just substitute that for all the claims about “Star Wars”, then what they say is right. The claim that “the proliferation of LISP machines may have exceeded the proliferation of LISP programmers” is hyperbole. It’s not true that nobody thought about a broader market than the researchers; rather, we intended to sell to value-added resellers (VAR’s) and original equipment manufacturers (OEM’s). The phrase “VARs and OEMs” was practically a mantra. Unfortunately, we only managed to do it once (ICAD). While they are right that Sun machines “could be used for many other applications”, the interesting point is the reason for that: why did Sun’s have many applications available? The rise of Unix as a portable platform, which was a new concept at the time, had a lot to do with it, as well as Sun’s prices. They don’t consider why Apollo failed.

There’s plenty more. To the authors, wherever you are: thank you very much!

Subspace / Continuum History

2006-02-01 08:00:00

Archived from an unknown source. Possibly Gravitron?

In regards to the history:

Chapter #1

(Around) December 1995 is when it all started. Rod Humble wanted to create something like Air Warrior but online; he approached Virgin Interactive Entertainment with the idea and they replied with something along the lines of "here's the cash, good luck". Rod called Jeff Petersen and asked if he was interested in helping him create an online-only game, and Jeff agreed. In order to overcome lag, Jeff decided it would be best to test an engine which simulates Newtonian physics principles - an object in motion tends to keep the same vector - and on top of that he also built prediction formulas. They also enlisted Juan Sanchez, with whom Rod had worked before, to design some graphics for it. Either way, after some initial progress they decided to put it on the public gamers' block to test it, and code named it Sniper. They got a few people to play it and give feedback. After a short alpha testing period they decided that they had learned enough and decided to pull the plug. The shockwave that came back from the community when the announcement was received impressed them enough to keep it in development. They moved on to beta in early-mid '96 and dubbed it SubSpace. From there it entered a real development cycle and was opened up and advertised by word of mouth to many people. Michael Simpson (Blackie) was assigned by VIE as an external producer from their Westwood studio division to serve as a promotional agent, community manager, and overall public relations person. Jeff and Rod started to prepare to leave VIE.
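The lag-handling approach described above is essentially dead reckoning: each client extrapolates a remote ship's position from its last known position and velocity rather than waiting for the next network update. A minimal sketch of that idea, in Python; the names and structure here are illustrative assumptions, not SubSpace's actual code:

```python
from dataclasses import dataclass

@dataclass
class ShipState:
    # Last state received over the network, plus the time it was sampled.
    x: float
    y: float
    vx: float
    vy: float
    timestamp: float  # seconds

def predict_position(state: ShipState, now: float) -> tuple[float, float]:
    """Extrapolate where the ship should be at time `now`, assuming it keeps
    the same velocity vector (Newtonian motion, no new input received yet)."""
    dt = now - state.timestamp
    return (state.x + state.vx * dt, state.y + state.vy * dt)

# Example: a ship last seen at (100, 200) moving at (30, -10) units/second,
# rendered 150 ms after that update arrived.
last = ShipState(x=100.0, y=200.0, vx=30.0, vy=-10.0, timestamp=10.00)
print(predict_position(last, now=10.15))  # -> (104.5, 198.5)
```

When the next authoritative update arrives, the client would correct any drift between the predicted and reported positions; this sketch only shows the extrapolation step.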

Chapter #2

In late '97 SubSpace officially entered the retail cycle, with pre-orders being collected and demo privileges being revoked (demo clients were confined to 15 minutes of play and only the first four ships were accessible). The reason behind this was that VIE was going downhill, losing money (the only thing that had kept them afloat this long was Westwood and the C&C franchise), and tried to cash in on any bone they could get. In early '98 a small effort, primarily on Juan's part and mainly talked up by Michael, began on SubSpace 2; it soon enough dissolved to dust and was never discussed again. Skipping to '98, VIE classified SubSpace as a B-rated product, which meant it got no advertising budget. In addition, they only manufactured a mere 10,000 or so copies of it and tossed it to only a select few retailers for $30 a box. Along those lines, VIE also lost an opportunity to sell SubSpace to Microsoft, as part of "The Zone", which would have ensured the game's long-term success and continuance, for a very nice sum. The deal fell through the cracks due to the meddling of Viacom, who owned VIE at the time, until it was completely screwed up. Rod and Jeff, enraged by all of this, realized it was over and notified VIE several months ahead of their contracts expiring (they were employed on a one-year contract with an option to extend it another year at a time) about their intentions to leave and go independent. They tried to negotiate with VIE to enter a developer-publisher relationship; naturally it didn't work, and they separated. In October '98 word broke from an inside source, and rumors, which would later be proven true, began to fly about VIE going bankrupt and SubSpace being abandoned and left without support or a development team. Although frantically denied by Michael, the horror was proven true, and not too long after, VIE officially announced the shutdown of SubSpace and complete withdrawal of support, accompanied by a Chapter 11 filing and a sell-off of its remaining assets (Viacom had already sold Westwood to Electronic Arts, along with Michael). The owners of Brilliant Digital Entertainment (Kazaa/Altnet.com) created an asset holding company called Ozaq2, and are now the sole holders of the SubSpace copyrights. By then, the original developers were long gone.

Chapter #3

In early '98/late '97 the ex-SubSpace developers - Rod, Jeff, and Juan - moved to Origin, which contracted them to create Crusader Online. Unfortunately, after they produced an alpha version, Origin exercised a clause which stated that they could pull the plug if they did not like the demo, and terminated the project. Nick Fisher (AKA trixter, as known in SubSpace) approached them, and together they formed Harmless Games. Their first task was taking what they had done and building it into a viable, profitable online game; the Crusader Online demo was redubbed Infantry (Online). On a side note, I have no idea if what they made was what is known as the would-be "Crusader: No Mercy" (the so-called online version of Crusader with only one, possibly fake, screenshot ever released of the project). Nick created the GameFan Network, which would rehost warzone.com and Infantry's gaming servers, among other websites and deals. Jeff had the game quickly plow through pre-alpha and rapidly worked it up to be suitable for alpha testing. Larry Cordner was contracted to create content editors for the game, though he wouldn't stay on pay for long (and disappeared/was sacked upon the move to SOE). By October '98 the Harmless Games site was put up, along with the "most" official Infantry section, which is the only part of that site that got any attention at all. Juan created for Rod the insignia of HG - the Tri-Nitro-Teddy. Jeremy Weeks was contracted to create secondary concept art for Infantry. In November '98 HG officially announced Infantry; alpha testing was to commence shortly. Juan, with his part of the artwork finished, left the team. In March '99 HG officially announced BrainScan, a company founded by Nick, as the game's publisher, after many attempts at signing a publishing deal had fallen through; beta testing was to begin later that year, with a full pay-to-play release scheduled not far behind. Rod and Jeff clashed over Rod's desire to bring Infantry to Verant/Studio 989 (later renamed SOE) via his connections; Rod eventually left Infantry and HG to take a high-up position at Sony Online Entertainment (senior executive of games development, I believe it was). In late 2000, due to the dot-com crash, express.com failed to pay GameFan Network its dues (advertisement banner payments, of course) and GFN crashed and burned due to the lack of millions of dollars to cover its debts (as did, silently, BrainScan). Infantry's servers were slated for a shutdown, and the hunt for a new host/publisher began. And so they contacted Rod, and eventually all of the intellectual properties owned by Nick were sold to SOE (the ICQ-esque EGN2 program as well), with Infantry and Jeff among them, for an "undisclosed sum" (according to Nick the deal earned him a figure around 6 million USD). SOE's "The Station" announced the acquisition of Infantry. Infantry was still being run on GFN's last remaining online server, which for some reason someone (whoever the box hosting was bought from) forgot to take down - that is, until late October, at which point it was brought down and the long coma began.

Chapter #4

Come November 2000, Infantry went back online at SOE. Not long after, Cosmic Rift, the SubSpace clone, began development, and in April 2001 it was announced publicly. Jeff became more and more absent until finally disappearing from Infantry and its development altogether (we later learned that he was pulled away, at Rod's steering, onto EQ projects and SWG). SOE partially "fired" Jeremy only to rehire him later. Then in April 2002 the hellfire spit the brimstone: Infantry went pay-to-play, and the chain of broken promises and EQ customer support personnel being assigned as the game's executive producers began. A lot of discontent and grief arose from the player base. Some people began to run private, limited-ability servers from the beta era. Infantry players who had access to beta software, Gravitron among them, outraged by the injustice done to Jeff and the game, the betrayal by Rod, and not being able to stand SOE's continued abuse, mistreatment, and lies, made a statement and a point by gathering all the available material (namely the beta client, beta server, and editing tools) and releasing it to the public and whoever desired it (despite the predictable effect of anger and alienation from Jeff). Rod plummeted into the depths of EQ, and Jeff disappeared off the radar. Everyone else continued with their separate lives, employments, and projects.

Chapter #5

A supplemental as for SubSpace's well being.

About post-VIE SS: A Norwegian named Robert Oslo, alias BaudChaser, approached a Finnish ISP called Inet. Cutting a lot of events (and shit) short, he, along with the one known as Xalimar (an Exodus/C&W employee), whose real name eludes me, became the two carriers of the SS torch, as they arranged for the hosting of the game's zones. BaudChaser formed the SubSpace Council and, for as long as he stayed around up to his departure, took care to keep SS going and battled a lot of cheating, staff troubles (abuse), and grief. Eventually Inet stopped hosting SS and now Xalimar alone carries the burden, for the most part, of hosting the core SS zones. Priit Kasesalu, who apparently had been playing the game, started working for the current chief in power of the SSC and ex-Vangel AML league sysop, Alex Zinner (AKA Ghost Ship), hacking the server software and eventually creating his own client by reverse engineering the original SS, possibly having some sort of access to the source.

About "SubSpace 2" rumors mid 2003: The owners of BDE wished to create the perfect Peer2Peering network (Altnet.com), they needed a flagship product to prove investors that their way is just and right. For that, they contacted a company called Horizon which was specializing in P2P technology. Horizon was creating a P2P technology called Horizon's DEx, later on Horizon renamed to SilverPlatter and their technology to Alloy. Somewhere around 2002-2003 they were supposed to use BDE's Altnet in an E3 show to present the manifestation of this technology - Space Commander, presumably, SubSpace remade a new and being used as the first massive MPOG under Peer2Peer. However, silverplatter eventually went bunkrupt, for some reason, and nothing was known since and before about BDE's attempts at using the SubSpace properties which they owned aside this single E3 presentation.

Chapter #6

Additional update:

(wow, I must put this through a grammatical correction application) Somewhere along 2004-2005 Rod quit SOE as Executive Producer/VP of production (SOE seeming nowadays like a leaking boat about to drown) and joined Maxis to head up Sims projects. In October 2005, in a series of layoffs, Jeremy Weeks (yankee) was fired from SOE, apparently permanently this time; any shred of hope (not much to begin with) Infantry had is now diminished next to null. Jeff is still assumed to be working at SOE. Juan surfaced at Pandemic Studios, working for LucasArts on Battlefront I & II (and has a website, www.allinjuan.com). Somewhere later that year or in the beginning of 2006, a high-ranking moderator-player known as Mar snapped in the face of the continued abuse/neglect by the owners and, in an anti-SOE move, released the latest editor tools; his efforts were quickly quashed, however, and it is unknown if anyone got their hands on the software. He was, of course, stripped of his status and subsequently banned from the game. In February 2006, Rod was tracked down and gave his point of view, having been accused of not lending assistance to Infantry while being games development exec at SOE and clearly in a position to help:


You know what? You are probably right. At the time I was focused entirely on the big EQ issues which the entire company's survival hinged on. In retrospect Infantry could have been turned into a bigger product than it was by extra resources (although I will say it got more than other titles of similar sub bases). Somewhat ironically now I am completely fatigued by graphical MUDS, games like Infantry are interesting to me again. So yeah, I could have done some more at the time. Hopefully a lesson learned. Anyways I hope that serves by way of an honest explanation. I can imagine how frustrating it must have been as a player. All the best,

Rod


Glenn Henry interview

2004-06-09 08:00:00

This is an archive of an interview from the now defunct linuxdevices.com. It originally lived at linuxdevices.com/articles/AT2656883479.html. It's archived here because I refer to it on this site and the original source is gone.

Q1: Can you give us a short history of Centaur?

A1: The idea came to me and my co-founders in 1993. We were working at Dell -- I was Senior Vice President in charge of products. At that time, we were paying Intel I think $160 per processor. That was the lowest Intel price, and that was a special deal. So, it occurred to me that you could make a compatible part and sell it a lot lower. And that part, if not equally fast, would be fast enough for the masses of people.

No one seemed interested in doing that. AMD was just starting in the x86 business at the time, and they were trying to compete head-on with Intel. So, in early 1994, I quit Dell, and three other people came with me. We spent a year working out of our homes trying to get funding to start a company to build low-cost, low-power, x86 chips that were differentiated from Intel but fully compatible with all the x86 software.

Our theory at that time was sort of a "build it, and they will come" theory. We thought that if we could lower the price of the processor, it would stimulate not only low-cost PCs, but new applications we didn't know about in 1994.

We found funding from an American semiconductor company called IDT, and started Centaur. Centaur has never been an independent company in one sense -- we were previously wholly owned by IDT, and now we're wholly owned by VIA. On the other hand, we're an independent company in the way we operate. We have our own culture, our own payroll, etc.

We started officially on Apr. 1, 1995, the day the check came in the mail, an auspicious date. We shipped our first processor two years later, and then another a year and a half after that, in early-1999.

IDT decided to sell us because they had no presence in x86 or the PC world -- there was no synergism there. So they publicly put us on sale, and VIA bought us in September of 1999. The marriage was perfect, because VIA produces all the other silicon that goes into a PC. They design boards, their sister and cousin companies produce boards, their other cousin company makes all the other little low-cost parts for a PC -- all that was missing, from a hardware point of view, was the processor.

In fact, since you're LinuxDevices, I'll make a comment. When I was going around selling this argument, I would point out that the price of everything in a PC but two things was going down drastically, and therefore there's this huge opportunity to move "PC processing" into new dimensions. But the two things that weren't going down were reducing the opportunity. And those two things were the Intel processor, and the Microsoft software.

When we started, we had no answer for what to do about the Microsoft software. We just attacked the Intel processor part of it. But in the meantime, along came Linux. Our percentage of Linux -- I suspect, although I don't have the numbers to give you -- is much higher than other peoples' percentage of Linux, just because of the characteristics of our part.

VIA also had that vision of low-cost, low-power system platforms, so it was a good marriage, because we had the secret ingredient that was missing. As long as you have to buy a processor from Intel, you're obviously restricted in how low a price or small a form factor you can have.

Q2: So, currently the relationship to VIA is "wholly owned subsidiary?"

A2: Yes, in one sense. We're very independent, on a day-to-day basis. They don't send people here. My titular boss is WenChi Chen, the head of VIA, but I talk to him once a month by phone, and it's usually on strategic things. Day-to-day, month-to-month, we operate independently. We have our own payroll, own culture, etc. On the other hand, in terms of product strategy for the future, and practical issues like manufacturing and product support, we work very closely with them. In one sense we're an integrated division, and in one sense we're a contract processor design firm.

Q3: How many employees do you have now?

A3: We have roughly 82. That was one of our selling themes when I started this. But, it was a catch-22. To get people even vaguely interested, we had to have a low-cost theme. To have a low-cost theme, you have to have a very lean design cost, too. But on the other hand, when I told people, "Well, there's four of us in our kitchens, and with another 20 or 30 people we could build an x86 processor," no one would believe that.

Q4: Yeah, how is that possible?

A4: It's made possible by two things. One is sort of a product focus. The other is the culture, how we operate. Let me talk about the product focus first.

Intel designs processors -- and so does AMD -- number one, to be the world's fastest, and number two, to cover the whole world. The same basic processor is packaged in the morning as a server processor, and in the afternoon as a mobile processor, etc., etc. Not quite true, but... They sort of cover the world of workstations, servers, high-cost desktops, and mobile units. And AMD tried to do that. But they're also trying to be the world's fastest.

The idea I had, which actually was hard for people to accept could be successful, was, "Let's not try to make the world's fastest. Let's look at all the characteristics. Speed is one, cost is another, power consumption is another, footprint is another. And let's make something that wins on footprint size, cost, and power, and is fast enough to get the job done for 90 percent of the people but is not the fastest thing in the world."

The last 10 percent of performance is a huge cost. And not just the hardware side, but also in design complexity. So, in fairness, our parts are slower than Intel's. On the other hand, our parts are fast enough to get the job done for most people. Other than the marketing disadvantage of having slower parts, our parts perform quite well. And they have much lower power, and a much smaller footprint, and they cost much less. Those characteristics appeal to a number of applications.

So, that's our theme. The marketing guys don't like me to say "fast enough" performance, but you know, we're not as fast, head-to-head, as Intel is. But, we are fast enough to do 90 percent of the applications that are done, using a processor. Maybe even more than that. And we have very low power -- much lower than Intel or AMD -- and a really small footprint size -- much smaller than Intel or AMD.

The small footprint only appeals to some people, but for that class of people, i.e., in the embedded space, that characteristic is important. And, of course, the cost is very, very good. I can't give you actual numbers, but our parts sell in the neighborhood of $30 to $40 for a 1GHz part.

There's a second secret ingredient. I had 28 years of management before I started this. And, I had the luxury of starting a company with a clean sheet of paper, with three other guys who had worked for me for a long time. Our original owner, IDT, sent money and left us alone. So, we have created a culture that I think is the best in the world for doing this.

Our engineers are extremely experienced, and very, very productive. One engineer here does the work of many, many others in a big company. I won't go through all the details, because they're not technical, but basically, we started with the theory, "We're going to do this with 20 or 30 designers." Remember the 82 people I mentioned? Of those, only 35 are actual designers. The rest are testers, and things like that.

So, we said, we were going to do it, with that many people. As I said before, we had this idea to constrain the complexity of the hardware. We hired just the right people, and gave them just the right environment to work in. We bought the best tools, and developed our own tools when the best tools weren't available, etc., etc.

So I have a long story there, but the punchline is that we were able to hire extremely experienced and good people, and keep those people. We just passed our nine year anniversary, and the key people who started the company in the first year are almost all still here.

So, this is the secret, actually: all the things I do are underlying things to allow us to hire the right people and keep them motivated and happy, and not leaving.

Our general claim is, "This is a company designed by me to be the kind of place I wanted to work in as an engineer." It doesn't fit everyone, but it fits certain people, and for those people, it's probably the best environment in the world.

Q5: Bravo! That sounds like a great way to create a company.

A5: We were very lucky. I found two people who personally believed in that. This is a very hard story to sell. You got to go back to 1994, when I was travelling from company to company, and my story was, "Lookit, there's four of us in our kitchens in Austin, I want you to give us $15 million dollars, I want you to leave me alone for two years, and then I'll deliver an x86 processor." Right?

That's a very hard story to sell. We were lucky to find, at the time, the CEO of IDT, a person by the name of Len Perham, who personally believed the story. Since he was the CEO, Len was able to get others to believe in it. With VIA, the person we found that believed that was WenChi Chen, who's the CEO of VIA. Those people were both visionaries in their own right, and understood the importance of having an x86 processor.

My basic argument for why a person would want to do this was simple. It's clear that the processor is the black hole, and that all silicon is going to fall into it at some point. Integration is inevitable, in our business. Those who can do a processor can control their system destiny, and those who don't will end up totally at the mercy of other people, who can shut them out of business right away.

And as an example, when IDT bought us, they were making a lot of money on SRAMs that went into caches on PC motherboards. Two years later, there were no SRAMs on PC motherboards, because Intel put them on the die. That's going to happen to some of the chips today. All the other chips are gonna disappear at some point, and all that's left is the big chip, with the processor in the middle. You have to own that processor technology, or it won't be your chip.

So that's my basic sell job. It reached two people, but they were the right people.

Q6: Can you give a real quick summary of the history of the processors?

A6: Our first processor we shipped was called the WinChip C6. "WinChip" was a branding name that IDT had. It was 200MHz, but it had MMX. It was a Pentium with MMX. We shipped two or three more straightforward derivatives of that, added 3DNow, put an on-chip cache on it, then further enhanced the speed, and that's where we were when VIA bought us.

With VIA, we've shipped several major designs.

The first one we call internally C5A. There are three different names... it's very confusing. When I talk to people, I usually end up using our internal names. I used those in my talk at the Microprocessor Forum. VIA also uses another codename for a class of products that covers several of our products, names like Samuel, Nehemiah -- they're Bible names. And then there's the way the product is sold.

The first part we sold for VIA had a Pentium III bus on it, and was around 600MHz. Since VIA bought us, four and a half years ago, we have shipped four different variations. Each one is faster; each one has more functions, is more capable; each one is relatively lower power. The top Megahertz goes up, but the watts per Megahertz always goes down. They're all the same cost.

The product we're shipping now, the C5P, has a top speed of 1.4 to 1.5GHz today, but the sweet spot is 1GHz. We have a fanless version at 1GHz. We also sell all the way down to 533 or even 400MHz, for low-power applications.

To give you an idea about the 1GHz version we're selling today, the worst case power -- not "typical" or "average" power, which other people talk about -- our worst case power is 7 watts, which is low enough to do fanless at 1GHz [story], and no one else can do that.

Second, we also sell that same part at 533. Its worst-case power is 2.5 watts. So, remember, I'm talking worst-case power. Typical power, which a lot of people quote, is about half of the worst-case power. So, if we want to play games, we could say it's a 1 to 1.5 watt part at 533MHz, and it's a 3-watt part at 1GHz.

Along the way, we've used four different technologies. All our technologies since we were bought by VIA have been with TSMC [Taiwan Semiconductor Manufacturing Company]. We've used four major technologies, with, obviously, sub-technologies. So we've shipped four major designs, with two or three minor variations, but they weren't radically different. That's in four years.

We design products very quickly. That was also part of my theme: be lower cost than everybody else, and be able to move faster than everybody else. The things that make you able to do things with a small group also allow you to do things quickly. Actually, the more people you have, the slower things go: more communication, more decision-making, etc.

By the way, I stole this idea from Michael Dell. Quick anecdote: I was an IBM Fellow, and I managed very large groups -- hundreds of people -- at IBM. And I went to Dell originally to be their first Vice President of R&D. This was in 1988. So, I get to Dell, and find that the R&D department is six or seven guys that work directly for Michael! And Michael says, "Your job is to compete with Compaq." And I say, "Well, how can you do that with six or seven guys?" And he says, "That's the secret. We'll always be lower-cost, and we'll move quicker than they are." And of course, that's worked out very well at Dell.

We put out a lot of products, in a short period of time, which is actually a major competitive advantage. To give you one minor example, early this year, Intel started shipping a new processor called the Prescott. It's a variation of the Pentium 4. And it had new instructions in it, basically, that are called SSE3. We got our first part in late January. Those instructions are already in the next processor, that we've taped out to IBM.

Q7: That's the C5J?

A7: Yes. That's what I talked about [at the recent Embedded Processor Forum] out in California. I said there's four major processor types that we've shipped in four and a half years, but I take it back. There's five. The C5A, C5B, C5C, C5XL, and C5P. Five major designs that we've shipped in four and a half years. And the sixth is the C5J. It's headed for IBM.

Q8: Is TSMC still your normal fab?

A8: The old designs will still sell for quite a while. The new design is going to IBM. So we'll have both. We'll be shipping partial IBM and partial TSMC for quite a while. And we may go back again to TSMC in the future. This is normal business. We haven't burned any bridges. At least, we don't think we have. VIA does a tremendous amount of business with TSMC, and has a very close relation.

As technology advances, no one remains the best in the world across all technology versions. Typically, in 0.13 [micron process], this person's better than that person. Then you go to 0.09, and it may be different.

Q9: In terms of competition between embedded x86 processors, with AMD and Intel, where does VIA stand? For example, in comparison with the AMD NX line?

A9: All I know is what's in the specs. [AMD's NX] looks suspiciously like a 32-bit Athlon...

Q10: They did tell us it was a tweaked Athlon. [story]

A10: Right. And their power is reduced over their normal Athlon numbers, but it is still higher than ours. Let's be fair about it. If you wanted to build the fastest thing in the world, you'd choose the AMD part, or the Intel Pentium 4. However, we beat them on power -- both of them -- we beat them on cost, and we beat them on footprint [comparison chart at right, click to enlarge]. So, if what you want to build is a set-top box, or what you want to build is sort of a classic PC, or what you want to build is a thin client terminal, or what you want to build is a Webpad, our processor's performance is adequate and we win easily on cost, power, and footprint.

Q11: How do you position yourself relative to the GX, the old Geode stuff?

A11: Well, we blow it away in performance. I mean, it's a 400MHz part. It is a two-chip solution, and we are a three-chip solution today. I don't really know its specs, but power is probably close when you get down to the same Megahertz. It stops at 400MHz, while we start at 400MHz and go up to 1.5GHz. From 400 on up, we're faster, and our power is good. If 100MHz would do you, or 200MHz -- whatever their low point is, which I don't know exactly -- if that's good enough for you, then they will have lower power, because we don't go below 400MHz.

We think they're squeezed into a really narrow niche in the world, because their performance is so low. Anything from 400MHz on up, in the power that goes with that, we win.

So here's how we look at the world:

From 400MHz to 1GHz, there's nothing but us, that's competitive.

From 1GHz to 1.5GHz, we'll compete with the low-end of Intel and AMD.

From 1.5GHz on up today, if that's the speed you want, then you choose AMD or Intel.

My opinion is down at 400MHz and below, the GX has a very narrow slice. They're competing with things that have even better power than they do and good prices, the classical non-x86 embedded processors based on ARM and MIPS. And right on top of them, there's us. We have better performance and equal power at our low end, which is their high end. And, we stretch on up until we run into the bottom end of AMD and Intel.

We think that our area, the 500MHz to 1.5GHz range, represents a potentially massive opportunity. What we're seeing is interesting. About 50 percent of our sales are going into non-PCs. We're not taking sales away from other people, as much as we are enabling new applications -- things that have historically not been done at all, because it either was too expensive, or the power was too high, or the software cost was too great.

You know the VIA mini-ITX boards? [story] That is one of the smartest moves.

All of our engineers like to play. You know, there are robots roaming around here; people have built things like that? One guy built a rocket controller last week -- you know, normal engineering work. What our engineers do is what I tell people to do: buy a mini-ITX board, and add your value-add with frames, cover, and software.

It really is a major improvement over the classic thing where you had to do your own hardware. That board is so cheap: $70 to $150 down at Fry's, depending upon the variation. And it has all the I/O in the world. You can get 18 different versions of it, etc., you know the story. You can customize it using software, and you have a wide variety of operating systems to choose from, ranging from the world's most powerful, most expensive, to things that are free and very good.

Q12: What about nano-ITX [story] and even smaller opportunities, like PC/104 and System-on-Modules. There are standards from SBC companies like Kontron and 20 other companies. You ought to have a DevCon, like ARM and Intel do, to enable all these guys. Do you do anything special to enable them?

A12: I'm not sure of all that is done. VIA does make sample designs, board schematics, etc., available, and works with many customers on their unique solutions. And, of course, the mini-ITX does stimulate lots of designs.

We have other customers, that are doing their own unique designs, and some of them we work with to make things small. Right now, we are a three-chip solution. We're working on reducing that to a two-chip solution. By three-chip solution, I mean processor, northbridge, and southbridge to make a complete system. That'll be reduced to a two-chip solution, at some point, which reduces the footprint even more.

We've made a major improvement with the new package. You've seen that little teeny package we're shipping today, that we call the nano-BGA? [story] It's the size of a penny-type-thing. That's a real breakthrough on processor size. We need to get the other two chips boiled down to go much further.

People are doing other designs. For example, one of the things VIA's touting -- I don't know much about the details of it -- there's a group that's doing this handheld x86 gaming platform? [story] It sits in your hand like a Gameboy, but it's got a full x86 platform under it. And the theory is, you can run PC games on it. It has a custom motherboard in it, using our processor.

Q13: When you mentioned reducing from three chips to two chips, that reminded me of Transmeta. What would you say about Transmeta as an x86 competitor?

A13: Well, Transmeta has good power. They're really not any better than us, but we're better than the other guys. So I'd say, yeah, equal in power. They have okay footprint. They have two chips, but their two chips are big. I had a chart in the fall Microprocessor Forum showing that our three chips were the same size as their two chips. So there's not a big difference in footprint size. The biggest differences are threefold.

One, is that they're very costly. In their last quarterly earnings conference call, they quoted orally an ASP [average selling price] of I think it was $70, which is ridiculously high. You notice they're losing mass amounts of money -- $20 million a quarter -- so they need to try and keep prices high.

Two, is the fact that our performance on real-world applications is much better than Transmeta's. They do well at simple repetitive loops, because their architecture is designed to do well there. But, for real applications with big working sets, rambling threads of control, etc., we beat them badly. For example, in office applications such as Word, Excel, etc. -- the bread and butter of what the real world does.

But the other argument is sort of subtle, and people miss it. Their platform is very restricted. Their two-chip solution only talks to two graphics chips that exist in the world, right? And you have to choose. And if you don't like the memory controller they have, let's say you want X, etc. -- they only support certain memories, etc., etc.

We have a processor that talks to 15 different northbridges, each of which talks to 17 different southbridges. You want four network things? Fine, you got four network things. You want an integrated graphics panel, AGP... fine. You can configure all of those things using the normal parts. And, what we've found in the embedded world is that one size definitely does not fit all. That's why VIA itself has done eight variations of the mini-ITX board, and keeps doing more, and other vendors have made lots of other variations. Some have four serial ports, some have one; some have four networks, some have none, and so forth and so on. The flexibility offered by these standard parts in the PC world is, I think, a significant advantage.

Q14: You are chipset compatible with Intel parts?

A14: Oh, yes. We do most of our testing with Intel parts, just to make sure. We can drop into Pentium III motherboards, and do it all the time. In the embedded world, we sell primarily with VIA parts. They usually sell our processor either with a bag of parts, or built into the VIA mini-ITX motherboards, and now the nano-ITX [pictured at right, click to enlarge]. The other half of our business is in the world of low-cost PCs. There, it's whatever the board designer or OEM chose to use.

That's always been part of our secret. Basically, my selling strategy for that part of the world [using low-end PCs] is, hey, "Take your design, take your screwdriver, pry up that Pentium III, plug us in, and you'll save XX dollars." That isn't a very appealing thing in the United States. But if you go to the rest of the world, where people don't have as many dollars as we do, and where there isn't as much PC penetration, that's a reasonably appealing story. We've actually sold millions of parts, with that strategy.

Q15: Can you explain your packaging variations?

A15: The same die goes into multiple packages, and goes into different versions of the product. The C3, that's a desktop part that has the highest power of our parts. Antaur, that's the mobile version. It has fancy, very sophisticated dynamic power management enabled. Eden is the embedded part. All Edens are fanless. They run low enough power to be fanless, at whatever speed they are. So, one of our dies ends up in three branding bins: the C3, the Antaur, or the Eden. When you buy an Eden, 1GHz, there are sub-variations that may not be obvious that actually distinguish whether it was a C5P or a C5XL. But both are branded as Eden 1GHz. So that's the branding strategy.

Cutting across that, there are three packages. One, is the ceramic pin grid array, which is compatible with Pentium III. The PC market usually has a socketed processor, so we have a package that's compatible with that. We also have a cheaper ball grid array version of that, which is a little smaller, and it's cheaper because it's not socketed. And then we have the nanoBGA, which is that little teeny 15 by 15 mm thing; and that's an expensive package, so we charge a little bit more for that. But for people where space is a concern, that's appropriate. That small size is only offered in Eden, because that small size is only appealing to embedded guys. The standard PC market in general doesn't care about small size. When they design a motherboard, they generally design it big enough to handle anybody's processors -- AMD's, Intel's, or ours. We fit easily into that world.

Q16: Our reader surveys [story] show ARM overtaking x86 in new embedded project starts -- though we can't identify the MHz range of those projects. Can you comment on the competition from the higher ends of XScale and ARM and MIPS?

A16: I think there's a class of applications where an x86 is clearly going to win, a class where it's clearly going to lose, and that always leaves a middle.

One where it's going to lose is where your power budget, or your size budget, are really, really small. For example, the PDA, which has got an ARM in it. An x86 processor, even ours or Transmeta's, just consumes too much power. There isn't a size problem, probably, with our design -- there would be with Intel or AMD, where the chips just won't fit in that package [see size comparison photo at right; click to enlarge]. That type of application, you're going to choose an ARM, or a MIPS.

Where I think x86 wins easily, in my experience, is where, number one, the battery size versus runtime expectation is hours -- right? -- not days, not weeks. Or, there's a power supply. Set-top boxes, for example. They all want fanless parts, so they want really low power, but they have a power supply. They're not running on batteries, obviously.

The other characteristic that I think really shows the advantage of x86, is where an application has a user interface of any sophistication, or where the application is doing something particularly sophisticated. Obviously, you can run Windows CE, and even Linux, on something the size of a PDA. But those things are very limited. I have a 400MHz Toshiba PDA, the 800, the latest one out. The 400MHz ARM is the fastest you can buy. I installed an aviation application on it last week, and it runs really slow. I wish I had 800MHz, or 1GHz.

If you want very low power, there's really no choice. If you're not worried too much about the power, then there's two factors. One is how fast you need to run, and the other is the complexity of the software you need to run. Speed and software complexity favor x86.

We're not going to take over the printer world, or the carburetor control world. What we are doing is pushing the size and power envelope down every year, so x86 is able to reach smaller sizes and lower power every year but still maintain the high ground.

I made a joke when I attended the Embedded Processor Forum... I always talk at the fall Forum, you know, the Microprocessor Forum? And there, I'm always the slowest thing. I'm put in the group with 3GHz this, etc. But at this Embedded Forum, at 1GHz, we were far and away the fastest thing there, bar none.

There is definitely a major performance gap. We may be slower than a Pentium 4, for example, but we're still a lot faster than the MIPS and ARM chips.

Q17: Except a Geode might have the distinction of being speed-competitive with the ARM chips?

A17: I'll bet you a 400MHz ARM is a lot faster than a 400MHz Geode. There's an age-old penalty of x86. Plus, Geode's a three-year-old design. AMD hasn't changed it a bit. It doesn't have SSE, it's still using 3DNow, etc., etc. It doesn't have a number of other modern instructions that you'd like it to have. The ARM 400s are going to have lower power and faster performance. I don't actually know the price in those markets, so I can't comment on that.

Q18: We'd like to get you to talk a little bit more about Linux and the importance of Linux in terms of Centaur's success, and how the fit is.

A18: I think there are three things. First of all, our theme all along is to be able to produce the lowest-cost PC platform that there is. When I was going around selling this, I was talking about the sub-$1,000 PC, which is now a joke. There are PCs being sold with us in it that are sub-$200. So, let's look at a $200 PC, and right next to it is a $100 off-the-shelf version of Windows XP. The presence of a low-cost operating system is substantially more important at the low end of the low-price hardware market, which is our focus. I think it's a very big deal that we have a low-cost operating system to go with a sub-$200 PC.

The second thing is the embedded space. In embedded -- I'm using that as a very broad term -- one of the characteristics is customization. People are building applications that do a particular thing well, be it a set-top box, or a rocket controller, or whatever it may be. Customization for hardware is probably easier to do on Linux than it is on Windows. If it's an application that needs to have close control of hardware, needs to run very fast, needs to be lean, etc., then a lot of people are going to want to do it on Linux.

The third one, is one you haven't asked me about, this is actually my pet hobby, here -- we've added these fully sophisticated and very powerful security instructions into the...

Q19: That was my last question!

A19: So the classic question is, hey, you built some hardware, who's going to use it? Well, the answer is, six months after we first started shipping our product with encryption in it [story], we have three or four operating systems, including Linux, OpenBSD, and FreeBSD, directly supporting our security features in the kernel.

Getting support that quickly can't happen in the Microsoft world. Maybe they'll support it someday, maybe they won't. Quite honestly, if you want to build it, and hope that someone will come, you've got to count on something like the free software world. Free software makes it very easy for people to add functionality. You've got extremely talented, motivated people in the free software world who, if they think it's right to do it, will do it. That was my strategy with security.

We didn't have to justify it, because it's my hobby, so we did it. But, it would have been hard to justify these new hardware things without a software plan. My theory was simple: if we do it, and we do it right, it will appeal to the really knowledgeable security guys, most of whom live in the free software world. And those guys, if they like it, and see it's right, then they will support it. And they have the wherewithal to support it, because of the way open software works.

So those are my three themes, ignoring the fourth one, that's obvious: that without competition, Windows would cost even more. To summarize, for our business, [Linux is] important because it allows us to build lower-cost PC platforms, it allows people to build new, more sophisticated embedded applications easier, and it allows us, without any software costs, to add new features that we think are important to the world.

Our next processor -- I haven't ever told anyone, so I won't say what it is -- but our next processor has even more things in it that I think will be just as quickly adopted by the open source software world, and provide even more value.

It's always bothered me that hardware can do so many things relatively easily and fast that aren't done today because there's no software to support it. We just decided to try to break the mold. We were going to do hardware that, literally, had no software support at the start. And now the software is there, in several variations, and people are starting to use it. I actually think that's only going to happen in the open source world.

Q20: We'd like a few words from you about your security strategy, how you've been putting security in the chips, and so on.

A20: Securing one's information and data is sort of fundamental to human needs -- it's certainly fundamental to business needs. With the current world, in which everyone's attached to the Internet -- with most people's machines having back-door holes in them, whether they know it or not -- and with all the wireless stuff going on, people's data, whether they know it or not, is relatively insecure.

The people who know that are using secure operating systems, and they're encrypting their data. Encrypting of data's been around for a long time. We believe, though, that this should be a pervasive thing that should appear on all platforms, and should be built into all things.

It turns out, though, that security features are all computationally intensive. That's what they do. They take the bits and grind them up using computations, in a way that makes it hard to un-grind them.

So, we said, they're a perfect candidate for hardware. They're well-defined, they're not very big, they run much faster in hardware than in software -- 10 to 30 times, in the examples we use. And, they are so fundamental, that we should add the basic primitives to our processor.

How did we know what to add? We added government standards. The U.S. government has done extensive work on standardizing the encryption protocols, secure digital signature protocols, secure hash protocols. We used the most modern of government standards, built the basic functions into our chip, and did it in such a way that made it very easy for software to use.

Every time you send an email, every time you send a file to someone, that data should be encrypted. It's going out on the Internet, where anyone with half a brain can steal it.

Second, if you really care about not letting people have access to certain data that's on your hard drive, it ought to be encrypted, because half the PCs these days have some, I don't know what the right word is, some "spy" built into it, through a virus or worm, that can steal data and pass it back. You'll never get that prevented through operating system upgrades.

I do have some background, sort of, in security: it's always been my hobby. The fundamental assumption you should make is, assume that someone else can look at what you're looking at. In other words, don't try to protect your data by assuming that no one's going to come steal your hard drive, or no one can snoop through a backdoor in Windows. You protect your data by saying, "Even if they can see the data, what good is it going to do them?"

We think this is going to be a pervasive need. The common if-you-will person's awareness of worms and viruses has gone up a million percent in the last few years, based on all the problems. The awareness of the need to protect data is going to go up substantially, too.

We're doing more than encryption, though. There's another need, which is coming, related to message authentication and digital signatures.

We're encrypting all the time. Every time you buy something over the Web, your order is encrypted. So there is encryption going on already. But the next major thing -- and this is already done in the high-security circles of banks -- is message authentication through digital signatures. How do you know someone didn't intercept that order, and they're sending in their own orders using your credit card number? How do you know, when you get a message from somebody, that they didn't substitute the word "yes" for "no," things like that? These are very important in the world of security. They're well understood in the government world, or the high-security world, and there are government standards on how you do these things. They are called secure hashes, and things like that. So we've added features for those.

To summarize, the things we've added fall into three categories. One is a good hardware random number generator. That was actually the first thing, and that's actually one of the hardest things to do. It sounds trivial, but it's actually very hard to generate randomness, with any kind of process. It needs to be done in hardware. Software cannot generate random numbers that pass the tests that the government and others define.

The second thing we did is a significant speedup in the two basic forms of encryption. One's called symmetric key encryption, and the government standard is AES, which is a follow-on to a thing called DES. So we do AES encryption very fast. The other form of encryption that's widely used is public key encryption, and the most common form there is a thing called RSA. That's what's being used, you know, for secure Web transactions. We think we're the only people who've done this: we added instructions in our new processor that's coming to speed up RSA.

The third thing we've done is added what's called a secure hash algorithm. Again, it's a government standard. It's used for message authentication and digital signatures. It deals with the issue, if you send me an email, how do I know that the email I got was the one you sent? That it wasn't intercepted and changed? And more fundamentally, how do I know that it actually came from you? Anyone can put their name, in our world, on that email. Things like that. So there's got to be some code in that email that I can look at, and know that only you could have sent it. I can explain this more if you want to know.
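
To make that concrete, here is a minimal sketch of message authentication with a keyed secure hash (HMAC-SHA256). It uses OpenSSL rather than anything specific to these chips, and the key and message are made up for illustration; hardware like the features described above would only accelerate the underlying hash, the software-level idea is the same.

    /* Minimal sketch: message authentication with HMAC-SHA256 via OpenSSL.
     * The shared key and message are illustrative. Build: cc hmac_demo.c -lcrypto */
    #include <openssl/evp.h>
    #include <openssl/hmac.h>
    #include <stdio.h>

    int main(void) {
        const unsigned char key[] = "shared-secret-key";   /* agreed on in advance    */
        const unsigned char msg[] = "ship 100 units: yes"; /* the order or email body */
        unsigned char tag[EVP_MAX_MD_SIZE];
        unsigned int tag_len = 0;

        /* The sender computes a tag over the message with the shared key. */
        HMAC(EVP_sha256(), key, (int)(sizeof key - 1),
             msg, sizeof msg - 1, tag, &tag_len);

        /* The receiver recomputes the tag and compares. Any change to the
         * message ("yes" -> "no"), or a message from someone without the
         * key, produces a different tag. */
        for (unsigned int i = 0; i < tag_len; i++)
            printf("%02x", tag[i]);
        printf("\n");
        return 0;
    }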

Q21: That's probably sufficient. We're looking more for the strategy.

A21: Okay, let me back up. Our strategy was, assuming that we believe that security is fundamental and ought to be there, to define the primitive operations that need to be done as the building blocks of security. Those we put into hardware. We're not trying to impose a particular, I don't know, protocol or use. We're just making available the tools. We're doing it for free. The tools are in the processors, at no extra price. They don't require any OS support, no kernel support, no device drivers. It's getting into the kernels of BSD and Linux, but applications can directly use the features [even without kernel support], and the hardware takes care of the multitasking aspects.

The two guys who worked on it with me are both heavy Linux users. They wrote to friends in the security and Linux communities. Very little marketing money was spent.

When the security press release went out, at the Embedded Processor Forum, it had three key quotes, real quotes. Not quotes written by PR managers. My quote was written by a PR manager, but the others weren't. All three were big names in the security world, and all were saying good stuff.

Q22: Beyond security, are other cool features planned?

A22: The next chip has some tools to do computationally intensive things where hardware provides a big advantage. But I don't want to say yet what they are.

Q23: Would they be useful for multimedia?

A23: Yes, for multimedia, and for other things.

Q24: Like a DSP?

A24: Kind of like that.

Q25: Okay, we won't push. We appreciate you taking the time to speak with us. We can't imagine getting the president of AMD or Intel to do this.

A25: Our whole strategy is so close to the, if you will, the fate of Linux. We identify so much with it. We're low-cost, aimed at the common person, we're aimed at new applications, and we don't have any massive PR or marketing or sales budget, so. Actually, I have a special softness in my heart for Linux. I think without Linux our business would be much less than what it is today. It's just very important to us, so, I wanted to give you guys the time.

Other Interviews

If you liked this interview, you might also like this interview with Glenn on IBM history

comp.programming.threads FAQ

2001-08-01 08:00:00

This is an archive of the comp.programming.threads FAQ, which used to be hosted by Bill Lewis at the now defunct lambdacs.com. I believe this is up-to-date as of approximately 2001.

FAQ


  This is a list of the questions which have come up on the newsgroup with
  any answers that were given. (Somewhat edited by yours truly.)  In a few
  cases I have left in the names of the participants.  If you'd like me to
  remove your name, let me know.  If you have other comments/corrections, just
  drop me a line (Bil LambdaCS.com).  (Of course I'll expect *you* to supply 
  any corrections! :-)

  This list is a bit of a hodge-podge, containing everything that I thought
  *might* be useful. Hence it is HUGE and not very well edited. It even has
  duplicates (or worse, near-duplicates). The MFAQ is much smaller and better
  maintained. You may wish to check there first.


-Bil


     ==================================================================
          F R E Q U E N T L Y    A S K E D    Q U E S T I O N S 
     ==================================================================
                Also see:

     Brian's FAQ: http://www.serpentine.com/~bos/threads-faq
     (Sun's Threads page and FAQ are no more.)

  Many of the most general questions can be answered by reading (a) the
  welcome message, (b) the general information on the other threads pages,
  and (c) any of the books on threads.  References to all of these can be
  found in the welcome message.

Q1:   How fast can context switching be?
Q2:   What about special purpose processors?
Q3:   What kinds of issues am I faced with in async cancellation?
Q4:   When should I use these new thread-safe "_r" functions?
Q5:   What benchmarks are there on POSIX threads?
Q6:   Has anyone used the Sparc atomic swap instruction?
Q7:   Are there MT-safe interfaces to DBMS libraries?
Q8:   Why do we need re-entrant system calls?
Q9:   Any "code-coverage" tools for MT applications?
Q10:  How can POSIX join on any thread?
Q11:  What is the UI equivalent for PTHREAD_MUTEX_INITIALIZER?
Q12:  How many threads are too many in one heavyweight process? 
Q13:  Is there an atomic mutex_unlock_and_wait_for_event()?
Q14:  Is there an archive of this newsgroup somewhere?
Q15:  Can I copy pthread_mutex_t structures, etc.?
Q16:  After 1800 calls to thr_create() the system freezes. ??
Q17:  Compiling libraries which might be used in threaded or unthreaded apps?
Q18:  What's the difference of signal handling for process and thread? 
Q19:  What about creating large numbers of threads?
Q20:  What about using sigwaitinfo()?
Q21:  How can I have an MT process communicate with many UP processes?
Q22:  Writing Multithreaded code with Sybase CTlib ver 10.x?
Q23:  Can we avoid preemption during spin locks?
Q24:  What about using spin locks instead of adaptive spin locks?
Q25:  Will thr_create(...,THR_NEW_LWP) fail if the new LWP cannot be added?
Q26:  Is the LWP released upon bound thread termination?
Q27:  What's the difference between pthread FIFO and the Solaris threads scheduling?
Q28:  I really think I need time-sliced RR.
Q29:  How important is it to call mutex_destroy() and cond_destroy()?
Q30:  EAGAIN/ENOMEM etc. apparently aren't in <errno.h>?!
Q31:  What can I do about TSD being so slow?
Q32:  What happened to the pragma 'unshared' in Sun C?
Q33:  Can I profile an MT-program with the debugger?
Q34:  Sometimes the specified sleep time is SMALLER than what I want.
Q35:  Any debugger that single step a thread while the others are running?
Q36:  Any DOS threads libraries?
Q37:  Any Pthreads for Linux?
Q38:  Any really basic C code example(s) and get us newbies started?
Q39:  Please put some Ada references in the FAQ.
Q40:  Which signals are synchronous, and which are asynchronous?
Q41:  If we compile -D_REENTRANT, but without -lthread, will we have problems?
Q42:  Can Borland C++ for OS/2 give up a TimeSlice?
Q43:  Are there any VALID uses of suspension?
Q44:  What's the status of pthreads on SGI machines?
Q45:  Does the Gnu debugger support threads?
Q46:  What is gang scheduling?
Q47:  LinuxThreads linked with X11, calls to X11 seg fault.
Q48:  Are there Pthreads on NT?
Q49:  What about garbage collection?
Q50:  Does anyone have any information on thread programming for VMS?
Q51:  Any information on the DCE threads library?
Q52:  Can I implement pthread_cleanup_push without a macro?
Q53:  What switches should be passed to particular compilers?
Q54:  How do I find Sun's bug database?
Q55:  How do the various vendors' threads libraries compare?
Q56:  Why don't I need to declare shared variables VOLATILE?
Q57:  Do pthread_cleanup_push/pop HAVE to be macros (thus lexically scoped)?
Q58:  Analyzer Fatal Error[0]:  Slave communication failure ??
Q59:  What is the status of Linux threads?
Q60:  The Sunsoft debugger won't recognize my PThreads program!
Q61:  How are blocking syscall handled in a two-level system?
Q62:  Can one thread read from a socket while another thread writes to it?
Q63:  What's a good way of writing threaded C++ classes?
Q64:  Can thread stacks be built in privately mapped memory?
Q66:  I think I need a FIFO mutex for my program...
Q67:  Why my multi-threaded X11 app with LinuxThreads crashes?
Q68:  How would we put a C++ object into a thread?
Q69:  How different are DEC threads and Pthreads?
Q70:  How can I manipulate POSIX thread IDs?
Q71:  I'd like a "write" that allowed a timeout value...
Q72:  I couldn't get threads to work with glibc-2.0.
Q73:  Can I do dead-owner-process recovery with POSIX mutexes?
Q74:  Will IRIX distribute threads immediately to CPUs?
Q75:  IRIX pthreads won't use both CPUs?
Q76:  Are there thread mutexes, LWP mutexes *and* kernel mutexes?
Q77:  Does anyone know of a MT-safe alternative to setjmp and longjmp?
Q78:  How do I get more information inside a signal handler?
Q79:  Is there a test suite for Pthreads? 
Q80:  Flushing the Store Buffer vs. Compare and Swap
Q81:  How many threads CAN a POSIX process have? 
Q82:  Can Pthreads wait for combinations of conditions?
Q83:  Shouldn't pthread_mutex_trylock() work even if it's NOT PTHREAD_PROCESS_SHARED?
Q84:  What about having a NULL thread ID?
Q85:  Explain Traps under Solaris
Q86:  Is there anything similar to posix condition variables in Win32 API ?
Q87:  What if a cond_timedwait() times out AND the condition is TRUE?
Q88:  How can I recover from a dying thread?
Q89:  How to implement POSIX Condition variables in Win32?
Q90:  Linux pthreads and X11
Q91:  One thread runs too much, then the next thread runs too much!
Q92:  How do priority levels work?
Q93:  C++ member function as the startup routine for pthread_create(). 
Q94:  Spurious wakeups, absolute time, and pthread_cond_timedwait()
Q95:  Conformance with POSIX 1003.1c vs. POSIX 1003.4a?
Q96:  Cleaning up when kill signal is sent to the thread.?
Q97:  C++ new/delete replacement that is thread safe and fast?
Q98:  beginthread() vs. endthread() vs. CreateThread? (Win32)
Q99:  Using pthread_yield()?
Q100: Why does pthread_cond_wait() reacquire the mutex prior to being cancelled?
Q101: HP-UX 10.30 and threads?
Q102: Signals and threads are not suited to work together?
Q103: Patches in IRIX 6.2 for pthreads support?
Q104: Windows NT Fibers?
Q105: LWP migrating from one CPU to another in Solaris 2.5.1?
Q106: What conditions would cause that thread to disappear?
Q107: What parts, if any, of the STL are thread-safe?
Q108: Do pthreads libraries support cooperative threads?
Q109: Can I avoid mutexes by using globals?
Q110: Aborting an MT Sybase SQL?
Q111: Other MT tools?
Q112: That's not a book. That's a pamphlet!
Q114: How to cleanup TSD in Win32?
Q115: Onyx1 architecture has one problem
Q116: LinuxThreads linked with X11 seg faults.
Q117: Comments about Linux and Threads and X11
Q118: Memory barriers for synchronization
Q119: Recursive mutex debate
Q120: Calling fork() from a thread
Q121: Behavior of [pthread_yield()] sched_yield()
Q122: Behavior of pthread_setspecific()
Q123: Linking under OSF1 3.2: flags and library order
Q124: What is the TID during initialization? 
Q125: TSD destructors run at exit time... and if it crashes?
Q126: Cancellation and condition variables
Q127: RedHat 4.2 and LinuxThreads?
Q128: How do I measure thread timings? 
Q129: Contrasting Win32 and POSIX thread designs
Q130: What does POSIX say about putting stubs in libc?
Q131: MT GC Issues
Q132: Some details on using CMA threads on Digital UNIX 
Q133: When do you need to know which CPU a thread is on?
Q134: Is any difference between default and static mutex initialization? 
Q135: Is there a timer for Multithreaded Programs? 
Q136: Roll-your-own Semaphores 
Q137: Solaris sockets don't like POSIX_C_SOURCE!
Q138: The Thread ID changes for my thread! 
Q139: Does X11 support multithreading ? 
Q140: Solaris 2 bizarre behavior with usleep() and poll()
Q141: Why is POSIX.1c different w.r.t. errno usage? 
Q142: printf() anywhere AFTER pthread_create() crashes on HPUX 10.x 
Q143: Pthreads and Linux 
Q144: DEC release/patch numbering 
Q145: Pthreads (almost) on AS/400 
Q146: Can pthreads & UI threads interoperate in one application?
Q147: Thread create timings 
Q148: Timing Multithreaded Programs (Solaris) 
Q149: A program which monitors CPU usage? 
Q150: standard library functions: what's safe and what's not?
Q151: Where are semaphores in POSIX threads? 
Q152: Thread & sproc (on IRIX) 
Q153: C++ Exceptions in Multi-threaded Solaris Process 
Q154: SCHED_FIFO threads without root privileges ? 
Q155: "lock-free synchronization" 
Q156: Changing single bytes without a mutex 
Q157: Mixing threaded/non-threadsafe shared libraries on Digital Unix 
Q158: VOLATILE instead of mutexes? 
Q159: After pthread_cancel() destructors for local object do not get called?!
Q160: No pthread_exit() in Java.
Q161: Is there anyway I can make my stacks red zone protected?
Q162: Cache Architectures, Word Tearing, and VOLATILE
Q163: Can ps display thread names?
Q164: (Not!) Blocking on select() in user-space pthreads.
Q165: Getting functional tests for UNIX98
Q166: To make gdb work with linuxthreads?
Q167: Using cancellation is *very* difficult to do right...
Q168: Why do pthreads implementations differ in error conditions?
Q169: Mixing threaded/non-threadsafe shared libraries on DU
Q170: sem_wait() and EINTR
Q171: pthreads and sprocs
Q172: Why are Win32 threads so odd?
Q173: What's the point of all the fancy 2-level scheduling??
Q174: Using the 2-level model, efficiency considerations, thread-per-X
Q175: Multi-platform threading api
Q176: Condition variables on Win32 
Q177: When stack gets destroyed relative to TSD destructors?
Q178: Thousands of mutexes?
Q179: Threads and C++
Q180: Cheating on mutexes
Q181: Is it possible to share a pthread mutex between two distinct processes?
Q182: How should one implement reader/writer locks on files?
Q183: Are there standard reentrant versions of standard nonreentrant functions?
Q184: Detecting the number of cpus
Q185: Drawing to the Screen in more than one Thread (Win32)
Q186: Digital UNIX 4.0 POSIX contention scope
Q187: Dec pthreads under Windows 95/NT?
Q188: DEC current patch requirements
Q189: Is there a full online version of 1003.1c on the web somewhere?
Q190: Why is there no InterlockedGet?
Q191: Memory barrier for Solaris
Q192: pthread_cond_t vs pthread_mutex_t
Q193: Using DCE threads and java threads together on hpux(10.20)
Q194: My program returns enomem on about the 2nd create.
Q195: Does pthread_create set the thread ID before the new thread executes?
Q196: thr_suspend and thr_continue in pthread
Q197: Are there any opinions on the Netscape Portable Runtime?
Q198: Multithreaded Perl
Q199: What if a process terminates before mutex_destroy()?
Q200: If a thread performs an illegal instruction and gets killed by the system...
Q201: How to propagate an exception to the parent thread?
Q202: Discussion: "Synchronously stopping things" / Cheating on Mutexes
Q203: Discussion: Thread creation/switch times on Linux and NT.
Q204: Are there any problems with multiple threads writing to stdout?
Q205: How can I handle out-of-band communication to a remote client?
Q206: I need a timed mutex for POSIX
Q207: Does pthreads has an API for configuring the number of LWPs?
Q208: Why does Pthreads use void** rather than void*?
Q209: Should I use poll() or select()?
Q210: Where is the threads standard of POSIX ????
Q211: Is Solaris' unbound thread model braindamaged?
Q212: Releasing a mutex locked (owned) by another thread.
Q213: Any advice on using gethostbyname_r() in a portable manner?
Q214: Passing file descriptors when exec'ing a program.
Q215: Thread ID of thread getting stack overflow? 
Q216: Why aren't my (p)threads preempted?
Q217: Can I compile some modules with and others without _POSIX_C_SOURCE?
Q218: timed wait on Solaris 2.6?
Q219: Signal delivery to Java via native interface
Q220: Concerning timedwait() and realtime behavior.
Q221: pthread_attr_getstacksize on Solaris 2.6
Q222: LinuxThreads: Problem running out of TIDs on pthread_create
Q223: Mutexes and the memory model
Q224: Poor performance of AIO in Solaris 2.5?
Q225: Strategies for testing multithreaded code?
Q226: Threads in multiplatform NT 
Q227: Guarantee on condition variable predicate/pthreads?
Q228: Pthread API on NT? 
Q229: Sockets & Java2 Threads
Q230: Emulating process shared threads 
Q231: TLS in Win32 using MT run-time in dynamically loaded DLLs?
Q232: Multithreaded quicksort
Q233: When to unlock for using pthread_cond_signal()?
Q234: Multi-Read One-Write Locking problem on NT
Q235: Thread-safe version of flex scanner 
Q236: POSIX standards, names, etc
Q237: Passing ownership of a mutex?
Q238: NT fibers
Q239: Linux (v.2.0.29 ? Caldera Base)/Threads/KDE 
Q240: How to implement user space cooperative multithreading?
Q241: Tools for Java Programming 
Q242: Solaris 2.6, pthread_cond_timedwait() wakes up early
Q243: AIX4.3 and PTHREAD problem
Q244: Readers-Writers Lock source for pthreads
Q245: Signal handlers in threads 
Q246: Can a non-volatile C++ object be safely shared amongst POSIX threads?
Q247: Single UNIX Specification V2
Q248: Semantics of cancelled I/O (cf: Java)
Q249: Advice on using multithreading in C++?
Q250: Semaphores on Solaris 7 with GCC 2.8.1 
Q251: Draft-4 condition variables (HELP) 
Q252: gdb + linuxthreads + kernel 2.2.x = fixed :) 
Q253: Real-time input thread question
Q254: How does Solaris implement nice()?  
Q255: Re: destructors and pthread cancelation...  
Q256: A slight inaccuracy WRT OS/2 in Threads Primer 
Q257: Searching for an idea 
Q258: Benchmark timings from "Multithreaded Programming with Pthreads" 
Q259: Standard designs for multithreaded applications?
Q260: Threads and sockets: Stopping asynchronously
Q261: Casting integers to pointers, etc. 
Q262: Thread models, scalability and performance  
Q263: Write threaded programs while studying Japanese!  
Q264: Catching SIGTERM - Linux v Solaris 
Q265: pthread_kill() used to direct async signals to thread? 
Q266: Don't create a thread per client 
Q267: More thoughts on RWlocks 
Q268: Is there a way to 'store' a reference to a Java thread? 
Q269: Java's pthread_exit() equivalent?  
Q270: What is a "Thread Pool"?
Q271: Where did "Thread" come from?
Q272: How do I create threads in a Solaris driver?
Q273: Synchronous signal behavior inconsistent?
Q274: Making FORTRAN libraries thread-safe?
Q275: What is the wakeup order for sleeping threads?
Q276: Upcalls in VMS?
Q277: How to design synchronization variables?
Q278: Thread local storage in DLL?
Q279:  How can I tell what version of linux threads I've got?
Q280: C++ exceptions in a POSIX multithreaded application?
Q281: Problems with Solaris pthread_cond_timedwait()?
Q282: Benefits of threading on uniprocessors
Q283: What if two threads attempt to join the same thread?
Q284: Questions with regards to Linux OS?
Q285: I need to create about 5000 threads?
Q286:  Can I catch an exception thrown by a sla
Q287: _beginthread() versus CreateThread()?
Q288: Is there a select() call in Java??
Q289: Comment on use of VOLATILE in the JLS.?
Q290: Should I try to avoid GC by pooling objects myself??
Q291: Does thr_X return errno values? What's errno set to???
Q292: How can I wait on more than one condition variable in one place?
Q293: Details on MT_hot malloc()?
Q294: Bug in Bil's condWait()?
Q295: Is STL considered thread safe??
Q296: To mutex or not to mutex an int global variable ??
Q297: Stack overflow problem ?
Q298: How would you allow the other threads to continue using a "forgotten" lock?
Q299: How unfair are mutexes allowed to be?
Q300: Additionally, what is the difference between -lpthread and -pthread?
Q301: Handling C++ exceptions in a multithreaded environment?
Q302: Pthreads on IRIX 6.4 question?
Q303: Threading library design question ?
Q304: Lock Free Queues?
Q305: Threading library design question ?
Q306: Stack size/overflow using threads ?
Q307: correct pthread termination?
Q308: volatile guarantees??
Q309: passing messages, newbie?
Q310: solaris mutexes?
Q311: Spin locks?
Q312: AIX pthread pool problems?
Q313: iostream library and multithreaded programs?
Q314: Design document for MT appli?
Q315: SCHED_OTHER, and priorities?
Q316: problem with iostream on Solaris 2.6, Sparcworks 5.0?
Q317: pthread_mutex_lock() bug ???
Q318: mix using thread library?
Q319: Re: My agony continues (thread safe gethostbyaddr() on FreeBSD4.0) ?
Q320: OOP and Pthreads?
Q321: query on threading standards?
Q322: multiprocesses vs multithreaded..??
Q323: CGI & Threads?
Q324: Cancelling detached threads (posix threads)?
Q325: Solaris 8 recursive mutexes broken?
Q326: sem_wait bug in Linuxthreads (version included with glibc 2.1.3)?
Q327: pthread_atfork??
Q328: Does anybody know if the GNU Pth library supports process shared mutexes?
Q329: I am trying to make a thread in Solaris to get timer signals.
Q330: How do I time individual threads?
Q331: I'm running out of IPC semaphores under Linux!
Q332: Do I have to abandon the class structure when using threads in C++?
Q333: Questions about pthread_cond_timedwait in linux.
Q334: Questions about using pthread_cond_timedwait.
Q335: What is the relationship between C++ and the POSIX cleanup handlers?
Q336: Does select() work on calls recvfrom() and sendto()?
Q337: libc internal error: _rmutex_unlock: rmutex not held.
Q338: So how can I check whether the mutex is already owned by the calling thread?
Q339: I expected SIGPIPE to be a synchronous signal.
Q340: I have a problem between select() and pthread...
Q341: Mac has Posix threading support.
Q342: Just a few questions on Read/Write for linux.
Q343: The man pages for ioctl(), read(), etc. do not mention MT-safety.
Q344: Status of TSD after fork()?
Q345: Static member function vs. extern "C" global functions?
Q346: Can i kill a thread from the main thread that created it?
Q347: What does /proc expose vis-a-vis LWPs?
Q348: What mechanism can be used to take a record lock on a file?
Q349: Implementation of a Timed Mutex in C++
Q350: Effects that gradual underflow traps have on scaling.
Q351: LinuxThreads woes on SIGSEGV and no core dump.
Q352: On timer resolution in UNIX.
Q353: Starting a thread before main through dynamic initialization.
Q354: Using POSIX threads on Mac OS X and Solaris?
Q355: Comments on ccNUMA on SGI, etc.
Q356: Thread functions are NOT C++ functions! Use extern "C"
Q357: How many CPUs do I have?
Q358: Can malloc/free allocate from a specified memory range?
Q359: Can GNU libpth utilize multiple CPUs on an SMP box?
Q360: How does Linux pthreads identify the thread control structure?
Q361: Using gcc -kthread doesn't work?!
Q362: FAQ or tutorial for multithreading in 'C++'?
Q363: WRLocks & starvation.
Q364: Reference for threading on OS/390.
Q365: Timeouts for POSIX queues (mq_timedreceive())
Q366: A subroutine that gives cpu time used for the calling thread?
Q367: Documentation for threads on Linux
Q368: Destroying a mutex that was statically initialized.
Q369: Tools for debugging overwritten data.
Q370: POSIX synchronization is limited compared to win32.
Q371: Anyone recommend us a profiler for threaded programs?
Q372: Coordinating thread timeouts with drifting clocks.
Q373: Which OS has the most conforming POSIX threads implementation?
Q374: MT random number generator function.
Q375: Can the main thread sleep without causing all threads to sleep?
Q376: Is dynamic loading of the libpthread supported in Redhat?
Q377: Are reads and writes atomic?
Q378: More discussion on fork().
Q379: Performance differences: POSIX threads vs. ADA threads?
Q380: Maximum number of threads with RedHat 255?
Q381: Best MT debugger for Windows...
Q382: Thread library with source code ? 
Q383: Async cancellation and cleanup handlers.
Q384: How easy is it to use pthreads on win32?
Q385: Does POSIX require two levels of contention scope?
Q386: Creating threadsafe containers under C++
Q387: Cancelling pthread_join() DOESN'T detach target thread?
Q388: Scheduling policies can have different ranges of priorities?
Q389: The entity life modeling approach to multi-threading.
Q390: Is there any (free) documentation?
Q391: Grafting POSIX APIs on Linux is tough!
Q392: Any companies  using pthread-win32?
Q393: Async-cancel safe function: guidelines?
Q394: Some detailed discussion of implementations.
Q395: Cancelling a single thread in a signal handler?
Q396: Trouble debugging under gdb on Linux.
Q397: Global signal handler dispatching to threads.
Q398: Difference between the Posix and the Solaris Threads?
Q399: Recursive mutexes are broken in Solaris?
Q400: pthreads and floating point attributes?
Q401: Must SIGSEGV be sent to the thread which generated the signal?
Q402: Windows and C++: How?
Q403: I have blocked all signals and don't get SEGV!
Q404: AsynchronousInterruptedException (AIE) and POSIX cancellation


=================================TOP===============================
 Q1: How fast can context switching be?  

In general purpose processors (SPARC, MIPS, ALPHA, HP-PA, POWER, x86) a
LOCAL thread context switch takes on the order of 50us.  A GLOBAL thread
context switch takes on the order of 100us.  However...

[email protected] (Abdelsalam Heddaya) writes:

>- Certain multi-threaded processor architectures, with special support
>  for on-chip caching of thread contexts can switch contexts in,
>  typically, less than 10 cycles, down to as little as one cycle.

The Tera machine switches with 0 cycles of overhead.

>  Such processors still have to incur a high cost when they run out of
>  hardware contexts and need to perform a full "context swap" with
>  memory.

Hmmm.  With 128 contexts/processors and 16 processors on the smallest
machine, we may be talking about a rare situation.  Many people doubt
we'll be able to keep the machine busy, but you propose an
embarrassment of riches/parallelism.

In any case, I disagree with the implication that a full context swap
is a problem to worry about.  We keep up to 2048 threads active at a
time, with others confined to memory.  The processors issue
instructions for the active threads and completely ignore the inactive
threads -- there's no swapping of threads between processor and memory
in the normal course of execution.  Instead, contexts are "swapped"
when one thread finishes, or blocks too long, or is swapped to disk,
etc.  In other words, at fairly significant intervals.

Preston Briggs
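
For a rough number on your own system, one common trick (a sketch under the assumption of a POSIX system with pthreads and pipes, not something from the original posts) is to ping-pong a byte between two threads and divide the elapsed time by the iteration count. Each round trip includes two switches plus four pipe syscalls, so treat the result as an upper bound on switch cost:

    /* Rough sketch: estimate thread ping-pong cost with two pipes. */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define ITERS 100000

    static int to_worker[2], to_main[2];

    static void *worker(void *arg) {
        char c;
        for (int i = 0; i < ITERS; i++) {
            read(to_worker[0], &c, 1);   /* wait for ping  */
            write(to_main[1], &c, 1);    /* send pong back */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t;
        char c = 'x';
        struct timespec start, end;

        pipe(to_worker);
        pipe(to_main);
        pthread_create(&t, NULL, worker, NULL);

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < ITERS; i++) {
            write(to_worker[1], &c, 1);
            read(to_main[0], &c, 1);
        }
        clock_gettime(CLOCK_MONOTONIC, &end);
        pthread_join(t, NULL);

        double ns = (end.tv_sec - start.tv_sec) * 1e9
                  + (end.tv_nsec - start.tv_nsec);
        printf("~%.0f ns per round trip\n", ns / ITERS);
        return 0;
    }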

=================================TOP===============================

 Q2: What about special purpose processors?  

What are the distinctions between these special purpose processors and
the general purpose processors we're using?


??

=================================TOP===============================
 Q3: What kinds of issues am I faced with in async cancellation?  


Michael C. Cambria wrote:
> 
> In article <[email protected]>, [email protected] (Spike White) wrote:
> [deleted]
> > thread2()
> > {
> >    ...
> >    while(1) {
> >       pthread_setasynccancel(CANCEL_ON);
> >       pthread_testcancel();  /* if there's a pending cancel */
> >       read(...);
> >       pthread_setasynccancel(CANCEL_OFF);
> >       ...process data...
> >    }
> > }
> >
> > Obviously, you shouldn't use any results from the read() call that was
> > cancelled -- God knows what state it was when it left.
> >
> > That's the only main use I've ever found for async cancel.
> 
> I used something quite similar to your example (quoted above) in my
> original question.
> 
> Since the read() call itself is not async cancel safe according to Posix,
> is it even safe to do the above?  In general for any posix call which is
> not async cancel safe, my guess (and many e-mails to me agree) is to
> just not use it.
> 
> Using read() as an example, I'll bet everyone will agree with you not
> to use the results of the read() call.  However, the the motivation for
> my original question was, being as a call() is not async cancel safe,
> by canceling a thread when it is in one of these calls _may_ screw up
> other threads in general and other threads using the same fd in
> particular.  This is why I asked why one would use it.
> 
> In your example, if read() did anything with static data, the next read on
> that fd could have problems if a thread was cancelled while in the read().
> (Note:  if you don't like the "static data" example, substitute whatever
> you like for the implementation reason for read(), or any call, not being
> async cancel safe.  I used static data as an example only.)
> 
> Mike

Specifically, NO, it is NOT safe to call read() with async cancel. On some
implementations it may work, sometimes. In general, it *MAY* work if, on
the particular release of your particular operating system, read() happens
to be implemented with no user-mode code (aside from a syscall trap). In
most cases, a user mode cancel will NOT be allowed to corrupt kernel data.

However, no implementations make any guarantees about their implementation
of read(). It may be a syscall in one version and be moved partly into
libc in the next version.

Unfortunately, the OSF DCE porting guide made reference to the possibility
of using async cancel in place of synchronous system cancel capability on
platforms that don't support the latter. That was really too bad, and it
set a very dangerous precedent.

POSIX 1003.1c-1996 encourages all routines to document whether they are
async cancel safe. (Luckily the advice is in rationale -- which is to say
it's really just commentary and not part of the standard -- because it'd
be horrendously difficult to change the documentation of every single
routine in a UNIX system.) In practice, you should always assume that a
function is NOT async cancel safe unless it says that it IS. And you won't
see that very often.

Because, as has already been commented, async cancel really isn't very
useful. There is a certain small class of application that can benefit
dramatically from async cancel, for good response to shutdown requests in
long-running compute-bound threads. In a long and tight loop it's not
practical to call pthread_testcancel(). So in cma we provided async cancel
for those cases. In retrospect I believe that's probably one of the bad
parts of cma, which POSIX should have omitted. There may well have been
"hard realtime" people in the room who wanted to use it, though (the POSIX
threads standard was developed by roughly 10 "threads people" and 40 to 50
"realtime people").

------------------------------------------------------------------------
Dave Butenhof                              Digital Equipment Corporation
[email protected]                       110 Spit Brook Rd, ZKO2-3/Q18
Phone: 603.881.2218, FAX: 603.881.0120     Nashua, NH 03062-2711
                 "Better Living Through Concurrency"
------------------------------------------------------------------------


> In article <[email protected]>,
> Jose Luis Ramos Morán  wrote:
> %   pthread_setcancelstate(PTHREAD_CANCEL_ENABLE,NULL);
> %   pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS,NULL);
>
> I would guess that your problem comes from this. Asynchronous cancellation
> is almost never a good idea, but if you do use it, you should be really
> careful about whether there's anything with possible side-effects in your
> code. For instance, the C++ exception handler could be screwed up for your
> whole process if you cancel at a bad moment.
>
> Anyway, try taking out the asynchronous cancellation and see if the problem
> goes with it.

I'll put it a little more strongly than Patrick. The program is illegal. You
CANNOT call any function with asynchronous cancel enabled unless that function
is explicitly specified as "async-cancel safe". There are very few such
functions, and sleep() is not one of them. In fact, within the scope of the
POSIX and UNIX98 standards, with async cancel enabled you are allowed only to

  1. Disable asynchronous cancellation (set cancel type to DEFERRED)
  2. Disable cancellation entirely (set cancel state to DISABLE)
  3. Call pthread_cancel() [This is bizarre and pointless, but it is specified
     in the standard.]

If you call any other function defined by ANSI C, POSIX, or UNIX98 with async
cancel enabled, then your program is nonportable and "non conforming". It MAY
still be "correct", but only IF you are targeting your code to one specific
implementation of the standard that makes the NON-portable and NON-standard
guarantee, in writing, that the function you're calling actually is
async-cancel safe on that implementation. Otherwise, the program is simply
broken.

You can, of course, write your own async-cancel safe functions. It's not that
hard to do. In general, like most correct implementations of pthread_cancel(),
you simply DISABLE async cancellation on entry and restore the previous
setting on exit. But it's silly to do that very often. And, of course, that's
not the same as actually allowing async cancel. THAT is a much, much harder
job, except for regions of code that own no resources of any kind.

Asynchronous cancelation was designed for tight CPU-bound loops that make no
calls, and therefore would suffer from the need to call pthread_testcancel()
on some regular basis in order to allow responsiveness to cancellation
requests. That's the ONLY time or place you should EVER even consider using
asynchronous cancellation.

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/
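
For completeness, here is a minimal sketch of the safer alternative described
above: leave cancellation in the default DEFERRED mode and rely on read()
being a cancellation point, with a cleanup handler to release resources. The
buffer size and names are purely illustrative.

    #include <pthread.h>
    #include <unistd.h>
    #include <stdlib.h>

    /* Cleanup handler: release whatever the thread owns when cancelled. */
    static void free_buffer(void *arg)
    {
        free(arg);
    }

    static void *reader_thread(void *arg)
    {
        int   fd  = *(int *)arg;
        char *buf = malloc(4096);

        /* Deferred cancellation is the default; read() is a cancellation
         * point, so a pending cancel is acted on there -- no async cancel
         * needed, and no chance of corrupting libc or kernel state.      */
        pthread_cleanup_push(free_buffer, buf);
        for (;;) {
            ssize_t n = read(fd, buf, 4096);    /* cancellation point */
            if (n <= 0)
                break;
            /* ...process data... */
        }
        pthread_cleanup_pop(1);    /* run the handler on normal exit too */
        return NULL;
    }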


=================================TOP===============================
 Q4: When should I use these new thread-safe "_r" functions?  


David Brownell wrote:
> 
> If the "_r" versions are available at all times, use them but
> beware of portability issues.  POSIX specifies a pretty minimal
> set and many implementations add more (e.g. gethostbyname_r).
> Some implementations only expose the "_r" versions if you
> compile in a threaded environment, too.
> 
> - Dave

POSIX 1003.1c-1995 deliberately separates _POSIX_THREAD_SAFE_FUNCTIONS
from _POSIX_THREADS so that they can be easily implemented by
non-threaded systems. The "_r" versions aren't just thread-safe, they
are also much "cleaner" and more modular than the traditional forms.
(for example, you can have several independent readdir_r or strtok_r
streams active simultaneously).

The grand vision is that all UNIX systems, even those without threads,
would of course want to pick up this wonderful new set of interfaces. I
doubt you'll see them in any system without threads support, of course,
but it would be nice.
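
As a small illustration of the "several independent streams" point, here is a
sketch using strtok_r(); with plain strtok() the second stream would clobber
the first one's hidden static state:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char  words[] = "alpha beta gamma";
        char  csv[]   = "one,two,three";
        char *sp1, *sp2;               /* independent per-stream state */
        char *w, *f;

        /* Two tokenizing streams, interleaved. */
        w = strtok_r(words, " ", &sp1);
        f = strtok_r(csv,   ",", &sp2);
        while (w != NULL && f != NULL) {
            printf("%s / %s\n", w, f);
            w = strtok_r(NULL, " ", &sp1);
            f = strtok_r(NULL, ",", &sp2);
        }
        return 0;
    }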

=================================TOP===============================
 Q5: What benchmarks are there on POSIX threads?  

In the book on POSIX.4 by B.Gallmeister there are some very useful POSIX
benchmark programs which allow one to measure the real-time performance of an
operating system. However there is nothing on the threads of POSIX.4a!  Does
anybody know of a useful set of benchmark programs on these POSIX threads ??

Any help is greatly appreciated.

Markus Joos
CERN ECP/ESS
([email protected])


??
=================================TOP===============================
 Q6: Has anyone used the Sparc atomic swap instruction?  

Has anyone used the Sparc atomic swap instruction to safely build lists 
in a multithreaded application?  Any examples?  Any references?


Yes, but it would not help you if you use sun4c machines (no atomic
instructions).  Thus you would be forced to use atomic instructions on sun4m or
later, and spl stuff on sun4c.  That does not make a pretty picture.  Why not
use mutex_lock/unlock and let the libraries worry about that?  mutex_lock uses
atomic/spl stuff.

Sinan

[sun4c are SPARC v7 machines such as 4/110, SS1, SS1+, SS2, IPC,
IPX, EPC, EPX. sun4m are v8 machines including SS10, SS20, SS4, SS5, 4/690,
SS1000, SC2000. The UltraSPARC machines are SPARC v8+ (soon to be v9), but
have the same instructions as the sun4ms.]
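
To illustrate the "just use a mutex" suggestion, here is a minimal sketch of a
mutex-protected list in POSIX threads (the node type and names are invented
for the example; mutex_lock()/mutex_unlock() are the UI-thread equivalents):

    #include <pthread.h>
    #include <stdlib.h>

    struct node {                      /* hypothetical list node */
        struct node *next;
        void        *data;
    };

    static struct node    *head = NULL;
    static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

    void list_push(void *data)
    {
        struct node *n = malloc(sizeof *n);
        n->data = data;
        pthread_mutex_lock(&list_lock);   /* library picks swap/cas/spl */
        n->next = head;
        head = n;
        pthread_mutex_unlock(&list_lock);
    }

    void *list_pop(void)
    {
        void *data = NULL;
        pthread_mutex_lock(&list_lock);
        if (head != NULL) {
            struct node *n = head;
            head = n->next;
            data = n->data;
            free(n);
        }
        pthread_mutex_unlock(&list_lock);
        return data;
    }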
=================================TOP===============================
 Q7: Are there MT-safe interfaces to DBMS libraries?  

A: In general, no.  My current understanding is that NO major DBMS
   vendor has an MT-safe client-side library. (1996)


Peter Sylvester wrote:
> 
> In article <[email protected]>, Andreas Reichenberger
>  wrote:
> 
> > Richard Moulding wrote:
> > >
> > > I need to interface to an Oracle 7 DB from DCE (non-Oracle)
> > > clients. We are planning to build our own  DCE RPC-stored
> > > procedure interface but someone must be selling something to do
> > > this, no?
> 
> I have the same problem with an Informix application which uses a DCE
> interface, and currently limit it to 1 thread coming in.  This works, but
> could be a bottleneck in busy environments, as other incoming RPCs are put
> in a queue (blocked) until the current one finishes.

... stuff deleted

 
> A potential way around this would be to fork off separate processes which
> then start their own connection to the database.  The parent then acts as a
> dispatcher for requests coming in.  I know the forking part works without
> DCE, but I suspect that you have to do all the forks before going into the
> DCE server listening mode.
> 
> I also thought I heard something about Oracle providing a thread safe
> library, maybe in 7.3.  Anyone know?
> 
> --
> Peter Sylvester
> MITRE Corp.
> Bedford, MA
> ([email protected])

This is exactly the way we handled the problem. We wrote a tool that
generates the complete dispatcher from an IDL file. The dispatcher (which is
virtually invisible to the clients and to the developers) distributes the
requests from the clients to its 'backends', which are connected to the
DB. The backends are implemented as single-threaded DCE Servers with the
Interface specified in the IDL File.

We added some features that are not in DCE, like 
  - asynchronous RPCs (the RPC returns immediately and the client can ask the
    dispatcher to return the state of the RPC (if it is done or still running)
    or request the RPC to be canceled) 
  - dividing the backends into classes. i.e. it's possible to have one class of
    backends for querying the database and another class for updates, etc. By
    assigning 2 backends to the query class and the rest of the backends to
    other classes you can limit the number of concurrent queries to 2 (because
    they are time consuming). The client has to specify which class is to be used
    for a RPC (we currently support up to 10 classes)

Context handles are used to tie a client to one backend for transactions which
require more than one RPC to be handled by the same backend (= DB Session).

The reason why the hell we had to do this anyway was to limit the number of
backend processes necessary to support a few hundred PC clients. We
currently run it on AIX and Digital UNIX with Oracle and Ingres. However,
there's no reason why it shouldn't work on any UNIX platform which supports
OSF DCE (V1.1) and with any DB.

Feel free to contact me for more details...

See 'ya

=================================TOP===============================
 Q8: Why do we need re-entrant system calls?  

A:
[email protected] (Jeffrey P Bradford) wrote:
>Why do we need re-entrant system calls?  I know that it's so that
>system calls can be used in a multithreaded environment, but how often
>does one really have multiple threads executing the same system call?
>Do we really need system calls that can be executed by multiple
>threads, or would mutual exclusion be good enough?

Well, there have been some implementations that felt (feel?) that mutual
exclusion is good enough. And, in fact, that will "thread safe" the
functions. But it wreaks havoc with performance, and things like
cancelability. It turns out that real applications have multiple threads
executing the same system call all the time. read() and write() are popular,
as are send() and recv() (On UNIX).

>I'm assuming that system calls can be designed intelligently enough so
>that, for example, if a process wants to perform a disk read, the
>process performs a system call, exits the system call (so another
>thread can perform a disk read), and then is woken up when the disk
>read is done.
>
>Jeff

[I assume the behavior you reference "leave the system call" means
"return to user space"]

That all depends on the OS. On UNIX, that is not the default
system call behavior. On VMS it is (Just two examples).

Brian Silver.

=================================TOP===============================
 Q9: Any "code-coverage" tools for MT applications?  

Is there an application that can help me with "code-coverage" for
MT applications?


A:

Upon which platform are you working?  I did performance profiling last week
on a MT app using prof & gprof on a Solaris 2.4 machine.  For code coverage,
I use tcov.  I suspect that most OS's w/ kernel threads have thread-aware
gprof and tcov commands.

--
Spike White          | [email protected]               | Biker Nerds
HaL Software Systems | '87 BMW K75S, DoD #1347     |  From  HaL
Austin, TX           |  http://www.halsoft.com/users/spike/index.html 
Disclaimer:  HaL, want me to speak for you?  No, Dave... 
=================================TOP===============================
 Q10: How can POSIX join on any thread?  

The pthread_join() function will not allow you to wait for "any" 
thread, like the UI function thr_join() will.  How can I get this?

A:
> >: I want to create a number of threads and then wait for the first
> >: one to finish, not knowing which thread will finish first.  But
> >: it appears pthread_join() forces me to specify exactly which of
> >: my threads I want to wait for.  Am I missing something basic, or
> >: is this a fundamental flaw in pthread_join()?
> >
> >:      Rich Stevens
> >
> >Good call.  I notice Solaris native threads have this support and the
> >pthreads implementations I've seen don't.  I wondered about this myself.
> >
> 
> Same here.  The situation I ran into was a case where once the main
> created the necessary threads and completed any work it was responsible
> for, it just needed to "hang-around" until all the threads completed
> their work before exiting.  pthread_join() for "any" thread in loop using
> a counter for the number of threads seemed the logical choice.  Then I
> realized Solaris threads supported this but POSIX didn't (along with
> reader/writer locks).  Oh well.
> 
> How about the Solaris SPLIT package.  Does it support the "wait for any"
> thread join?

This "wait for any" stuff is highly misleading, and dangerous in most real
threaded applications. It is easy to compare with the traditional UNIX "wait
for any process", but there's no similarity. Processes have a PARENT -- and
when a process "waits for any" it is truly waiting only for its own
children. When your shell waits for your "make" it CANNOT accidentally chomp
down on the termination of the "cc" that make forked off!

This is NOT true with threads, in most of the common industry threading
models (including POSIX 1003.1c-1995 and the "UNIX International" threads
model supported by Solaris). Your thr_join(NULL,...) call may grab the
termination status of a thread used to parallelize an array calculation
within the math library, and thus BREAK the entire application.

Without parent/child relationships, "wait for any" is not only totally
useless, it's outright dangerous. It's like the Win32 "terminate thread"
interface. It may seem "neat" on the surface, but it arbitrarily breaks all
shared data & synchronization invariants in ways that cannot be detected or
repaired, and thus CANNOT be used in anything but a very carefully
constructed "embedded system" type environment where every aspect of the
code is tightly controlled (no third-party libraries, and so forth). The
very limited environments where they are safe & useful are dramatically
outweighed by the danger that having them there (and usually very poorly
explained) encourages their use in inappropriate ways.

It really wouldn't have been hard to devise POSIX 1003.1c-1995 with
parent/child relationships. A relatively small overhead. It wasn't even
seriously considered, because it wasn't done in any of the reference
systems, and certainly wasn't common industry practice. Nevertheless,
there are clearly advantages to "family values" in some situations...
among them being the ability to usefully support "wait for any". But
wishful thinking and a dime gets you one dime...

------------------------------------------------------------------------
Dave Butenhof                              Digital Equipment Corporation
[email protected]                       110 Spit Brook Rd, ZKO2-3/Q18
Phone: 603.881.2218, FAX: 603.881.0120     Nashua, NH 03062-2711
                 "Better Living Through Concurrency"
------------------------------------------------------------------------

I find Dave's comments to be most insightful.  He hits on a big point
that I have heard a number of people express confusion about.  My 2-bits
to add:

  As programmers, we should be thinking about the availability of resources
-- when is something ready for use?  "Is the Matrix multiply complete?" "Has
the data request been satisfied?" etc.  thr_join() is often used as a cheap
substitute for those questions, because we ASSUME that when all N threads
have exited, that the computation is complete.  (Generally accurate, as long
as we control the entire program.  Should some lout get hired to maintain
our code, this assumption could become false in a hurry.)

  The only instance where we REALLY care if a thread has exited is when
the resource in question IS that thread (e.g., we want to un-mmap pages
we reserved for the stack or other rare stuff).

  So... the correct answer is "Don't do that."  Don't use thr_join()
to count threads as they exit.  Set up a barrier or a CV and have the threads
count down as they complete their work, i.e.:

worker threads:

    do_work();
    ...
    lock(M);
    running_threads--;
    if (running_threads == 0) cond_signal(CV);
    unlock(M);
    thr_exit();


"Master" thread:

    ...
    running_threads = N;
    create_workers(N);
    lock(M);
    while (running_threads != 0) cond_wait(CV, M);
    unlock(M);
    ...


-Bil
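
The same count-down idea in POSIX terms, as a minimal sketch (names invented
for illustration):

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
    static int running_threads;      /* set to N before starting the workers */

    void worker_finished(void)       /* each worker calls this when done */
    {
        pthread_mutex_lock(&lock);
        if (--running_threads == 0)
            pthread_cond_signal(&done);
        pthread_mutex_unlock(&lock);
    }

    void wait_for_workers(void)      /* the "master" thread calls this */
    {
        pthread_mutex_lock(&lock);
        while (running_threads != 0)
            pthread_cond_wait(&done, &lock);
        pthread_mutex_unlock(&lock);
    }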


=================================TOP===============================
 Q11: What is the UI equivalent for PTHREAD_MUTEX_INITIALIZER?  

A:

From the man page (man mutex_init):

Solaris Initialize
     The equivalent Solaris API used to  initialize  a  mutex  so
     that  it has several different types of behavior is the type
     argument passed to mutex_init().  No current type  uses  arg
     although  a  future  type  may  specify  additional behavior
     parameters via arg.  type may be one of the following:

     USYNC_THREAD        The mutex can synchronize  threads  only
                         in  this  process.  arg is ignored.  The
                         USYNC_THREAD Solaris mutex type for pro-
                         cess  scope  is  equivalent to the POSIX
                         mutex         attribute          setting
                         PTHREAD_PROCESS_PRIVATE.

     USYNC_PROCESS       The mutex  can  synchronize  threads  in
                         this  process and other processes.  Only
                         one process should initialize the mutex.
                         arg   is   ignored.   The  USYNC_PROCESS
                         Solaris mutex type for process scope  is
                         equivalent  to the POSIX mutex attribute
                         setting   PTHREAD_PROCESS_SHARED.    The
                         object  initialized  with this attribute
                         must  be  allocated  in  memory   shared
                         between  processes, i.e. either in Sys V
                         shared memory  (see  shmop(2)).   or  in
                         memory  mapped  to a file (see mmap(2)).
                         It is illegal to initialize  the  object
                         this  way and to not allocate it in such
                         shared memory.

     Initializing mutexes can also be accomplished by  allocating
     in  zeroed  memory  (default),  in  which  case,  a  type of
     USYNC_THREAD is assumed.  The same mutex must not be  simul-
     taneously  initialized  by  multiple  threads.  A mutex lock
     must not be re-initialized while in use by other threads.

     If default mutex attributes are used, the macro DEFAULTMUTEX
     can  be used to initialize mutexes that are statically allo-
     cated.
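
Roughly, the static-initialization correspondence looks like this (a sketch;
DEFAULTMUTEX comes from <synch.h> on Solaris):

    #include <synch.h>       /* UI threads (Solaris) */
    #include <pthread.h>     /* POSIX threads */

    /* UI threads: statically initialize a process-private mutex,
     * equivalent to mutex_init(&ui_lock, USYNC_THREAD, NULL).      */
    mutex_t ui_lock = DEFAULTMUTEX;

    /* POSIX: the corresponding static initializer.                 */
    pthread_mutex_t posix_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Dynamic initialization, for comparison:
     *     mutex_init(&ui_lock, USYNC_THREAD, NULL);      (UI)
     *     pthread_mutex_init(&posix_lock, NULL);         (POSIX)
     */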

=================================TOP===============================
 Q12: How many threads are too many in one heavyweight process?    

How many are too many for a single machine?

A:

The answer, of course, is "it depends".

Presumably, the number of threads you're considering far outstrips the
number of processors you have available, so it's not really important
whether you're running on uni- or a multiprocessor, and it's not really
important (in this general case) whether the threads implementation has
any kernel support (presumably it doesn't on HP-UX, judging by your post
from 14 Feb 1996 14:31:42 -0500).  So, it comes down to what these
bazillion threads of yours are actually doing.  

If, for the most part, they just sit there waiting for someone to tickle
the other end of a socket connection, then you can probably create LOTS
before you hit "too many".  In this case it would depend on how much
memory is available to your process, in which to keep all of these
sleeping threads (and how much kernel resources are available to create
sockets for them ;-).

If, on the other hand, every one of these bazillion threads is hammering
away on the processor (trying to compute some fractal or something :-),
then creating any more threads than you have processors is too many.
That is, you waste time (performance, throughput, etc.) in switching
back and forth between the threads which you could be spending on
something useful.  That is, life would be better if you just created a
couple of threads and had them make their way through all the work at
hand.

Presumably, your application falls somewhere between the two extremes.
The idea is to design so that your "typical operating conditions"
involve a relatively small number of threads active at any one time.
Having extra ones running isn't a catastrophe, it just means that things
aren't quite as efficient as they otherwise might be.

-- 

------------------------------------------------------------------------
Webb Scales                                Digital Equipment Corporation
[email protected]                   110 Spit Brook Rd, ZKO2-3/Q18
Voice: 603.881.2196, FAX: 603.881.0120     Nashua, NH 03062-2711
         Rule #12:  Be joyful -- seek the joy of being alive.
------------------------------------------------------------------------

=================================TOP===============================
 Q13: Is there an atomic mutex_unlock_and_wait_for_event()?  

Is it possible for a thread to release a mutex and begin
waiting on an "event" in one atomic operation?  I can think of a few
convoluted ways to achieve or simulate this, but am wondering if
there's an easy solution that I'm missing.


A:

This isn't how you'd really want to look at things (POSIX). Figure out what
condition you're interested in and use a CV.

    =================================TOP===============

The NT4.0 beta has a new Win32 API, SignalObjectAndWait that will do what you
want. Sorry, it is not available in 3.51 or earlier.
    -John

Robert V. Head
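
In POSIX terms, the atomic "unlock and wait" is exactly what pthread_cond_wait()
gives you -- a minimal sketch, with the "event" expressed as a predicate (names
invented for illustration):

    #include <pthread.h>

    static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
    static int event_happened;       /* the "event", as a testable predicate */

    void wait_for_event(void)
    {
        pthread_mutex_lock(&m);
        while (!event_happened)
            pthread_cond_wait(&cv, &m);   /* atomically releases m and waits */
        pthread_mutex_unlock(&m);
    }

    void post_event(void)
    {
        pthread_mutex_lock(&m);
        event_happened = 1;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&m);
    }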
=================================TOP===============================
 Q14: Is there an archive of this newsgroup somewhere?  

I believe http://www.dejanews.com keeps a 1 year record of every
newsgroup on the Usenet.  You can search it by author to get your
articles, then pick out individual threads...

=================================TOP===============================
 Q15: Can I copy pthread_mutex_t structures, etc.?  

"Ian" == Ian Emmons  writes:
In article <[email protected]> Ian Emmons  writes:

Ian> Variables of the data type pthread_t are, semantically speaking, a sort of 
Ian> reference, in the following sense:

Ian>     pthread_t tid1;
Ian>     pthread_t tid2;
Ian>     void* ret_val;

Ian>     pthread_create(&tid1, NULL, some_function, NULL);
Ian>     // Now tid1 references a new thread.
Ian>     tid2 = tid1;
Ian>     // Now tid2 references the same thread.
Ian>     pthread_join(tid2, &ret_val);

Ian> In other words, after creating the thread, I can assign from one pthread_t 
Ian> to another, and they all reference the same thread.  Pthread_key_t's (I 
Ian> believe) behave the same way.

    You should not copy one pthread_t structure to another pthread_t
...  it may not be portable.  In some implementations the pthread_t is
not simply a structure containing only a pointer and some keys .... it
is in fact the REAL structure, which would then create two independent
structures which each can be manipulated individually, wreaking havoc.

Ian> An attributes object, like pthread_attr_t (or an unnamed semaphore sem_t), 
Ian> on the other hand does not behave this way.  It has value semantics, because 
Ian> you can't copy one into another and expect to have a second valid attribute 
Ian> object.

Ian> My question is, do pthread_mutex_t's and pthread_cond_t's behave as 
Ian> references or values?

    Same statement .... I have seen enough problems where someone copied
an initialized lock then continued to lock the two mutexes independently
creating very unwanted behavior.

-- 
William E. Hannon Jr.                         internet:[email protected]
AIX/DCE Technical Lead                                         whannon@austin
Austin, Texas 78758     Department ATKS/9132     Phone:(512)838-3238 T/L(678)
'Confidence is what you had, before you understood the situation.' Dr. Dobson


FOLLOWUP: For most programs, you should be passing pointers around, not
structures:


pthread_mutex_t     my_lock;


main()
{  ...
   foo(&my_lock);
   ...
}

foo(pthread_mutex_t *m)
{
pthread_mutex_lock(m);
...
}
=================================TOP===============================
 Q16: After 1800 calls to thr_create() the system freezes. ??  

My problem is that the thread does not get freed or released back to the
system for reuse.  After 1800 calls to thr_create() the system freezes. ??
A: The default for threads in both UI and POSIX is for threads to be
   "undetached" -- meaning that they MUST be joined (thr_join()).  Otherwise
   they will not be garbage collected.  (This default is the wrong choice.  Oh
   well.)
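
A sketch of the two ways to avoid this in POSIX threads -- either join every
thread, or create it detached so its resources are reclaimed automatically
(THR_DETACHED is the corresponding UI-thread flag; work() is a hypothetical
thread function):

    #include <pthread.h>

    extern void *work(void *);

    void spawn_detached(void)
    {
        pthread_attr_t attr;
        pthread_t      tid;

        pthread_attr_init(&attr);
        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
        pthread_create(&tid, &attr, work, NULL);   /* no join needed */
        pthread_attr_destroy(&attr);
    }

    void spawn_and_join(void)
    {
        pthread_t tid;

        pthread_create(&tid, NULL, work, NULL);
        /* ... */
        pthread_join(tid, NULL);    /* reclaims the thread's resources */
    }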
=================================TOP===============================
 Q17: Compiling libraries which might be used in threaded or unthreaded apps?  


   What *is* the straight scoop on how to compile libraries which 
   might be used in threaded or unthreaded apps?  Hopefully the 
   "errno" and "putc()" macros will continue to work even if
   libthread isn't pulled in, so that vendors can make a single
   version of any particular library.

A: Always compile *all* libraries with the reentrancy flag (_REENTRANT for
   UI threads, _POSIX_C_SOURCE=199506L for POSIX threads). Otherwise some 
   poor soul will try to use your library and get hammered.  putc() and
   getc() WILL be slower, but you may use putc_unlocked() & getc_unlocked()
   if you know the I/O stream will be used safely.

   All Solaris libraries are compiled like this.
=================================TOP===============================
 Q18: What's the difference of signal handling for process and thread?   

   What's the difference of signal handling for process and thread? Do the
   signals divided into the types of process-based and thread-based which were
   treated differently in HP-RT? Is there any examples? I'd like to know how to
   initiate, mask, block, wait, catch, ...... the signals. How can I set the
   notification list (process or thread?) of SIGIO for both socket and tty
   using fcntl or ioctl? 

A: You probably want to buy one of the books that discuss this in detail.
   Here's the short answer:



    Signal masking is per-thread.
    But signal handlers are per-process.
    The synchronous signals like SIGSEGV, SIGILL, etc. will be
    processed by the thread which caused the signal.

    The other signals will be handled by any ready thread which
    has the signal unblocked in its mask.

    There are no special thread library calls for signal handling.
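
A common pattern that follows from this is to block the asynchronous signals
in every thread and let one dedicated thread pick them up synchronously. A
minimal sketch, assuming the POSIX two-argument sigwait():

    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>

    static void *signal_thread(void *arg)
    {
        sigset_t *set = arg;
        int       sig;

        for (;;) {
            sigwait(set, &sig);              /* wait synchronously */
            printf("got signal %d\n", sig);  /* ordinary code, no handler rules */
        }
        return NULL;
    }

    int main(void)
    {
        sigset_t  set;
        pthread_t tid;

        sigemptyset(&set);
        sigaddset(&set, SIGINT);
        sigaddset(&set, SIGTERM);
        /* Block these in the main thread; threads created afterwards
         * inherit the mask, so only the dedicated thread sees them.   */
        pthread_sigmask(SIG_BLOCK, &set, NULL);
        pthread_create(&tid, NULL, signal_thread, &set);

        /* ... create worker threads, do the real work ... */
        pthread_join(tid, NULL);
        return 0;
    }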
=================================TOP===============================
 Q19: What about creating large numbers of threads?  

I've asked a question about creating 2500 unbound threads. During these
days, I have written some more testing programs. Hope you would help me to
solve some more problems.

1. I have written a program that creates 10 threads. Then the 10 threads
each create 10 more threads. The 100 newly created threads each creates 10
more threads. In a SPARC 2000, if the concurrency level is 100, the program
takes 7 seconds to terminate. From a paper, unbound thread creation is
claimed to take only 56 usec. How come my testing program is so slow on a
SPARC 2000 that has 20 CPUs? If I use a SPARC 10, the program only takes 1
second to terminate. Is SPARC 2000 slower than a SPARC 10?

2. Instead of creating 2500 threads, I have written a program that creates
200 threads and then kills them all and creates 200 threads and kills them
all and ..... After some while of creating and killing, the program hangs. I
use sigaction to set a global signal handler for the whole process. As the
program is so simple, I don't know where the problem is.

3. In addition, I have written a program that creates 1000 bound
threads. Each thread has a simple loop:

        while (1)
        {
            randomly read an entry of an array
        }

   This time, not only does my program hang, the whole SPARC 2000 hangs. I can't
reset the machine from console. Finally, I have to power down the machine.

Thanks in advance.


A:
=================================TOP===============================
 Q20: What about using sigwaitinfo()?  

>Here is what I am doing.  I am using the early access POSIX threads.
>My main program blocks SIGUSR1 and creates a number of threads.
>One of these threads is dedicated to this signal.  All it does is a
>sigwaitinfo on this signal, sets a flag when it returns, and exits.
>If I send the SIGUSR1 signal to the process using the kill command
>from another window, it does not seem to get it and the other threads
>(which are doing a calculation in a loop) report that SIGUSR1 is not
>pending.
>
>An earlier version of the program which used a signal handler to set
>the flag worked perfectly.
>
>Do you have any ideas on this?

A:

I assume you are using sigwaitinfo(3r) from libposix4.
Unfortunately, sigwaitinfo() is not MT-safe, i.e. does not work correctly
in an MT program, on 2.3/2.4. Use sigwait(2) - it should work on 2.3/2.4.
On 2.5 beta, sigwaitinfo() works.

If you really need the siginfo on 2.3/2.4, it is going to be hard, and the 
solution depends on whether you are running 2.3/2.4 but here is an 
alternative suggestion:

Programmers have used signals between processes as an IPC mechanism. Sounds
like you are trying to do the same. If this is the case, I would strongly
suggest that you use shared memory (see mmap(2)) between processes and
shared memory synchronization (using the SysV shared semaphores - see
semop(2)), or POSIX synchronization objects with the PTHREAD_PROCESS_SHARED
attribute. For example, you can set-up a region of shared memory protected
by a mutex and condition variable. The mutex and condition variable would
also be allocated from the shared memory and would be initialized with the
PTHREAD_PROCESS_SHARED attribute. Now, processes which share this memory
can use the mutex and condition variable as IPC mechanisms - any information
that needs to be passed between them can be passed through the shared
memory (alternative to siginfo :-)). To make this asynchronous, you can
have a thread dedicated to monitoring the shared memory area by waiting
on the condition variable. Now, whenever the signalling process wants to
send a signal, it instead issues a cond_signal on the condition variable.
The thread sleeping on this in the other (receiving) process wakes up
now and processes the information.

In general, signal handlers and threads, even though the system might support
this correctly, should not be used together. Signal handlers could be
looked upon as "substitute threads" when threads were not around in UNIX, 
and now that they are, the interactions between them can be complicated. 
You should mix them together only if absolutely necessary.
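
A minimal sketch of the PTHREAD_PROCESS_SHARED setup described above (error
handling omitted; the structure layout and names are invented for the example):

    #include <pthread.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    struct shared_area {               /* lives in the mapped file */
        pthread_mutex_t lock;
        pthread_cond_t  ready;
        int             have_data;
        char            payload[256];
    };

    struct shared_area *setup_shared(const char *path)
    {
        pthread_mutexattr_t ma;
        pthread_condattr_t  ca;
        struct shared_area *sh;
        int fd;

        fd = open(path, O_RDWR | O_CREAT, 0600);
        ftruncate(fd, sizeof(struct shared_area));
        sh = mmap(NULL, sizeof *sh, PROT_READ | PROT_WRITE,
                  MAP_SHARED, fd, 0);

        pthread_mutexattr_init(&ma);
        pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
        pthread_condattr_init(&ca);
        pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);

        /* Only one process should perform this initialization. */
        pthread_mutex_init(&sh->lock, &ma);
        pthread_cond_init(&sh->ready, &ca);
        return sh;
    }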

=================================TOP===============================
 Q21: How can I have an MT process communicate with many UP processes?  

>I have a multithreaded process, each thread in the multithreaded
>process wants to communicate with another single-threaded process,
>what is the good way to do that?
>
>Assume each thread in the multithreaded process is identical, i.e.
>they are generated using the same funcation call and each thread
>creates a shared memory to do the communication, will the generated
>shared memories operate independently if no synchronization provided?  

A:


  It sounds like you have the right idea.  For each thread/process pair,
build a shared memory segment and use that for communications.  You'll need
some sort of synchronization variable in that shared segment for
coordination.  

  There is no interaction between segments whatsoever.
=================================TOP===============================
 Q22: Writing Multithreaded code with Sybase CTlib ver 10.x?  


>A customer is trying to write a multi-threaded application that also
>uses Sybase CTlib ver 10.x, and he is facing some limitations due to
>the Sybase library. 
>
>BOTTOM LINE: CTlib is reentrant, but according to Sybase is not usable
>in a multi-threaded context. That means it does NOT seem to be usable
>in an MT application.
>
>The purpose of this mail is NOT to get a fix for CTlib, but to try to
>find a workaround, if one exists...

A:

The workaround for the moment is to use the XA library routines from
Sybase, which are, in turn, based upon the TransArc package pthread*
routines.

We should be getting an alpha version of MT safe/hot CTlib towards the first
part of June 1995.  Also of potential interest: there will be an early
version of native-threaded OpenServer soon as well, which really opens
up a lot of possibilities.

Chris Nicholas
SunSoft Developer Engineering
--------------------------------------------------------------
=================================TOP===============================
 Q23: Can we avoid preemption during spin locks?  

>    A while ago I asked you for information on preemption control
> interfaces (in-kernel) which might be available in Solaris2.x. I am
> looking for ways of lowering number of context switches taken as the
> result of adaptive muxtex contention. We have a number of places a
> lock is taken and held for a few scant lines of C. It would be great
> to prevent preemption during these sections of code.

A:

  You're obviously writing a driver of some sort. (Video driver I'd guess?)
And you're VERY concerned with performance on *MP* machines (UPs be damned).
You have tested your code on a standardized, repeatable benchmark, and you
are running into a problem.  You have solid numbers which you are absolutely
certain of.  Right?

  You'll have to excuse my playing the heavy here, but you're talking deep
do-do here, and I don't want to touch it unless I'm totally convinced I (and
you) have to.

  You could set the SPL up to turn off all interrupts.  It would slow your
code down quite a bit though.  The probability of preemption occurring over "a
few scant lines of C" (i.e., a few dozen instructions) approaches zero.
Regularly suffering from preemption during just these few instructions would
be a VERY odd thing.  I am hard pressed to INVENT a situation like this.
Are you absolutely, totally, completely, 100% certain you're seeing this?
Are you willing to put $10 on it?

=================================TOP===============================
 Q24: What about using spin locks instead of adaptive spin locks?  
> 
>    I also would like to know more about something I saw in
> /usr/include/sys/mutex.h. It would appear that it possible to 
> create pure spinning locks (MUXTEX_SPIN) as opposed to the default 
> adaptive mutexes (MUTEX_ADAPTIVE_STAT). These might provide the kind 
> of control I am looking for assuming that these are really supported 
> and not some bastard orphan left over.

A:

  If I understand the question, the answer is "no".  That's what an adaptive
mutex is for.  It optimizes a spin lock to sleep if there's no value in
spinning.  If you use a dumb spin lock instead, you are GUARANTEED to run
slower.
=================================TOP===============================
 Q25: Will thr_create(...,THR_NEW_LWP) fail if the new LWP cannot be added?  

>    Does Sun's implementation of thr_create(...,THR_NEW_LWP) fail
>to create the multiplexed thread if the new LWP cannot be added to the
>multiplexing pool?  The unixware docs indicate Novell's implementation
>of thr_create() uses THR_NEW_LWP as a hint to the implementation to
>increase the pool size.  They also do not state the behavior if the
>new lwp cannot be created.  What is the official statement?

A:

  It should not create a new thread if it returns EAGAIN.  Mind you, you're
fairly unlikely EVER to see this happen in a real program.  (You'll see it
in bugs & in testing/design.)
=================================TOP===============================
 Q26: Is the LWP released upon bound thread termination?  

>  In the sun implementation, if you create a bound
>thread, and the thread eventually terminates, is the LWP released
>upon termination, or upon thr_join with the terminated thread?

A:

  Yes, a bound thread's LWP is released.  This should not affect your
programming at all.  Use thr_setconcurrency() & leave it at that.
=================================TOP===============================
 Q27: What's the difference between pthread FIFO and the Solaris threads scheduling?  

A:  Very little.

=================================TOP===============================
 Q28: I really think I need time-sliced RR.  

>Well, I really think I need time-sliced RR. Since I'm making a
>multithreaded implementation of a functional concurrent process-
>oriented  language. MT support is needed to get good usage
>of multi CPU machines and better realtime. Today processes are custom 
>user-level and the runtime system delivers the scheduling. And the
>language semantic is that processes are timesliced RR.
>Changing the sematic is not realistic. I really hope the pthreads
>will spec RR timeslicing, it would make things easier.

A:

  Think VERY carefully.  When will you ever *REQUIRE* RR scheduling?  And
why?  Remember, you've never had it ever before, so why now?  (There may be
a reason, but it had better be good.)  Scheduling should normally be
invisible, and forcing it up to user-awareness is generally a bad thing.

>For the moment, since this will only be a prototype, bound threads
>will do, but not in a real system with a couple of hundred
>threads/processes.
>
>Convince me I don't need RR timeslicing, that would make things easier.
>Or how do I make my own scheduler in solaris, or should I stay with
>bound threads?

  OK.  (let me turn it around) Give one example of your program which will
fail should thr 3 run before thr 2 where there is absolutely NO
synchronization involved.  With arbitrary time-slicing of course.  I can't
think of an example myself.  (It's just such a weird dependency that I
can't come up with it.  But I don't know everything...)
=================================TOP===============================
 Q29: How important is it to call mutex_destroy() and cond_destroy()?  

here is how I init several of my threading variables

    mutex_init( &lock, USYNC_PROCESS, 0 );
    cond_init( &notBusy, USYNC_PROCESS, 0 );
   
The storage for the variables is in a memory-mapped file. Once I have
opened the file, I call unlink to make sure it will be automatically
cleaned up. How important is it to call mutex_destroy() and
cond_destroy()? Will I wind up leaking some space in the kernel if I
do not call these functions?

A:
=================================TOP===============================
 Q30: EAGAIN/ENOMEM etc. apparently aren't in <pthread.h>?!  

A:
  'Course not.  :-)

  They're in errno.h.  pthread_create() will return them if something goes
wrong.  Be careful, ERRNO is NOT used by the threads calls.
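
Since the threads calls return their error codes directly instead of setting
errno, the check looks like this (a sketch; work() is a hypothetical thread
function):

    #include <pthread.h>
    #include <string.h>
    #include <stdio.h>

    extern void *work(void *);

    int start_worker(pthread_t *tid)
    {
        int err = pthread_create(tid, NULL, work, NULL);
        if (err != 0) {        /* EAGAIN, ENOMEM, ... returned, not in errno */
            fprintf(stderr, "pthread_create: %s\n", strerror(err));
            return -1;
        }
        return 0;
    }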
=================================TOP===============================
 Q31: What can I do about TSD being so slow?  
 Q32: What happened to the pragma 'unshared' in Sun C?  

   I read about a pragma 'unshared' for the C-compiler in some Solaris-thread
   papers. The new C-3.01 doesn't support the feature anymore, I think. There is
   no hint in the Solaris 2.4 Multithread Programming Guide. But the new
   TSD is very slow. I tested a program with direct register allocation under
   gcc (asm "%g3") instead of calling the thr_getspecific procedure and it was 
   over three times faster. Can I do the same thing or something else with the 
   Sun C-compiler to make the C-3.01 Code also faster?

A:

The "thread local storage" feature that was mentioned in early papers
about MT on Solaris, and the pragma "unshared", were never
implemented.  I know what you mean about the performance of TSD.  It
isn't very fast.  I think the key here is to try to structure your
program so that you don't rely too much on thread specific data, if
that's possible.

The SPARC specification reserves those %g registers for internal use.
In general, it's dangerous to rely on using them in your code.
However, SC3.0.1 does not use the %g registers in any user code. It
does use them internally, but never across function calls, and never
in user code.  (If you do use the %g registers across function calls,
be sure to save and restore the registers.)

You can accomplish what gcc does with the "asm" statement by writing
what we call an "inline function template."  Take a look at the math
library inline templates for an idea on how to do that, and see the
inline() man page.  You might also want to take a look at the
AnswerBook for SPARC Assembly Language Programming, which is found in
the "Solaris 2.x Software Developer Answerbook".  The latest part
number for that is 

801-6649-10     SPARC ASSEMBLY LANGUAGE REFERENCE MANUAL REV.A AUG 94

The libm templates are found in /opt/SUNWspro/SC3.0.1/lib/libm.il.
Inline templates are somewhat more work to write, as compared to using
gcc's "asm" feature, but, it's safer.  I don't know about the
robustness of code that uses "asm" - I like gcc, and I use it, but
that particular feature can lead to interesting bugs.

Our next compiler, SC4.0 (coming out in late 1995) will use the %g
registers more aggressively, for performance reasons.  (Having more
registers available to the optimizer lets them do more optimizations.)
There will be a documented switch, -noregs=global (or something like
that) that you will use to tell the SC4.0 NOT to use the global
registers.    When you switch to SC4.0, be sure to read the cc(1) man
page and look for that switch.  
=================================TOP===============================
 Q33: Can I profile an MT-program with the debugger?  

   Can I profile an MT-program with the debugger and a special MT-license
   or do I need the thread-analyser?

A:

The only profiling you can do right now for an MT program is what you
get with the ThreadAnalyzer.  If you have the MT Debugger and SC3.0.1,
then, you should also have a copy of the ThreadAnalyzer (it was first
shipped on the same CD that had SC3.0.1) Look for the binary "tha"
under /opt/SUNWspro/bin.  

The "Collector" feature that you can use from inside the Debugger
doesn't work with MT programs.  Future MT-aware-profiling tools will
be integrated with the Debugger - is that where you'd like to use
profiling?
=================================TOP===============================
 Q34: Sometimes the specified sleep time is SMALLER than what I want.  

>I have a program that generates UDP datagrams at regular intervals.
>It uses real time scheduling for improved accuracy.
>(The code I used is from the Solaris realtime manual.)
>
>This helps, but once in a while I do not get the delay I wanted.
>The specified sleep time is SMALLER (i.e. faster) than what I want.
>
>I use the following procedure for microsecond delays
>
>void
>delay(int us) /* delay in microseconds */
>{
>    struct timeval tv;
>
>    tv.tv_sec = us / 1000000;
>    tv.tv_usec = us % 1000000;
>    (void)select( 0, (fd_set *)NULL, (fd_set *)NULL, (fd_set *)NULL, &tv );
>
>}
>
>
>As I said, when I select a delay, occasionally I get a much smaller delay.
>
>examples:
>    Wanted: 19,776 microseconds, got: 10,379 microseconds
>    Wanted:    910 microseconds, got:    183 microseconds
>
>
>As you can see, the error is significant when it happens.
>It does not happen often. (0.5% of the time)
>
>I could use the usleep() function, but that's in the UCB library.
>Anyone have any advice?

A:

First of all, you cannot do a sleep implementation in any increments
other than 10 milliseconds (or 1/HZ).

Second, there is a bug in the scheduler (fixed in 2.5) that may
mess up your scheduling in about 1 schedule out of every
300,000 or so.

Third, a much better timing interface will be available in
Solaris 2.6 (or maybe earlier) through POSIX interfaces. That
should give you microsecond resolution with less than
50 microseconds latency.

Sinan
=================================TOP===============================
 Q35: Any debugger that single step a thread while the others are running?  

|>  Has anyone looked into the possibility of doing a MT debugger
|> that will allow you to single step a thread while the others
|> are running? This will probably require a debugger that attaches
|> a debugger thread to each thread...

A:

This was the topic of my master's thesis. You might check:

http://plg.uwaterloo.ca/~mkarsten

and follow the link to the abstract or the full version.

Martin
    =================================TOP=================

We have used breakpoint debugging to debug threads programs. We have
implemented a debugger that enables the user to write scripts to debug
programs (not limited to threads programs). This is made possible by a Tcl
interface atop gdb and hooks in gdb, that exports some basic debugger
internals to the user domain.  Thus allowing the user to essentially write
his own Application Specific debugger.

Please see the following web page for more info on the debugger

http://www.eecs.ukans.edu/~halbhavi/debugger.html
or
http://www.tisl.ukans.edu/~halbhavi/debugger.html

Cheers
Sudhir Halbhavi
[email protected]
=================================TOP===============================
 Q36: Any DOS threads libraries?  

> Is there any way, or does anyone have a library, that will allow me to program
> multiple threads?  I need it for SVGA mouse functions.  I use both C++ and
> Watcom C++, 

A:

I use DesqView for my DOS based multi-thread programs.  (Only they don't call
them threads, they call them tasks....)  I like the DesqView interface to 
threads better than the POSIX/Solaris interface, but putting up with DOS was
almost more than I could stand.
=================================TOP===============================
 Q37: Any Pthreads for Linux?  

See: http://pauillac.inria.fr/~xleroy/linuxthreads/
http://sunsite.unc.edu/pub/Linux/docs/faqs/Threads-FAQ/html

Linux has kernel-level threads now and has had a thread-safe libc for a
while.  With LinuxThreads, you don't have to worry about things like your
errno, or blocking system calls. The few standard libc functions that are
inherently not thread safe (due to using static data areas) have been
augmented with thread-safe alternatives.

LinuxThreads are not (fully) POSIX, however. 
   
                   -----------------

I'm quite familiar with Xavier's package. He's done an awesome job given
what he had to work with. Unfortunately, the holes are large, and his
valiant attempts to plug them result in a larger and more complicated
user-mode library than should be necessary, without being able to
completely resolve the problems.

Linux uses clone() which is not "kernel-level threads", though, with
some proposed (and possibly pending) extensions in a future version of
the kernel, it could become that. Right now, it's just a way to create
independent processes that share some resources. The most critical
missing component is the ability to create multiple executable entities
(threads) that share a single PID, thereby making those entities threads
rather than processes.

Linuxthreads, despite using the "pthread_" prefix, is NOT "POSIX
threads" (aka "pthreads") because of the aforementioned substantial and
severe shortcoming of the current implementation based on clone().
Without kernel extensions, a clone()-based thread package on Linux
cannot come close to conforming to the POSIX specification. The common
characterization of Linuxthreads as "POSIX threads" is incorrect and
misleading. This most definitely is not "a true pthreads
implementation", merely a nonstandard thread package that uses the
"pthread" prefix.

Note, I'm not saying that's necessarily bad. It supports much of the
interface, and unlike user-mode implementations (which also tend to be
far more buggy than Linuxthreads), allows the use of multiple
processors.  Linuxthreads is quite useful despite its substantial
deficiencies, and many reasonable programs can be created and ported
using it. But it's still not POSIX.

=================================TOP===============================
 Q38: Any really basic C code example(s) and get us newbies started?  

>Could one of you threads gods please post some really, really basic C code
>example(s) and get us newbies started?  There just doesn't seem to be any other
>way for us to learn how to program using threads.

A:

The following is a compilation of all the generous help that was posted or mailed to me 
concerning the use of threads in introductory programs.  I apologize for it not being
edited very well...  (Now I just need time to go through all of these)

Here's all of the URL's:

http://www.pilgrim.umass.edu/pub/osf_dce/contrib/contrib.html
http://www.sun.com/workshop/threads
http://www.Sun.COM/smi/ssoftpress/catalog/books_comingsoon.html
http://www.aa.net/~mtp/


--Carroll
=================================TOP===============================
 Q39: Please put some Ada references in the FAQ.  

A:

Most Ada books will introduce threading concepts.  Also, check out Windows
Tech Journal, Nov. 95 for more info on this subject.
=================================TOP===============================
 Q40: Which signals are synchronous, and which are asynchronous?  

>I have another question. Since we must clearly distinguish the
>synchronous signals from the asynchronous ones for MT, is there any
>documentation on which is which? I could not find any.

A:

In general, independent of MT, this is an often misunderstood area of
signals.  The adjectives "synchronous"/"asynchronous" cannot be applied to a
signal.  This is because any signal (including normally synchronously
generated signals such as SIGSEGV) could be asynchronously generated using
kill(2), _lwp_kill(2) or thr_kill(3t).

e.g. SIGSEGV, which is normally synchronously generated, can also be sent
via kill(pid, SIGSEGV), in which case it is asynchronously generated. So
labelling SIGSEGV as synchronous and a program that assumes this, would be
incorrect.

For MT, a question is: would a thread that caused the generation of a signal
get this signal?

If this is posed for a trap (SIGSEGV, SIGBUS, SIGILL, etc.), the answer is:
yes - the thread that caused the trap would get the signal.  But the handler
for the trap signal, i.e. a SIGSEGV handler, for example, cannot assume that
the handler was invoked for a synchronously generated SIGSEGV (unless the
application knows that it could not have received a SIGSEGV via a kill(),
or thr_kill()).

If this question is posed for any other signal (such as SIGPIPE, or the
real-time signals) the answer should not really matter since the program
should not depend on whether or not the thread that caused the signal to be
generated, receives it. For traps, it does matter, but for any other signal,
it should not matter.

FYI: On 2.4 and earlier releases, SIGPIPE, and some other signals were sent
to the thread that resulted in the generation of the signal, but on 2.5, any
thread may get the signal. The only signals that are guaranteed to be sent
to the thread that resulted in its generation, are the traps (SIGILL,
SIGTRAP, SIGSEGV, SIGBUS, SIGFPE, etc.). This change should not matter since
a correctly written MT application would not depend on the synchronicity of
the signal generation for non-traps, given the above description of signal
synchronicity that has always been true.

-Devang
=================================TOP===============================
 Q41: If we compile -D_REENTRANT, but without -lthread, will we have problems?  

>Hi -
>
>I had posed a question here a few weeks ago and received a response. Since
>then the customer had some follow-on questions. Can anyone address this
>customer's questions:
>
>(note: '>' refers to previous answer we provided customer)
>
>> If only mutexes are needed to make the library mt-safe, the library writer 
>> can do the following to enable a single mt-safe library to be used by both 
>> MT and single-threaded programs:
>
>Actually, we are only using the *_r(3c) functions, such as strtok_r(3c),
>getlogin_r(3c), and ctime_r(3c).  We are not actually calling thr_*,
>mutex_*, cond_*, etc. in the libraries.
>
>We want to use these *_r(3c) library functions instead of the normal
>non-MT safe versions (such as strtok(), ctime(), etc.), but if we compile
>the object files with -D_REENTRANT, but do not link with -lthread, will
>we have problems?

A:


No - you will not have any problems, if you do not link with -lthread.

But if your library is linked into a program which uses -lthread, then:

You might have problems in a threaded program because of how you allocate 
and use the buffers that are passed in to the *_r routines.

The usage of the *_r routines has to be thread-safe, or re-entrant in
the library. The *_r routines take a buffer as an argument. If the library
uses a global buffer to be passed to these routines, and does not protect
this buffer appropriately, the library would be unsafe in a threaded program.

Note that here, the customer's library has to do one of the following to ensure
that their usage of these buffers is re-entrant:

- if possible, allocate the buffers off the stack - this would be per-thread
  storage and would not require the library to do different things depending
  on whether the library is linked into a threaded program or not.

- if the above is not possible:

On any Solaris release, the following may be done: (recommended solution):

    - use mutexes, assuming that threads are present, to protect the 
      buffers. If the threads library is not linked in, there are dummy 
      entry points in libc for mutexes which do nothing - and so this 
      will compile correctly and still work. If the threads library is 
      linked in, the mutexes will be real and the buffers will be 
      appropriately protected.

On Solaris 2.5 only:

    - if you do not want to use mutexes for some reason and want to use
      thread-specific data (TSD) if threads are present (say), then on 2.4
          you cannot do anything. On 2.5, though, one of the following may be 
      done:
 
    (a) on 2.5, you can use thr_main() to detect if threads are linked in 
          or not. If they are, carry out appropriate TSD allocation of buffers.

    (b) If you are sure only POSIX threads will be used (if at all), and you
      do not like the non-portability of thr_main() which is not a POSIX
      interface, then, on 2.5, you can use the following (hack) to detect if
      pthreads are linked in or not: you need the #pragma weak declaration 
      so that you can check if a pthreads symbol is present or not. If 
      it is, then pthreads are linked in, otherwise they are not. Following
      is a code snippet which demonstrates this. You can compile it with
      both -lpthread and without. If compiled without -lpthread it prints
      out the first print statement. If compiled with -lpthread, it prints
      out the second print statement. I am not sure if this usage of
      #pragma weak is any more portable than using thr_main().

        #include <stdio.h>

        #pragma weak pthread_create

        main()
        {
            if (pthread_create == 0) {
                printf("libpthread not linked\n");
            } else {
                printf("libpthread is present\n");
                /*
                 * In this case, use Thread Specific Data
                 * or mutexes to protect access to the global
                 * buffers passed to the *_r routines.
                 */
            }
        }




-Devang

=================================TOP===============================
 Q42: Can Borland C++ for OS/2 give up a TimeSlice?  

Johan>    Does anyone know if Borland C++ for OS/2 has a function that could be 
Johan>    used within a THREAD to give up a TimeSlice.

A:

    If all you want to do is give up your timeslice
        DosSleep(0)
However, if you are the highest priority thread, you will be immediately dispatched
again, before other threads.  Even when all the threads are the same priority,
my understanding is that the OS/2 operating system has a degradation algorithm
for the threads in a process ... so even if you DosSleep with the "same" priority
your thread still could be dispatched immediately --- depending on the
degradation algorithm.

    If you want to sleep until the next clock tick,
        DosSleep(1)
works, because the system rounds the 1 up to the next clock tick value.
This should allow other threads in your process to be dispatched.

    Both are valid semantics, depending on what you would prefer.
--
William E. Hannon Jr.                         internet:[email protected]
DCE Threads Development                                        whannon@austin
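
A minimal sketch of the two calls described above, assuming the standard OS/2
toolkit headers (DosSleep is declared via <os2.h> with INCL_DOSPROCESS defined):

    #define INCL_DOSPROCESS
    #include <os2.h>

    void give_up_timeslice(void)
    {
        DosSleep(0);    /* yield the rest of this timeslice only */
    }

    void sleep_to_next_tick(void)
    {
        DosSleep(1);    /* 1 ms is rounded up to the next clock tick */
    }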
=================================TOP===============================
 Q43: Are there any VALID uses of suspension?  

    UI threads, OS/2, and NT all allow you to suspend a thread.  I have yet to
  see a program which does not go below the API (i.e., debuggers, GCs, etc.) but
  still uses suspension.  I don't BELIEVE there is a valid use.  I could be
  wrong.

A:

I'll bite.  Whether we "go below the API" or not is for you to decide.
Our product, ObjectStore, is a client-server object-oriented database
system.  For the purpose of this discussion, it functions like a
user-mode virtual memory system: We take a chunk of address space
and use it as a window onto a database; if the user touches an address
within our special range, we catch the page fault, figure out which
database page "belongs" there, and read that page from the server.  After
putting the page into place, we continue the faulting instruction, which
now succeeds, and the user's code need never know that it wasn't there
all the time.

This is all fine for a single-threaded application.  There's a potential
problem for MT applications, however; consider reading a page from a
read-only database.  Thread A comes along and reads a non-existent page.
It faults, the fault enters our handler, and we do the following:
    get data from server
    make page read-write    ;open window
    copy data to page
    make page read-only ;close window
During the window between the two page operations, another thread can
come along and read invalid data from the page, or in fact write the
page, with potentially disastrous effect.

On Windows and OS/2, we do the following:
    get data from server
    suspend all other threads
    make page read-write
    copy data to page
    make page read-only
    resume all other threads
to prevent the "window" from opening.  On OS/2, we use DosEnterCritSec,
which suspends all other threads.  On NT, we use the DllMain routine
to keep track of all the threads in the app, and we call SuspendThread
on each.  We're very careful to keep the interval during which threads
are suspended as brief as possible, and on OS/2 we're careful not to call
the C runtime while holding the critical section.

On most Unix systems, we don't have to do this, because mmap() has the
flexibility to map a single physical page into two or more separate
places in the address space.  This enables us to do this:
    get data from server
    make read-write alias of page, hidden from user
    copy data to alias page
    make read-only page visible to user
The last operation here is atomic, so there's no opportunity for other
threads to see bogus data.  There's no equivalent atomic operation on
NT or OS/2, at least not one that will operate at page granularity.
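
For illustration, a minimal sketch of the Unix aliasing trick described above;
it assumes POSIX shm_open()/mmap() are available and that mmap() with MAP_FIXED
atomically replaces the mapping at the user-visible address (the backing-object
name and install_page() are invented for the example, and error handling is
simplified):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Install 'len' bytes of data read-only at 'user_addr' without ever
     * exposing a writable or partially-filled page at that address. */
    int install_page(void *user_addr, const void *data, size_t len)
    {
        void *alias;
        int fd = shm_open("/page_backing", O_RDWR | O_CREAT, 0600);

        if (fd < 0 || ftruncate(fd, (off_t) len) < 0)
            return -1;

        /* Writable alias, hidden from the user (kernel picks the address). */
        alias = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (alias == MAP_FAILED)
            return -1;
        memcpy(alias, data, len);          /* fill the page via the alias */
        munmap(alias, len);

        /* Atomically make the filled page visible, read-only, at user_addr. */
        if (mmap(user_addr, len, PROT_READ, MAP_SHARED | MAP_FIXED, fd, 0)
                == MAP_FAILED)
            return -1;
        close(fd);
        return 0;
    }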
================
Since you do not like Suspend/Resume to be available as user-level APIs,
I thought the following set of functions (available to programs)
in WinNT (Win32) might catch your interest :) :

CreateRemoteThread -- allows you to start a thread in another process's
address space.  The other process may not even know you've done it
(depending on circumstances).  Supposedly, with full security turned
on (off by default!) this won't violate C2 security.

SetThreadContext/GetThreadContext - Just like it sounds.  You can
manipulate a thread's context (CPU registers, selectors, etc!).

Also, you can forcibly map a library (2-3 different ways: CreateRemoteThread
can allow this as well) to another process's address space (that is, you
can map a DLL of yours to a running process).  Then, you can do
things like spawn off threads, after you have invisibly mapped your DLL
into the space.  Yes, there is potential for abuse (and for interesting
programs).

But Microsoft has a use for these things.  They can help you subclass
a window on the desktop, for instance.  If you wanted to make, say,
Netscape's main window beep twice every time it repaints, you could
map a DLL into Netscape's address space, subclass the main window
(subclass == "Send ME the window's messages, instead of sending them to
the window -- I'll take care of everything!"), and watch for PAINTs
to come through.

Anyway, I don't mean to waste your time.  Just thought you might find
it interesting that a user can start additional threads in someone else's
process, change thread context forcibly (to a decent degree), and
even latch onto a running process in order to change its behavior, or
just latch on, period, to run a thread you wrote in another process's
address space.




=================================TOP===============================
 Q44: What's the status of pthreads on SGI machines?  
>> We are considering porting of large application from Concurrent Computer
>> symmetrical multiprocessor running RTU-6 to one of the Silicon Graphics
>> multiprocessors running IRIX (5.3?).
>> 
>> Our application uses threads heavily. Both so-called user threads and 
>> kernel threads are required with a fair level of synchronization 
>> primitives support and such.
>> 
>> My question is: what kind of multi-threaded application programming 
>> support is available in IRIX? 
>> 
>> Reading some of the SGI technical papers available on their WWW page 
>> just confuses me. I know that Posix threads or Solaris-type 
>> LWP/threads supports would be OK. 

A:

POSIX thread support in IRIX is more than a rumor - pthreads are currently 
scheduled to be available shortly after release of IRIX 6.2 (IRIX 6.2 is 
currently scheduled for release in Feb 96).  If you are interested in 
obtaining pthreads under IRIX as soon as possible, I would recommend 
contacting your local SGI office.
-- 
Bruce Johnson, SGI ASD                 
Real-time Applications Engineering          
=================================TOP===============================
 Q45: Does the Gnu debugger support threads?  

A:

An engineer at Cygnus is implementing thread support in gdb for Solaris.
No date for completion is given.
=================================TOP===============================
 Q46: What is gang scheduling?  

A:

Gang scheduling is described in a variety of ways. Generally the
consistent thread is that gang scheduling gives a process all of the
processors at the same time (or none for a time slice). This is most helpful
for "scientific apps" because the most common setup is something like

    do i=1, bignum
       stuff
       more stuff
       lots more stuff
    end do

the obvious decomposition is bignum/nproc statically allocated. Stuff
and friends take very close to the same time per chunk, so if you get
lucky it all happens in one chime (viz. one big clock). Else it takes
precisely N chimes with no leftovers. When unlucky, it's N chimes +
cleanup for stragglers.

Virtually all supercomputers do this; they may not even bother to give
it a special name. SGI makes this explicit (and supported).

On SPARC/Solaris there is no way for the compiler to know if we'll get
the processors requested or when. So you can suffer multiple chime
losses quite easily.

One can reallocate processor/code on the fly, but with increased overhead.
=================================TOP===============================
 Q47: LinuxThreads linked with X11, calls to X11 seg fault. 


You can't rely on libraries that are not at the very least compiled
with -D_REENTRANT to do anything reasonable with threads.  A vanilla
X11 build (without -D_REENTRANT and without XTHREADS enabled)
will likely behave badly with threads.

It's not terribly hard to build X with thread support these days,
especially if you're using libc-6 with builtin LinuxThreads.  Contact
your Linux distribution maintainer and insist on it.  Debian has just 
switched to a thread-enabled X11 for their libc6 release; has any other
distribution? 

Bill Gribble
=================================TOP===============================
 Q48: Are there Pthreads on Win32?  

Several answers here.  #1 is probably the best (most recent!).


A: Yes, there is a GNU pthreads library for Win32.  It is still under
   active development, but you can find out more by looking at
   http://sourceware.cygnus.com/pthreads-win32/

   (This is a combination of Ben Elliston & John Bossom's work. & others?)


Also:

Well, Dave Butenhof will probably kill me for saying this, but Digital has a
pthreads implementation for WIN32. I bug them occasionally about packaging
up the header and dll and selling it separately (for a reasonable price, of
course). I think it's a great idea. My company has products on NT and UNIX,
so it would solve some painful portability issues for us.  This
implementation uses the same "threads engine" that Digital uses, rather
than just some wrappers on NT system services.

So, maybe if a few potential customers join me in asking Digital for this,
we'll get somewhere.  What say, Dave?

================

I have such a beast...sort of.

I have a pthreads draft 4 wrapper that is (nearly) complete and has been
in use for a while (so it seems to work!).

About 6 weeks back I changed this code to provide a draft 10 interface. This
code has however not yet been fully tested nor folded into my projects.
Casting my mind back (a lot has happened in 6 weeks!) I seem to remember
one or two small issues where I wasn't sure of the semantics; I was working
from a document I picked up at my last job which showed how to migrate
from pthreads 4 to pthreads 10, rather than a copy of the standard.

If anyone wants this code, I can make it available.

Ian
[email protected]

        ================
> > As far as I know, there is no pthreads implementation for NT.  However,
> > ACE provides a C++ threads wrapper which works on pthreads, and on NT
> > (and some others).
>
> Well, Dave Butenhof will probably kill me for saying this, but Digital has a
> pthreads implementation for WIN32. I bug them occasionally about packaging up
> the header and dll and selling it separately (for a reasonable price, of
> course). I think it's a great idea. My company has products on NT and UNIX,
> so it would solve some painful portability issues for us. This implementation
> uses the same "threads engine" that Digital uses, rather than just some
> wrappers on NT system services.

Yes, DECthreads has been ported to Win32 for quite a while. It runs on Windows
NT 3.51 and 4.0, on Alpha and Intel; and also on Windows 95 (though this was
not quite as trivial as Microsoft might wish us to believe.)

The main questions are:

  1. What's the market?
  2. How do we distribute the code, and at what cost? (Not so much "cost to the
     customer", as "cost to Digital".)

The big issue is probably that, especially with existing free code such as ACE,
it seems unlikely that there'd be much interest unless it was free or "dirt
cheap". Yet, even if we disclaim support, there will still be costs associated,
which means it'd be really tricky to avoid losing money.

> So, maybe if a few potential customers join me in asking Digital for this,
> we'll get somewhere.  What say, Dave?

We'd love to hear who wants this and why. Although I haven't felt comfortable
actually advertising the possibility here, I have forwarded the requests I've
seen here, and received via mail (including Jeff's), to my manager, who is the
first (probably of several) who needs to make any decisions.

I'd be glad to forward additional requests. Indications of what sort of product
(e.g., in particular, things like "sold for profit" or "internal utility"
distinctions), and, of course, whether (and how much) you'd be willing to pay,
would be valuable information.

/---------------------------[ Dave Butenhof ]--------------------------\

From: Ben Elliston 

Matthias Block  writes:

> is there someone who knows anything about a Pthread like library for
> Windows NT. It would simplify the work here for me.

I am involved with a free software project to implement POSIX threads
on top of Win32.   For the most part, it is complete, but it's still
well and truly in alpha testing right now.

I expect to be posting an announcement in a few weeks (say, 4?) to
comp.programming.threads.  The source code will be made available via
anonymous CVS for those who want to keep up to date or submit
patches.  I'm looking forward to getting some net testing!


Over the last several months I have seen some requests for
a Win32 implementation of PThreads.  I, too, had been looking
for such an implementation but to no avail.

Back in March, I decided to write my own. It is based upon the
PThreads 1003.1c standard, however, I didn't implement everything.
Missing is signal handling and real-time priority functions.

I based the implementation on the description provided by

    Programming with POSIX Threads, by
    Dave R. Butenhof

I've created a zipped file consisting of some header files, an implib, 
a DLL and a simple test program.

I'm providing this implementation for free and as-is. You may download it
from

    http://www.cyberus.ca/~jebossom/pthread1c.html

Cheers,

John


--
John E. Bossom                                     Cognos Inc.
Voice: (613) 738-1338 x3386        O_o             P.O. Box 9707, Stn. T
  FAX: (613) 738-0002             =(  )= Ack!      OTTAWA, ON  K1G 4K9
 INET:  [email protected]          U             CANADA
=================================TOP===============================
 Q49: What about garbage collection?  


Please, please, please mention garbage collection when you come around
to talking about making code multithreaded.  A whole lot of
heap-allocated data needs to be explicitly reference counted *even
more* in a multithreaded program than in a single threaded program
(since it is so much harder to determine whether data is live or not),
and this leads to lots of bugs and worries and nonsense.

With garbage collection, on the other hand, you get to throw away
*all* of your worries over memory management.  This is a tremendous
win when your brain is already approaching meltdown due to the strain
of debugging subtle race conditions.

In addition, garbage collection can help to make the prepackaged
libraries you link against safer to play with (although it obviously
won't help to make them thread safe).  Xt, for example, is very badly
written and leaks like a sieve, but a conservative garbage collector
will safely kill off those memory leaks.  If you're linking against
legacy libraries and you need to write a long-running multithreaded
server, GC can make the difference between buying more RAM and killing
your server every few days so that it doesn't thrash, and simply
plugging in the threads-aware GC and sailing fairly happily along.

Bryan O'Sullivan 

[Please see: Geodesic Systems (www.geodesic.com)     -Bil]

=================================TOP===============================
 Q50: Does anyone have any information on thread programming for VMS?  

No ftp or web stuff, although we do have an HTML version of the Guide to
DECthreads and we'll probably try to get it outside the firewall where
it'll do y'all some good, one of these days. I've been very impressed
with Sun's "thread web site", and I'd like to get Digital moving in that
direction to help with the global work of evangelizing threads... but
not until I've finished coding, and writing, and consulting, and all
sorts of other things that seem to take 500% of my time. For general
info, and some information (though not enough) on using POSIX threads,
check Sun's library. (They need to start tapering off the UI threads.)

If you've got VMS (anything since 5.5-2), you'll have a hardcopy of the
Guide in your docset, and on the doc cdrom in Bookreader format. OpenVMS
version 7.0 has POSIX 1003.1c-1995 threads -- anything earlier has only
the old CMA and 1003.4a/D4 "DCE threads". Furthermore, OpenVMS Alpha 7.0
supports SMP threads (kernel support for dynamic "many to few"
scheduling), although "binary compatibility paranoia" has set in and it
may end up being nearly impossible to use. OpenVMS VAX 7.0 does not have
SMP or kernel integration -- integration will probably happen "soon",
but VAX will probably never have SMP threads.

------------------------------------------------------------------------
Dave Butenhof                              Digital Equipment Corporation
[email protected]                       110 Spit Brook Rd, ZKO2-3/Q18
Phone: 603.881.2218, FAX: 603.881.0120     Nashua, NH 03062-2711
                 "Better Living Through Concurrency"
------------------------------------------------------------------------
=================================TOP===============================
 Q51: Any information on the DCE threads library?  

 http://www.osf.org/dce/
=================================TOP===============================
 Q52: Can I implement pthread_cleanup_push without a macro?  

I was about to use pthread_cleanup_push, when I noticed that it is
implemented as a macro (on Solaris 2.5) which forces you to have the
pthread_cleanup_pop in the same function by having an open brace { at the
end of the first macro and closing it in the second...  Since I want to
hide most of this stuff in something like a monitor (or a guard in ACE) in
C++ by using the push in a constructor and the pop in the destructor, I'm
wondering if there is something fundamental that would prevent me from doing
so, or could I just re-implement the stuff done by the macros inside some
class services.



POSIX 1003.1c-1995 specifies that pthread_cleanup_push and pthread_cleanup_pop
must be used at the same lexical scope, "as if" the former were a macro that
expands to include an opening brace ("{") and the latter were a macro that
expands to include the matching closing brace ("}").

The Solaris 2.5 definition therefore conforms quite accurately to the intent
of the standard. And so does the Digital UNIX definition, for that matter. If
you can get away with "reverse engineering" the contents of the macros, swell;
but beware that this would NOT be a service to those using your C++ package,
as the results will be extremely non-portable. In fact, no guarantees that it
would work on later versions of Solaris, even assuming strict binary
compatibility in their implementation -- because they could reasonably make
"compatible" changes that would take advantage of various assumptions
regarding how those macros are used that you would be violating.

What you want to do has merit, but you have to remember that you're writing in
C++, not C. The pthread_cleanup_push and pthread_cleanup_pop macros are the C
language binding to the POSIX 1003.1c cancellation cleanup capability. In C++,
the correct implementation of this capability is already built into the
language... destructors. That is, C++ and threads should be working together
to ensure that C++ destructors are run when a thread is cancelled. If that is
done, you've got no problem. If it's not done, you've got far worse problems
anyway since you won't be "destructing" most of your objects anyway.

/---[ Dave Butenhof ]-----------------------[ [email protected] ]---\
| Digital Equipment Corporation           110 Spit Brook Rd ZKO2-3/Q18 |
| 603.881.2218, FAX 603.881.0120                  Nashua NH 03062-2698 |
\-----------------[ Better Living Through Concurrency ]----------------/
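
A minimal sketch of the C binding's lexical-scope rule: the push and pop must
appear as a matched pair in the same block, and the handler runs if the thread
is cancelled (or if pthread_cleanup_pop() is passed a nonzero argument):

    #include <pthread.h>

    static void unlock_mutex(void *arg)
    {
        pthread_mutex_unlock((pthread_mutex_t *) arg);
    }

    void *worker(void *arg)
    {
        pthread_mutex_t *m = (pthread_mutex_t *) arg;

        pthread_mutex_lock(m);
        pthread_cleanup_push(unlock_mutex, m);  /* must pair with _pop below,  */
        pthread_testcancel();                   /* across any cancellation     */
        pthread_cleanup_pop(1);                 /* point; 1 == run the handler */
        return NULL;
    }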


=================================TOP===============================
 Q53: What switches should be passed to particular compilers?  

> Does anyone have a list of what switches should be passed to particular
> compilers to have them generate thread-safe code?  For example,
> 
> Solaris-2 & SunPro cc       : -D_REENTRANT
> Solaris-2 & gcc             : ??
> DEC Alpha OSF 3.2 & /bin/cc : -threads
> IRIX 5.x & /bin/cc          : ??
> 
> Similarly, what libraries are passed to the linker to link in threads
> support?
> 
> Solaris-2 & Solaris threads : -lthread
> DEC Alpha OSF 3.2 threads   : -lpthreads
> IRIX 5.x & IRIX threads     : (none)
> 
> And so forth.
> 
> I'm trying to get GNU autoconf to handle threads gracefully.
> 
> Bill

That would be useful information in general, I suppose. I can supply the
information for Digital UNIX (the operating system previously known as
"DEC OSF/1"), at least.

For 3.x and earlier, the proper compiler switch is -threads, which (for
cc) is effectively just -D_REENTRANT. For linking, the cc driver expands
-threads to "-lpthreads -lmach -lc_r" -- you need all three, immediately
preceding -lc (which must be at the end). -lpthreads isn't enough, it
will pull in libmach and libc_r implicitly and in the wrong order (after
libc, where they will fail to preempt symbols).

For 4.0, you can still use -threads if you're using the DCE threads (D4)
or cma interfaces. If you don't use -threads, the link libraries should
be changed to "-lpthreads -lpthread -lmach -lexc" (before -lc). If you
use 1003.1c-1995 threads, you use "-pthread" instead of "-threads". cc
still gets -D_REENTRANT, but ld gets -lpthread -lmach -lexc.

/---[ Dave Butenhof ]-----------------------[ [email protected] ]---\
| Digital Equipment Corporation           110 Spit Brook Rd ZKO2-3/Q18 |
| 603.881.2218, FAX 603.881.0120                  Nashua NH 03062-2698 |
\-----------------[ Better Living Through Concurrency ]----------------/


=================================TOP===============================
 Q54: How do I find Sun's bug database?  

>I am trying to use Thread Analyzer in Solaris 2.4 for performance
>tuning. But after loading the trace directory, tha exit with following
>error message: 
>Thread Analyzer Fatal Error[0]: Slave communication failure


It always helps if you state which version of the application you are
using, in this case the Thread Analyzer.

There have been a number of bugs which result in this error message
that have been fixed.  Please obtain the latest ThreadAnalyzer patch
from your Authorized Service Provider (ASP) or from our Web page:

http://access1.sun.com/recpatches/DevPro.html
=================================TOP===============================
 Q55:  How do the various vendors' threads libraries compare?  

    Fundamentally, they are all based on the same paradigm, and everything
    you can do in one library you can (pretty much) do in any other.  Ease
    of programming and efficiency will be the major distinctions.

OS                Preferred Threads POSIX Version   Kernel Support Sched model
---------------   ----------------- -------------   -------------- -------------
Solaris 2.5       UI-threads        1003.1c-1995    yes            2 level(1)
SVR4.2MP/UW 2.0   UI-threads        No
IRIX 6.1          sproc             No
IRIX 6.2          sproc             1003.1c-1995(2)
Digital UNIX 3.2  cma               Draft 4         yes            1 to 1
Digital UNIX 4.0  1003.1c-1995      1003.1c-1995    yes            2 level
DGUX 5.4          ?                 Draft 6         yes
NEXTSTEP          (cthreads?)       No
AIX 4.1           AIX Threads(3)    Draft 7         yes            1 to 1
Plan 9            rfork()           No
OpenVMS 6.2       cma               Draft 4         no
OpenVMS Alpha 7.0 1003.1c-1995      1003.1c-1995    yes            2 level
OpenVMS VAX 7.0   1003.1c-1995      1003.1c-1995    no
WinNT             Win32 threads     No
OS/2              DosCreateThread() Draft 4
Win32             Win32 threads     No              yes            1 to 1

Notes:

1) Solaris 2.5 blocks threads in kernel with LWP, but provides a signal to
allow user level scheduler to create a new LWP if desired (and
thr_setconcurrency() can create additional LWPs to minimize the chances of
losing concurrency due to blocking.)

2) According to IRIX 6.2 info on SGI's web, 1003.1c-1995 threads will be
provided only as part of the REACT/pro 3.0 Realtime Extensions kit, not in
the base O/S.

3) Can anyone clarify this? My impression is that AIX 4.1 favors 1003.4a/D7
threads; but then I've never heard the term "AIX Threads".

=================================TOP===============================
 Q56: Why don't I need to declare shared variables VOLATILE?  


> I'm concerned, however, about cases where both the compiler and the
> threads library fulfill their respective specifications.  A conforming
> C compiler can globally allocate some shared (nonvolatile) variable to
> a register that gets saved and restored as the CPU gets passed from
> thread to thread.  Each thread will have its own private value for
> this shared variable, which is not what we want from a shared
> variable.

In some sense this is true, if the compiler knows enough about the
respective scopes of the variable and the pthread_cond_wait (or
pthread_mutex_lock) functions. In practice, most compilers will not try
to keep register copies of global data across a call to an external
function, because it's too hard to know whether the routine might
somehow have access to the address of the data.

So yes, it's true that a compiler that conforms strictly (but very
aggressively) to ANSI C might not work with multiple threads without
volatile. But someone had better fix it. Because any SYSTEM (that is,
pragmatically, a combination of kernel, libraries, and C compiler) that
does not provide the POSIX memory coherency guarantees does not CONFORM
to the POSIX standard. Period. The system CANNOT require you to use
volatile on shared variables for correct behavior, because POSIX
requires only that the POSIX synchronization functions are necessary.

So if your program breaks because you didn't use volatile, that's a BUG.
It may not be a bug in C, or a bug in the threads library, or a bug in
the kernel. But it's a SYSTEM bug, and one or more of those components
will have to work to fix it.

You don't want to use volatile, because, on any system where it makes
any difference, it will be vastly more expensive than a proper
nonvolatile variable. (ANSI C requires "sequence points" for volatile
variables at each expression, whereas POSIX requires them only at
synchronization operations -- a compute-intensive threaded application
will see substantially more memory activity using volatile, and, after
all, it's the memory activity that really slows you down.)

/---[ Dave Butenhof ]-----------------------[ [email protected] ]---\
| Digital Equipment Corporation           110 Spit Brook Rd ZKO2-3/Q18 |
| 603.881.2218, FAX 603.881.0120                  Nashua NH 03062-2698 |
\-----------------[ Better Living Through Concurrency ]----------------/
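
A minimal sketch of the point above: the shared flag needs no volatile
qualifier because the mutex and condition variable operations provide the
visibility guarantees POSIX requires:

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int done = 0;                      /* plain int -- not volatile */

    void signal_done(void)
    {
        pthread_mutex_lock(&lock);
        done = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }

    void wait_for_done(void)
    {
        pthread_mutex_lock(&lock);
        while (!done)                         /* visibility is guaranteed by  */
            pthread_cond_wait(&cond, &lock);  /* the mutex/condvar operations */
        pthread_mutex_unlock(&lock);
    }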

=================================TOP===============================
 Q57: Do pthread_cleanup_push/pop HAVE to be macros (thus lexically scoped)?  

(This question, from Paul Pelletier, and Dave Butenhof's answer are the same
as in Q52 above: POSIX requires pthread_cleanup_push and pthread_cleanup_pop
to be used at the same lexical scope, and in C++ the equivalent capability is
provided by destructors.)

=================================TOP===============================
 Q58: Thread Analyzer Fatal Error[0]: Slave communication failure ??  

>I am trying to use Thread Analyzer in Solaris 2.4 for performance
>tuning. But after loading the trace directory, tha exit with following
>error message: 
>Thread Analyzer Fatal Error[0]: Slave communication failure
>
>I do not know what happened. 

It always helps if you state which version of the application you are
using, in this case the Thread Analyzer.

There have been a number of bugs which result in this error message
that have been fixed.  Please obtain the latest ThreadAnalyzer patch
from your Authorized Service Provider (ASP) or from our Web page:

http://access1.sun.com/recpatches/DevPro.html

Chuck Fisher 

=================================TOP===============================
 Q59: What is the status of Linux threads?  



=================================TOP===============================
 Q60: The Sunsoft debugger won't recognize my PThreads program!  

Nope.  The 3.0.2 version was written before the release of Sun's pthread
library.  However, if you simply include -lthread on the compile line, it
will come up and work.  It's a little bit redundant, but works fine.  Hence:

%cc -o one one.c -lpthread -lthread -lposix4 -g

=================================TOP===============================
 Q61: How are blocking syscall handled in a two-level system?  

> Martin Cracauer wrote:
> >
> > In a thread system that has both user threads and LWPs like Solaris,
> > how are blocking syscall handled?
> 
> Well, do you mean "like Solaris", or do you mean "Solaris"? There's no
> one answer for all systems. LWP, by the way, isn't a very general term.
> Lately I've been using the more cumbersome, but generic and relatively
> meaningful "kernel execution contexts". A process is a KEC, an LWP is a
> KEC, a "virtual processor" is a KEC, a Mach thread is a KEC, an IRIX
> sproc is a KEC, etc.
> 
> > By exchanging blocking syscalls to nonblocking like in a
> > pure-userlevel thread implementation?
> 
> Generally, only "pure user-mode" implementations, without any kernel
> support at all, resort to turning I/O into "nonblocking". It's just not
> an effective mechanism -- there are too many limitations to the UNIX
> nonblocking I/O model.
> 
> > Or by making sure a thread that calls a blocking syscall is on its own
> > LWP (the kernel is entered anyway, so what would be the cost to do
> > so)?
> 
> Solaris 2.5 "latches" a user thread onto an LWP until it blocks in user
> mode -- on a mutex, a condition variable, or until it yields. User
> threads aren't timesliced, and they stick to the LWP across kernel
> blocks. If all LWPs in a process block in the kernel, a special signal
> allows the thread library to create a new one, but other than that you
> need to rely a lot on thr_setconcurrency.
> 
> Digital UNIX 4.0 works very differently. The kernel delivers "upcalls"
> to the user mode scheduler to communicate various state changes. User
> threads, for example, are timesliced on our KECs (which are a special
> form of Mach thread). When a thread blocks in the kernel, the user mode
> scheduler is informed so that a new user thread can be scheduled on the
> virtual processor immediately. The nice thing about this model is that
> we don't need anything like thr_setconcurrency to keep things running.
> Compute-bound user threads can't lock each other out unless one is
> SCHED_FIFO policy. And instead of "fixing things up" by adding a new
> kernel execution context when the last one blocks (giving you a
> concurrency level of 1), we keep you running at the maximum level of
> concurrency supportable -- the number of runnable user threads, or the
> number of physical processors, whichever is less.
> 
> Neither model (nor implementation) is perfect, and it would be safe to
> assume that both Digital and Sun are working on improving every aspect.
> The models may easily become very different in the future.
> 
> /---[ Dave Butenhof ]-----------------------[ [email protected] ]---\
> | Digital Equipment Corporation           110 Spit Brook Rd ZKO2-3/Q18 |
> | 603.881.2218, FAX 603.881.0120                  Nashua NH 03062-2698 |
> \-----------------[ Better Living Through Concurrency ]----------------/

-- 
> Georges Brun-Cottan wrote:
> > So a recursive mutex is far more than just a hack for a lazy programmer or
> > just a way to incorporate non-MT-safe third party code. It is a tool
> > that you need in environments such as OOP, where you cannot or do not
> > want to depend on an execution context.
> 
> Sorry, but I refuse to believe that good threaded design must end where
> OOP begins. There's no reason for two independently developed packages
> to share the same mutex. There's no reason for a package to be designed
> without awareness of where and when mutexes are locked. Therefore, in
> either case, recursive mutexes remain, at best, a convenience, and, at
> worst (and more commonly), a crutch.
> 
> I created the recursive mutex for DCE threads because we were dealing
> with a brand-new world of threading. We had no support from operating
> systems or other libraries. Hardly anything was "thread safe". The DCE
> thread "global mutex" allowed any thread-safe code to lock everything
> around a call to any unsafe code. As an intellectual exercise, I chose
> to implement the global mutex by demonstrating why we'd created the
> concept of "mutex attributes" -- previously, there had been none. As a
> result of this intellectual exercise, it became possible for anyone to
> conveniently create their own recursive mutex, which is locked and
> unlocked using the standard POSIX functions. There really wasn't any
> point to removing the attribute, since it's not that hard to create your
> own recursive mutex.
> 
> Remember that whenever you use recursive mutexes, you are losing
> performance -- recursive mutexes are more expensive to lock and unlock,
> even without mutex contention (and a recursive mutex created on top of
> POSIX thread synchronization is a lot more expensive than one using the
> mutex type attribute). You are also losing concurrency by keeping
> mutexes locked so long and across so much context that you become
> tempted to use recursive mutexes to deal with lock range conflicts.
> 
> Yes, it may be harder to avoid recursive mutexes. Although I've never
> yet seen a valid case proving that recursive mutexes are NECESSARY, I
> won't deny that there may be one or two. None of that changes the fact
> that an implementation avoiding recursive mutexes will perform, and
> scale, far better than one relying on recursive mutexes. If you're
> trying to take advantage of multithreading, all the extra effort in
> analysis and design will pay off in increased concurrency.
> 
> But, like any other aspect of performance analysis, you put the effort
> where the pay is big enough. There are non-critical areas of many
> libraries where avoiding recursive mutexes would be complicated and
> messy, and where the overhead of using them doesn't hurt performance
> significantly. Then, sure, use them. Just know what you're doing, and
> why.
> 
> /---[ Dave Butenhof ]-----------------------[ [email protected] ]---\
> | Digital Equipment Corporation           110 Spit Brook Rd ZKO2-3/Q18 |
> | 603.881.2218, FAX 603.881.0120                  Nashua NH 03062-2698 |
> \-----------------[ Better Living Through Concurrency ]----------------/
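
For reference, a minimal sketch of the "create your own recursive mutex"
approach mentioned above, built from a plain POSIX mutex and condition
variable (names invented for the example; error checking and an owner check
in unlock are omitted):

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;    /* protects owner/count            */
        pthread_cond_t  free;    /* signalled when count drops to 0 */
        pthread_t       owner;
        int             count;
    } rmutex_t;

    void rmutex_init(rmutex_t *r)
    {
        pthread_mutex_init(&r->lock, NULL);
        pthread_cond_init(&r->free, NULL);
        r->count = 0;
    }

    void rmutex_lock(rmutex_t *r)
    {
        pthread_mutex_lock(&r->lock);
        if (r->count > 0 && pthread_equal(r->owner, pthread_self())) {
            r->count++;                       /* already ours: just nest */
        } else {
            while (r->count > 0)
                pthread_cond_wait(&r->free, &r->lock);
            r->owner = pthread_self();
            r->count = 1;
        }
        pthread_mutex_unlock(&r->lock);
    }

    void rmutex_unlock(rmutex_t *r)
    {
        pthread_mutex_lock(&r->lock);
        if (--r->count == 0)
            pthread_cond_signal(&r->free);    /* wake one waiting thread */
        pthread_mutex_unlock(&r->lock);
    }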

=================================TOP===============================
 Q62: Can one thread read from a socket while another thread writes to it?  

It's supposed to work!  That's certainly how sockets are defined.  It's
an easy enough test on your own system.
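
A minimal sketch of such a test (error checking omitted): one thread blocks in
read() on one end of a socketpair while another thread writes to that same
descriptor, and the main thread services the other end:

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int fds[2];

    static void *reader(void *arg)
    {
        char buf[64];
        ssize_t n = read(fds[0], buf, sizeof buf - 1);    /* reads fds[0] ...  */
        if (n > 0) { buf[n] = '\0'; printf("reader got \"%s\"\n", buf); }
        (void) arg;
        return NULL;
    }

    static void *writer(void *arg)
    {
        write(fds[0], "from writer", 11);                 /* ... while this    */
        (void) arg;                                       /* thread writes it  */
        return NULL;
    }

    int main(void)
    {
        pthread_t r, w;
        char buf[64];
        ssize_t n;

        socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
        pthread_create(&r, NULL, reader, NULL);
        pthread_create(&w, NULL, writer, NULL);

        write(fds[1], "from main", 9);          /* feeds the reader thread      */
        n = read(fds[1], buf, sizeof buf - 1);  /* consumes the writer's output */
        if (n > 0) { buf[n] = '\0'; printf("main got \"%s\"\n", buf); }

        pthread_join(r, NULL);
        pthread_join(w, NULL);
        return 0;
    }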
=================================TOP===============================
 Q63: What's a good way of writing threaded C++ classes?  

> 
> Ian Emmons wrote:
> >
> > Baard Bugge wrote:
> > >
> > > >How would we put the whole object into a thread?
> > >
> > > Been there. Done that. Let the constructor create a thread before
> > > returning to the caller (another object). But beware, your OS will
> > > probably start the thread by calling a function (specified by you)
> > > C-style. You want this function to be a member function in your class,
> > > which is ok as long as you make it static. The thread function will
> > > also need the this-pointer to your newly created object. What you want
> > > will look something like this (in NT):
> > >
> > > // Thread callback function.
> > > // NOTE: Need to be written in C or be a static member function
> > > // because of C style calling convention (no hidden this pointer)
> > > LPTHREAD_START_ROUTINE CThread::ThreadFunc(LPVOID inputparam)
> > > {
> > >    CThread *pseudo_this = (CThread *) inputparam;
> > >    ...
> > > }
> > >
> > > This function has access to all the members in the object through the
> > > pseudo this pointer. And all member functions called by this function
> > > will run in the same thread. You'll have to figure out how to
> > > communicate with the other objects in your system though. Be careful.
> > >
> > > --
> > > BaBu
> >
> > You can take this even a step further.  Add a pure virtual to your generic
> > CThread class like so:
> >
> > class CThread
> > {
> >       ...
> > protected:
> >     // I don't remember what Win32 expects as the return value, here,
> >     // but you can fix this up as you wish:
> >     virtual unsigned entryPoint() = 0;
> >       ...
> > };
> >
> > Then have the static ThreadFunc call it like so:
> >
> > // Thread callback function.
> > // NOTE: Need to be written in C or be a static member function
> > // because of C style calling convention (no hidden this pointer)
> > LPTHREAD_START_ROUTINE CThread::ThreadFunc(LPVOID inputparam)
> > {
> >    return ((CThread*) inputparam)->entryPoint();
> > }
> >
> > Now, to create a specific thread, derive from CThread, override entryPoint,
> > and you no longer have to mess around with a pseudo-this pointer, because
> > the real this pointer is available.
> >
> > One tricky issue:  make sure you differentiate between methods that the
> > thread itself will call, and methods that other threads (such as the one
> > that created the thread object) will call -- you will need to do thread
> > synchronization on class members that are shared data.
> >
> > Ian
> >
> > ___________________________________________________________________________
> > Ian Emmons                                       Work phone: (415) 372-3623
> > [email protected]                              Work fax:   (415) 341-8432
> > Persistence Software, 1720 S. Amphlett Blvd. Suite 300, San Mateo, CA 94402
> 

------------------
OK, let me warn everyone this is a very long response, but I just came off
of a large radar project on which I had to design multithreaded objects so
this question jumped out at me.

Yousuf Khan  wrote in article
<[email protected]>...
> I got some hypothetical questions here, I'm not actually now trying to
> do any of this, but I can see myself attempting something in the near
> future.
> 
> Okay, I'm thinking multithreading and OO design methodologies are
> tailor-made for each other, in theory at least. OO design mandates that
> all object instances are considered concurrent with each other. That
> seems like a perfect application of threading principles. However,
> current threading protocols (such the POSIX Pthreads, Microsoft/IBM
> threads, Sun UI threads, etc.) seem to be based around getting
> subprocedures threaded, rather than getting objects threaded.

First, let me state my own programming background so you can apply the
appropriate grain of salt to what I say and understand my assumptions.  I
have programmed first for a few years in a DEC, VMS environment and then for
several more in a Windows/Windows NT environment.

> Okay, I suppose we can get individual methods within an object to be
> threaded, because they are just like subprocedures anyways. But what if we
> wanted to be extremely pedantic, and we want the entire object to be in
> its own thread, in order to be true to OO design paradigms? How would we
> put the whole object into a thread?  My feeling is that we should just
> call the object's constructor inside a thread wrapper, that way the entire
> object will go into a thread, including any other methods that are part of
> that object. What I guess I'm saying is that will calling the constructor
> inside a thread wrapper, only run the constructor inside that thread and
> then the thread will end, or will the entire object now run inside that
> thread from now on?  Am I being oversimplistic in my speculation?


If you want to force an object to, as you say, "run in one thread", you
would have to make every public member function perform a context
switch to the desired thread upon entering the function and switch back upon
exiting.  You would have to protect all member variables and use Get/Set
functions for them that performed context switches as well.

Under Windows NT, if you send a message to a window created by a different
thread, that context switch is performed for you by the operating system.
Your process waits until the ::SendMessage() call completes.  Other than
using SendMessage(), I do not know how you would accomplish such an
operation.  And SendMessage requires a window to which the message will be
sent.  Thus, under NT, you would have to make your object create some kind
of hidden window in the context of the desired thread and then have every
member function do a ::SendMessage() to that window.

(There are variations -- e.g. SendMessageCallback(), PostMessage(), etc for
asynchronous function calls)

Such a design is possible, and maybe workable, but seems to defeat the
purpose of threads, doesn't it?  If one thread is just going to have to wait
for the special thread every function call, why have the special thread at
all?

And I haven't even considered OLE and accessing objects across process
boundaries, or thread-local storage.

(Again, I'm speaking pretty exclusively about the NT threading model here.
I've had enough VMS to last me a lifetime and know very little about Posix
threads.)

It seems your reason for wanting the entire object to run in its own thread
is to be true the OO paradigm, but I think that's perhaps too much of a good
thing.

Why not make your objects completely thread-safe instead?  Create some sort
of a Single-Writer / Multiple-Reader resource locking object for all objects
of the class.  Make each member function use this resource guard, acquiring
a read-lock if it's a const member function or write-lock if it is not
const.

There's nothing to prevent you from assigning specific threads to the
objects to do background work on them, but as long as all access to the
objects is through those safe member functions, they are completely thread
safe.

I mention this because this is how I designed a large radar project I just
finished working on.  I used completely thread-safe, reference counted
objects, read/write locks, and smart pointers in my design and the results
were far better than my most optimistic hopes.  A very fast workstation
program with many dynamic displays showing an enormous amount of continuously
changing data stored in a SQL server database.

I've gone on way too long here so I'll end this without saying half of what
I want to say.  Hope this gives you a few ideas.
=================================TOP===============================
 Q64: Can thread stacks be built in privately mapped memory?  

I've avoided any response to this long thread for a while because I'm not
sure I want to confuse the issue with facts. And, despite the facts, I like
the idea of people learning to treat thread stacks "as if they might be"
private.

Nevertheless, at some time I thought it might be helpful to point out what
POSIX says about the matter... and I guess this is a good time.

POSIX very specifically disallows "non-shared" memory between threads.  That
is, it requires that the address space is associated with the PROCESS, not
with the individual THREADS. All threads share a single virtual address
space, and no memory address is private. Stacks, in particular, CANNOT be
set up with private memory. Although, for safe programming, you should
almost always pretend that it's private.

/---[ Dave Butenhof ]-----------------------[ [email protected] ]---\
|  Digital Equipment Corporation         110 Spit Brook Rd ZKO2-3/Q18  |
|  603.881.2218, FAX 603.881.0120                Nashua NH 03062-2698  |
\-----------------[ Better Living Through Concurrency ]----------------/

=================================TOP===============================
 Q65: Has anyone implemented a mutex with a timeout?  

Has anyone implemented a mutex locking function on top of Solaris or POSIX
threads with a timeout?  The problem I'm trying to solve is if a thread is
unable to obtain a mutex after a certain timeframe (say 30 seconds), then I
want the thread to terminate and return an error.  The Solaris and POSIX
API's only allow the user to check if a mutex can be obtained.

Of course! Check out the code for pthread_np_timed_mutex_t at
http://www.lambdacs.com/jan-97/examples.html
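
A minimal sketch of one way to build this on top of POSIX threads, using
pthread_cond_timedwait() (note that the timeout is an absolute CLOCK_REALTIME
time, so the caller computes "now + 30 seconds" first); the type and function
names are invented for the example:

    #include <errno.h>
    #include <pthread.h>
    #include <time.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             held;
    } timed_mutex_t;

    /* abstime is an absolute CLOCK_REALTIME time, e.g. time(NULL) + 30. */
    int timed_mutex_lock(timed_mutex_t *m, const struct timespec *abstime)
    {
        int rc = 0;

        pthread_mutex_lock(&m->lock);
        while (m->held && rc == 0)
            rc = pthread_cond_timedwait(&m->cond, &m->lock, abstime);
        if (!m->held) {            /* got it, even if the last wait timed out */
            m->held = 1;
            rc = 0;
        }
        pthread_mutex_unlock(&m->lock);
        return rc;                 /* 0 on success, ETIMEDOUT on timeout */
    }

    void timed_mutex_unlock(timed_mutex_t *m)
    {
        pthread_mutex_lock(&m->lock);
        m->held = 0;
        pthread_cond_signal(&m->cond);
        pthread_mutex_unlock(&m->lock);
    }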

=================================TOP===============================
 Q66: I think I need a FIFO mutex for my program...  

>There are VERY few cases where "lock ordering" is truly necessary. In
>general, when it may seem to be necessary, using a work queue to distribute
>the work across a pool of threads will be easier and more efficient. If
>you're convinced that you need lock ordering, rather than POSIX wakeup
>ordering, you have to code it yourself -- using, essentially, a work queue
>model where threads wishing to lock your "queued mutex" queue themselves in
>order. Use a condition variable and "next waiter" predicate to ensure proper
>locking order. It's not that hard.

Right, and you can find a freely available implementation of essentially a
"FIFO Mutex" in ACE.  Take a look at

http://www.cs.wustl.edu/~schmidt/ACE_wrappers/ace/Token.h
http://www.cs.wustl.edu/~schmidt/ACE_wrappers/ace/Token.i
http://www.cs.wustl.edu/~schmidt/ACE_wrappers/ace/Token.cpp

Doug
--
Dr. Douglas C. Schmidt                  ([email protected])
Department of Computer Science, Washington University
St. Louis, MO 63130. Work #: (314) 935-4215; FAX #: (314) 935-7302
http://www.cs.wustl.edu/~schmidt/

You can also find an implementation of FIFO mutexes in the file
pthread_np.{h,c} at: http://www.lambdacs.com/jan-97/examples.html
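
For reference, a minimal sketch of the ticket-based "queued mutex" idea
described above (names invented for the example; error checking omitted):

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;          /* protects the ticket counters       */
        pthread_cond_t  cond;          /* waiters block here                 */
        unsigned long   next_ticket;   /* next ticket to hand out            */
        unsigned long   now_serving;   /* ticket allowed to hold the "mutex" */
    } fifo_mutex_t;

    void fifo_mutex_lock(fifo_mutex_t *m)
    {
        unsigned long my_ticket;

        pthread_mutex_lock(&m->lock);
        my_ticket = m->next_ticket++;
        while (my_ticket != m->now_serving)
            pthread_cond_wait(&m->cond, &m->lock);
        pthread_mutex_unlock(&m->lock);
    }

    void fifo_mutex_unlock(fifo_mutex_t *m)
    {
        pthread_mutex_lock(&m->lock);
        m->now_serving++;
        /* Broadcast because condition variable wakeup order is not FIFO;
           only the thread holding the next ticket will proceed. */
        pthread_cond_broadcast(&m->cond);
        pthread_mutex_unlock(&m->lock);
    }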
=================================TOP===============================
 Q67: Why my multi-threaded X11 app with LinuxThreads crashes?  
> Wolfram Gloger wrote:
> >
> > [email protected] (Jeff Noll) wrote:
> >
> > >       I'm making an X client that connects to a tcp socket. I'm using a
> > > thread to continually read from that socket connection and a text
> > > widget to send to the socket. (an X telnet program that looks kind of
> > > like ncftp, separate input/output windows). When I run this at school
> > > under Solaris it seems to be fine, but when I take it home and try it
> > > under Linux using LinuxThreads 0.5 it crashes when I start entering
> > > into the text window.
> >
> > Crash as in `fatal' X errors ?  A while ago I had a similar experience
> > when trying to create a multi-threaded X11 app with LinuxThreads.  It
> > was quite easy to debug though: the LinuxThreads libpthread library
> > lets all threads get individual errno values (like they should), as
> > long as all sources are compiled with _REENTRANT defined.
> >
> > The X11 libs (at least in XFree86-3.2) are by default not compiled in
> > this way, unfortunately (note I'm not talking about multiple thread
> > support in X11), and they break when using LinuxThreads, e.g. because
> > Xlib relies on read() returning with errno==EAGAIN at times.  This is
> > a problem even when one restricts oneself to using X from a single
> > thread only.
> >
> > Once I recompiled all X11 libs with -D_REENTRANT (totally independent
> > of libpthread), everything works fine.  I could put those libs up for
> > ftp if you're interested to check it out.
> >
> > Regards,
> > Wolfram.
> 
=================================TOP===============================
 Q68: How would we put a C++ object into a thread?  

(The Baard Bugge / Ian Emmons exchange that answered this question is quoted
in full in Q63 above: have the constructor create the thread, use a static
member function as the C-style thread entry point, and pass the object's
this-pointer as the thread argument.)
=================================TOP===============================
 Q69: How different are DEC threads and Pthreads?  
Mike.Lanni wrote:
> 
> Baard Bugge wrote:
> >
> > According to the thread-faq, DCE threads (as in HPUX 10.10) is an
> > older version of Posix 1003.1c threads (as in Solaris 2.5).
> >
> > Whats the differences? Is the two of them, fully or partly, source
> > code compatible?
> >
> > I want my multithreaded code to be cross-compilable on at least the
> > two platforms mentioned above, without too many ifdefs. Can I?
> >
> > --
> > BaBu
> 
> Unfortunately, this is not black and white. If HPUX 10.10 is based on
> Draft 7 or higher, the Solaris and HP codes should be similar. However,
> if HP 10.10 is based on Draft 4, then there is quite a bit of work to be
> done. D4 became popular due to its usage with DCE. Assuming the worst,
> D4, here are some notes that I've put together based on some programming
> I've done. It is not complete by any means, but it should give you an
> idea of what you are up against.
> 
>  - signal handling is different
>  - return codes from pthreads api's are now the real error, vrs. -1
> and    errno
>  - possibly no support for the "non-portable" apis and symbolic
> constants
>  - non support for DCE exception handling
>  - Some of the pthread_attr_ apis have different types and arguments.
>  - Some of the scheduling apis have changed.
>  - Some thread specific api's have changed parameters.
> 
> Below are some mappings that at one time were valid...
> 
> #if defined(_D4_)
> #define PTHREAD_ONCE pthread_once_init
> #define PTHREAD_ATTR_DEFAULT pthread_attr_default
> #define PTHREAD_MUTEXATTR_DEFAULT pthread_mutexattr_default
> #define PTHREAD_CONDATTR_DEFAULT pthread_condattr_default
> #define INITROUTINE pthread_initroutine_t
> #define PTHREAD_ADDR_T pthread_addr_t
> #define START_RTN pthread_startroutine_t
> #define PTHREAD_YIELD pthread_yield
> #define PTHREAD_ATTR_DELETE pthread_attr_delete
> #define PTHREAD_ATTR_CREATE pthread_attr_create
> #define PTHREAD_MUTEXATTR_DELETE pthread_mutexattr_delete
> #define PTHREAD_MUTEXATTR_CREATE pthread_mutexattr_create
> #define PTHREAD_CONDATTR_DELETE pthread_condattr_delete
> #define PTHREAD_CONDATTR_CREATE pthread_condattr_create
> #define PTHREAD_KEYCREATE pthread_keycreate
> #define ATFORK atfork
> #define SIGPROCMASK sigprocmask
> #else
> #define PTHREAD_ONCE PTHREAD_ONCE_INIT
> #define PTHREAD_ATTR_DEFAULT NULL
> #define PTHREAD_MUTEXATTR_DEFAULT NULL
> #define PTHREAD_CONDATTR_DEFAULT NULL
> #define INITROUTINE void *
> #define PTHREAD_ADDR_T void *
> #define START_RTN void *
> #define PTHREAD_YIELD sched_yield
> #define PTHREAD_ATTR_DELETE pthread_attr_destroy
> #define PTHREAD_ATTR_CREATE pthread_attr_init
> #define PTHREAD_MUTEXATTR_DELETE pthread_mutexattr_destroy
> #define PTHREAD_MUTEXATTR_CREATE pthread_mutexattr_init
> #define PTHREAD_CONDATTR_DELETE pthread_condattr_destroy
> #define PTHREAD_CONDATTR_CREATE pthread_condattr_init
> #define PTHREAD_KEYCREATE pthread_key_create
> #define ATFORK pthread_atfork
> #define SIGPROCMASK pthread_sigmask
> #endif
> #if defined(_D4_)
>       rc = pthread_detach(&tid);
>       rc = pthread_exit(status);
>       rc = pthread_join(tid, &status);
>          pthread_setcancel(CANCEL_OFF);
>          pthread_setcancel(CANCEL_ON);
>     (void) pthread_setscheduler(pthread_self(),SCHED_FIFO,PRI_FIFO_MAX);
> #else
>       rc = pthread_detach(tid);
>       rc = pthread_exit(&status);
>       rc = pthread_join(tid, &status_p);
>          pthread_setcancelstate(PTHREAD_CANCEL_DISABLE,NULL);
>          pthread_setcancelstate(PTHREAD_CANCEL_ENABLE,NULL);
>     struct sched_param param;
>     param.sched_priority = 65535;
>     (void) pthread_setschedparam(pthread_self(),SCHED_FIFO,&param);
> #endif /* _D4_ */
> 
> Hope this helps.
> 
> Mike L.
> --------------------------------------------------------------------
> Michael J. Lanni
> NCR                            email:  [email protected]
> 3325 Platt Springs Road        phone:  803-939-2512
> West Columbia, SC 29170          fax:  803-939-7317
> http://www.columbiasc.ncr.com/home_pages/mlanni.html
=================================TOP===============================
 Q70: How can I manipulate POSIX thread IDs?  
Steven G. Townsend wrote:
> 
> Jim Robinson wrote:
> >
> > In article <[email protected]>, Ian Emmons wrote:
> > >Robert Patrick wrote:
> > >>
> > >> Yes, you can copy one pthread_t to another.  The part you have to be
> > >> careful about is that in some implementations pthread_t is a struct
> > >> and in others it is not.  Therefore, setting two pthread_t's to be
> > >> equal by assignment will not be portable.  However, memcpy(a, b,
> > >> sizeof(pthread_t)) should always work.
> 
> As to the assignment issue, see Jim's comment below.
> As to the first point, assume for the moment that a part of
> the structure is an array (status, pointers to currently allocated
> keys, whatever); if anything in the array can change, the "copy"
> will not be updated. Assume a status flag/bit which indicates
> whether the thread is runnable: looking at the copy could easily
> produce different results than the actual value of the "true"
> pthread_t.  This is just a bad thing to do.  Other problems
> can occur as well...
> What happens if both the original and the copy are passed to
> pthread_destroy?
> What happens if as we are doing the '=' or memcpy operation the
> thread is currently executing on a different processor (i.e.
> The contents of the pthread_t object would need to be protected
> by a mutex)?
> When it comes to copying pthread_t s...
>   Just say 'no'.
> 
> > >>
> > >> Just my two cents,
> > >> Robert
> > >
> > >Since I work in C++ exclusively, this isn't an issue for me, and so I never thought
> > >about that.  For C coders, you're right, of course.
> >
> > Structure assignment is defined in *ANSI* C. See page 127 of K&R, 2nd
> > edition. Since ANSI C has been standardized for quite some time now, it
> > should be a non-issue for C coders as well, no?
> 
> > --
> > Jim Robinson
> > [email protected]
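
A short sketch, not from the original posts, of the portable approach: copy a
pthread_t with assignment or memcpy, but compare IDs only with pthread_equal()
(the helper name below is invented for illustration):

#include <pthread.h>
#include <string.h>

/* Copying a pthread_t is fine; comparing with '==' is not, because
   pthread_t may be a struct.  pthread_equal() is the only portable test. */
int called_by(pthread_t owner)
{
    pthread_t copy;

    memcpy(&copy, &owner, sizeof(pthread_t));   /* or simply: copy = owner; */
    return pthread_equal(copy, pthread_self()); /* nonzero if same thread */
}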
=================================TOP===============================
 Q71: I'd like a "write" that allowed a timeout value...  

Marc Peters wrote:
> What would be nice to have is a "write" that allowed a timeout value to be
> specified.  A la:
> 
>         write(fdSocket, bufferPtr, bufferLength, timeSpec);
> 
> If the write doesn't succeed within the specified timeSpec, then errno
> should be set to ETIMEDOUT or something.  Obviously, this would be quite handy
> in network code.
>
> Due to other circumstances beyond my control, the fdSocket cannot be placed
> in non-blocking mode.  Thus, the solution I'm left with is to start a POSIX
> timer, wait on the blocking write, and check for an errno of EINTR when it
> returns (if it timed out).
>
> I'm aware of the alternate technique of dedicating a thread to dispatching
> signal events.  This dedicated thread merely does a sigwaitinfo() and
> dispatches accordingly.  This technique, too, is offensive for such a simple
> requirement -- the "timed" write.

Why not just do these possibly long writes in separate threads? And, if
some "manager" decides they've gone too long, cancel the threads.

> I've ordered the POSIX 1003.1 standard to pursue this; however, it will be
> several days before it arrives.  Can anyone fill me in with some details of
> SIGEV_THREAD in the meantime?

SIGEV_THREAD creates a thread instead of raising a signal in the more
conventional manner. You get to specify the start routine (instead of
a signal catching function), and the attributes. The thread runs
anonymously (by default it's detached, and if you use an attributes
object with detachstate set to PTHREAD_CREATE_JOINABLE, the behavior
is unspecified).

The main advantage is that your "signal catching function" can lock
mutexes, signal condition variables, etc., instead of being restricted
to only the short list of async-signal safe functions.

In your case, a SIGEV_THREAD action would still constitute a "double
dispatch" for your signal. The code wouldn't look much different from
your current version.

Oh yeah, there's a major disadvantage, for you, in SIGEV_THREAD.
Solaris 2.5 doesn't implement SIGEV_THREAD. So you'd have to wait for
Solaris 2.6. (Just to be fair, I'll also point out that Digital UNIX 4.0
didn't do SIGEV_THREAD, either -- it is a minor and relatively obscure
function, and we all had more important things to worry about. We also
caught up to it later, and we'll be supporting SIGEV_THREAD in Digital
UNIX 4.0D [or at least, that's what we're calling it now, though these
things are always subject to change for various reasons].)

/---[ Dave Butenhof ]-----------------------[ [email protected] ]---\
| Digital Equipment Corporation           110 Spit Brook Rd ZKO2-3/Q18 |
| 603.881.2218, FAX 603.881.0120                  Nashua NH 03062-2698 |
\-----------------[ Better Living Through Concurrency ]----------------/
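
For reference, a minimal sketch of what a SIGEV_THREAD notification looks like,
assuming a system that actually implements it (typically linked with -lrt); the
names timeout_fired and arm_timeout are invented for illustration:

#include <signal.h>
#include <time.h>
#include <string.h>
#include <stdio.h>

/* Notification routine: runs in a newly created (detached) thread when the
   timer expires, so it may use mutexes, condition variables, etc. */
static void timeout_fired(union sigval arg)
{
    printf("write on fd %d took too long\n", arg.sival_int);
}

/* Arm a one-shot timer whose expiration is delivered via SIGEV_THREAD. */
static int arm_timeout(int fd, time_t seconds, timer_t *tid)
{
    struct sigevent   ev;
    struct itimerspec its;

    memset(&ev, 0, sizeof(ev));
    ev.sigev_notify            = SIGEV_THREAD;
    ev.sigev_notify_function   = timeout_fired;
    ev.sigev_notify_attributes = NULL;          /* default: detached thread */
    ev.sigev_value.sival_int   = fd;

    if (timer_create(CLOCK_REALTIME, &ev, tid) != 0)
        return -1;

    memset(&its, 0, sizeof(its));
    its.it_value.tv_sec = seconds;              /* one shot, no reload */
    return timer_settime(*tid, 0, &its, NULL);
}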

=================================TOP===============================
 Q72: I couldn't get threads to work with glibc-2.0.  

>I finally got this to compile cleanly, but it stalls (sigsuspend) somewhere
>in pthread_create()...

    If you are using glibc-2.0, you should upgrade to glibc-2.0.1. 
There is a bug in 2.0 that makes all thread creation fail.

         Yann Doussot  
=================================TOP===============================
 Q73: Can I do dead-owner-process recovery with POSIX mutexes?  

el Gringo wrote:
> 
> Hi.
> 
> I am trying to create a mutex in a program that has to work on NT 4.0 and AIX.
> For NT, I use CreateMutex...etc, and in this case, if the process owning the
> mutex crashes, the system releases the mutex and returns WAIT_ABANDONED to the
> thread that is waiting for the mutex to be released. And if the mutex is
> opened several times by the same thread, the call succeeds and the mutex count
> is incremented.
> 
> What I don't know is if the pthread mutexes do the same thing when a thread or
> process owning the mutex crashes...or when pthread_mutex_lock() is called
> several times by the same one. Could someone provide me with a doc or web site
> so I can find those answers ? Thanks. Riad

Riad,

  Ah...

  You got a *big* set of problems to deal with.  The simple answer is that
POSIX mutexes don't do that, but that you can create that kind of behavior
if you want to.

  The problems I refer to are those surrounding what you do when the owner process
crashes.  There is a massive amount of work to be done to ensure that you don't
have corrupted data after you get WAIT_ABANDONED.  Unless you've already taken
care of this, you've got a h*** of a lot of work to do.

  So...  the best answer is go find an expert in this area (or spend a month
becoming one) and hash out the issues.  Building the mutex will be the easiest
part of it.

  Good luck.

-Bil
=================================TOP===============================
 Q74: Will IRIX distribute threads immediately to CPUs?  

Michel Lesoinne   wrote:
>The first question concerns new and delete under C++ as well as malloc.
>Are these functions thread-safe? 
------------------------------------------------------------------------
Yes, these calls are thread-safe.

>The second question has to do with multi-CPU machines. I have noticed
>that POSIX threads do not get shipped to other CPU's immediately after
>being started. For example if you bring up gr_osview and watch the CPU
>usage as you start 4 parallel threads, it takes approximately 1 second
>for the 4 threads to run in parallel rather than on the same CPU. Worse,
------------------------------------------------------------------------
The current pthread implementation only creates additional sprocs as it
deems necessary given the application activity.  Interaction between
pthreads which may require context switching can lower the requirement
while CPU-bound threads will raise it.  There can be a short delay
"ramping-up" before the ideal number of sprocs are active.  The kernel
is responsible for scheduling the sprocs on CPUs.  Thus you may be
seeing 2 effects (though 1 second seems a little long to me).

>Is there a way to force IRIX to distribute the threads immediately?
------------------------------------------------------------------------
Currently there is no way to force this behaviour though we expect to
add tuning interfaces in the future.  You may try experimenting by setting
the environment variable PT_ITC as a hint to the library that your app is
CPU bound.

    sincerely,
        ..jph...
=================================TOP===============================
 Q75: IRIX pthreads won't use both CPUs?  

Dirk Bartz   wrote:
>I've written a parallel program using pthreads on two processors
>which consists of two parallel stages.
>The first stage reads jobs from a queue (protected by a mutex) and
>should process them in parallel (which it doesn't); the second
>stage works fine.
>
>Now, some debugger sessions of the first stage show that
>both pthreads are started, but the first one is being blocked
>most of the time. The cvd says:
>   0    _procblk() ["procblk.s":15, 0x0fab4d74]
>   1    _blockproc() ["blockproc.c":23, 0x0fab56e0]
>   2    vp_idle() ["vp.c":1702, 0x07fe61e8]
>
>It seems that the first pthread is only sharing one processor
>with the second thread. It is *not* blocked at the mutex!
>
>Does anyone have a clue what happened?
------------------------------------------------------------------------
Hi,
    First of all that curious backtrace is from one of the underlying
sprocs on which the pthreads execute.  As you can see it is currently idle
and blocked in the kernel waiting for the library to activate it when more
work (a pthread) is ready to run.  If you use the cvd showthread all command
it will show you the pthread state which should in your case be MUTEX-WAIT
for the pthread of interest.  If you then backtrace that pthread you should
see it blocked in the mutex locking code.

A second point to note is that the pthreads library attempts to use an
appropriate number of sprocs for its scheduling.  If your application creates
2 CPU-bound threads then on an MP machine 2 sprocs will be created to run
the threads.  On a UP only one sproc will be created and will switch between
the two pthreads.  On an MP where the threads are not CPU-bound the problem
is more complex; when 2 pthreads are tightly synchronised then a single
sproc may be a better choice - this may be what you are seeing.

I hope the above explains what you are seeing.

    sincerely,
        ..jph....
=================================TOP===============================
 Q76: Are there thread mutexes, LWP mutexes *and* kernel mutexes?  

> In a typical "two level" scheduling scheme, say solaris,
> synchronization primitives used at the thread level (POSIX or solaris)
> are provided by the user level scheduler library.  At the LWP level,
> are there any synchronization primitives, and if so, where would one
> use those as opposed to using the user level library primitives?
> Of course, there would be some synchronization primitives for the
> kernel use.  Does it mean that there are 3 distinct sets of primitives
> (user level, LWP level and kernel level)?  Can anyone throw some light
> on the LWP level primitives (if any) and point out where these would be
> useful?

  You may remember that scene in the Wizard of Oz, where Toto runs away
in panic at the sight of the Powerful Oz.  He discovers a little man
running the machinery behind a curtain.  The catch-line was "Pay no attention
to that man behind the curtain."

  Same thing here.  You call pthread_mutex_lock() and it does whatever it
needs to so that things work.  End of story.

  But if you *really* want to peek...  If the mutex is locked, then the
thread knows that it needs to go to sleep, and it calls one set of routines
if it's an unbound thread, another if it's bound.  (If you hack around inside
the library code, you'll be able to see the guts of the thing, and you'll
find calls to things like _lwp_mutex_lock().  You will NEVER call those!)

  Now, as for kernel hacking, that's a different picture.  If you are going
to go into the kernel and write a new device driver or fix the virtual 
memory system, you'll be working with a different interface.  It's similar
to pthreads, but unique to each kernel.  The older OSs didn't even HAVE a
threads package!
=================================TOP===============================
 Q77: Does anyone know of a MT-safe alternative to setjmp and longjmp?  
=================================TOP===============================
>      I am taking an operating systems class; therefore, my
>    question will sound pretty trivial.  Basically, I am
>    trying to create thread_create, thread_yield, and thread_exit
>    functions.  Basically, I have two files.  They compile fine and
>    everything but whenever I try to run the program I get the error:
>    "longjmp or siglongjmp function used outside of saved context
>     abort process"
>    All I know is that we are running this on an alpha machine at
>    school [...]
> 
>    Anyway, I just want to know if anyone has ever tried to do a longjmp from a
>    jmp_buf that was not the same as that used in setjmp.
> 
> The runtime environment provided with some operating systems (e.g.,
> Ultrix or whatever DEC `Unix' is called these days) performs an
> explicit check that the destination stack frame is an ancestor of
> the current one.  On these systems you cannot use setjmp/longjmp
> (as supplied) to implement threads.
> 
> On systems whose longjmp is trusting, setjmp/longjmp is a very common
> way of building user-space threading libraries.  This particular wheel
> has been reinvented many times.
> 
> If you know the layout of a jmp_buf, you *can* use setjmp but you will
> have to implement a compatible longjmp yourself in order to change the
> processor context to that of the next task.  If you have a
> disassembler you might be able to reverse engineer a copy of longjmp
> with the check disabled.
> 
> *I* would consider this outside the scope of such an exercise but your
> professor may disagree.
> 
> Steve
> --
> Stephen Crane, Dept of Computing, Imperial College of Science, Technology and
> Medicine, 180 Queen's Gate, London sw7 2bz, UK:jsc@{doc.ic.ac.uk, icdoc.uucp}
> Unix(tm): A great place to live, but a terrible place to visit.
=================================TOP===============================
 Q78: How do I get more information inside a signal handler?   
Mark Lindner wrote:
> 
> I'm writing a multithreaded daemon that supports dynamic runtime loading
> of modules (.so files). I want it to be able to recover from signals such
> as SIGSEGV and SIGFPE that are generated by faulty module code. If a given
> module causes a fault, I want the daemon to unload that module so that
> it's not called again.
> 
> My problem is that once a signal is delivered, I don't know which worker
> thread it came from, and hence I have no idea which module is faulty. The
> O'Reilly pthreads book conveniently skirts this issue. I poked around on
> the system and found the getcontext() call; I tried saving the context for
> each worker thread, and then using the ucontext_t structure passed as the
> 3rd argument to the signal handler registered by sigaction(), but
> unfortunately I can't find anything that matches...the contexts don't even
> appear to be the same.
> 
> Since the behavior of pthreads calls is undefined within a signal handler,
> I can't use pthread_self() to figure out which thread it is either.
> 
> All examples I've seen to date assume that either:
> 
> a) only one thread can generate a given signal
> 
> or
> 
> b) two or more threads can generate a given signal, but the signal handler
> does the same thing regardless of which thread generated it.
> 
> My situation doesn't fall into either of these categories.
> 
> Any help would be appreciated.
> 
> --
> 
> Cheers!
> Mark
> 
> ------------------------------------------------------------------------------
> [email protected]       | http://love.geology.yale.edu/~markl/
> ------------------------------------------------------------------------------
>                    I looked up
>                    As if somehow I would grasp the heavens
>                    The universe
>                    Worlds beyond number...
> ------------------------------------------------------------------------------
=================================TOP===============================
 Q79: Is there a test suite for Pthreads?  


Re: COMMERCIAL: Pthreads Test Suite Available
 
The Open Group VSTH test suite for Threads implementations of
POSIX 1003.1c-1995 and  the X/Open System Interfaces (XSH5) Aspen
threads extensions is now generally available.
For further information on the test suite see
http://www.rdg.opengroup.org/testing/testsuites/vsthover.htm
For information on the Aspen threads extensions see
http://www.rdg.opengroup.org/unix/version2/whatsnew/threads.html

> Andrew Josey, Email: [email protected]
> #include 
=================================TOP===============================
 Q80:  Flushing the Store Buffer vs. Compare and Swap  

Just looking at the CAS and InterLockedXXX instructions... 

  "Hey!" says I to myself, "Nobody's minding the store buffer!"
A couple of people have shown some examples of using InterLockedXXX
in Win32, but they never address memory coherency!

  So, if they implement a mutex with InterLockedExchange:

lock(int *lock)
{while (InterLockedExchange(lock, 1) == 1) sleep();} 

unlock(int *lock)
{*lock = 0;}

  at unlock time, some changed data might not be written out to
main memory.  Hence we need this:

unlock(int *lock)
{
 FlushStoreBuffer();
 *lock = 0;
}

  Or is there something about x86 that I don't know about here?
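
  For what it's worth, the same idea can be written with C11 atomics, where
the required ordering is explicit rather than implied; this is just a sketch
of the technique, not a statement about what any particular x86 or Win32
primitive guarantees:

#include <stdatomic.h>
#include <sched.h>

/* Acquire semantics on the exchange keep the critical section's loads and
   stores from moving above the lock. */
static void my_lock(atomic_int *l)
{
    while (atomic_exchange_explicit(l, 1, memory_order_acquire) == 1)
        sched_yield();                 /* stand-in for the original sleep() */
}

/* Release semantics on the store make all prior writes visible before the
   lock is seen as free -- the "flush the store buffer" step in question. */
static void my_unlock(atomic_int *l)
{
    atomic_store_explicit(l, 0, memory_order_release);
}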
=================================TOP===============================
 Q81: How many threads CAN a POSIX process have?  

Dave Butenhof wrote:
> 
> Bryan O'Sullivan wrote:
> >
> > r> _POSIX_THREAD_THREADS_MAX that claims to be the maximum threads per
> > r> process.
> >
> > As I recall, this is a minimum requirement.  Solaris certainly
> > supports far more than 64 threads in a single process, and I'm sure
> > that Irix does, too.
> 
> POSIX specifies two compile-time constants, in <limits.h>, for each
> runtime limit. One is the MINIMUM value of that MAXIMUM which is
> required to conform to the standard. _POSIX_THREAD_THREADS_MAX must be
> defined to 64 on all conforming implementations, and all conforming
> implementations must not arbitrarily prevent you from creating at least
> that many threads.
> 
> The symbol PTHREAD_THREADS_MAX may ALSO be defined, to the true limit
> allowed by the system, IF (and only if) that limit is fixed and can be
> predicted at compile time. (The value of PTHREAD_THREADS_MAX must be at
> least 64, of course.) I don't know of any systems that define this
> symbol, however, because we don't implement any fixed limit on the
> number of threads. The limit is dynamic, and dictated purely by a wide
> range of resource constraints within the system. In practice, the only
> way to predict how many threads you can create in any particular
> situation is to bring a program into precisely that situation and count
> how many threads it can create. Remember that the "situation" includes
> the total size of your program text and data, any additional dynamic
> memory used by the process (including all shared libraries), the virtual
> memory and swapfile limits of the current system, and, in some cases,
> the state of all other processes on the system.
> 
> In short, the concept of "a limit" is a fiction. There's no such thing,
> without knowing the complete state of the system -- rarely practical in
> real life.
> 
> Oh, and by the way, there's no guarantee (in POSIX or anywhere else)
> that you can create even 64 threads. That just means that the system
> cannot arbitrarily prevent you from creating that many. If you use up
> enough virtual memory, you may be unable to create even one thread.
> That's life.
> 
> As Bryan said, you can normally rely on being able to create hundreds,
> and usually thousands, of threads on any of the current 2-level
> scheduling POSIX threads implementations. Kernel-level implementations
> are typically more limited due to kernel quotas on the number of kernel
> thread slots available for the system and often for each user.
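
A throwaway sketch of the "just count" experiment described above: create
threads until pthread_create() fails and report how many you got.  The threads
here simply park on a mutex that main() never releases; the number you see
will vary wildly with stack size and everything else listed above.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t forever = PTHREAD_MUTEX_INITIALIZER;

/* Each thread just blocks on a mutex that main() holds for the duration. */
static void *parked(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&forever);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    long n = 0;

    pthread_mutex_lock(&forever);
    while (pthread_create(&tid, NULL, parked, NULL) == 0)
        n++;
    printf("created %ld threads before pthread_create() failed\n", n);
    return 0;
}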
=================================TOP===============================
 Q82: Can Pthreads wait for combinations of conditions?  

> Is there any way in Pthreads to wait for boolean combinations of conditions
> (i.e. wait for any one of a set of conditions or wait until all of a set of
> conditions have occurred). I'm looking for a feature similar to the VMS
> Wait for logical OR of event flags or the OS/2 multiplexed semaphores.

  You mean something like this:


void *consumer(void *arg)
{request_t *request;

 while(1)
   {pthread_mutex_lock(&requests_lock);
    while ((length == 0) && (!stop))    <--  While both are true, sleep
      pthread_cond_wait(&requests_consumer, &requests_lock);
    if (stop) break;
    request = remove_request();
    length--;
    pthread_mutex_unlock(&requests_lock);
    pthread_cond_signal(&requests_producer);
    process_request(request);
  }
 pthread_mutex_unlock(&requests_lock);
 sem_post(&barrier);
 pthread_exit(NULL);
}


Or perhaps:

    while ( ((length == 0) && (!stop))  ||
        (age_of(granny) > 100) ||
        (no_data_on_socket(fd) && still_alive(client)) ||
        (frozen_over(hell))  )
      pthread_cond_wait(&requests_consumer, &requests_lock);


  Nope.  Can't be done  :-)


  Now if you're thinking about something that involves blocking, it may be
a bit trickier.  In OS/2 or Win32 you might think in terms of:

  WaitForMultipleObjects(Mutex1 and Mutex2)

you'll have to do a bit extra.  Perhaps you'll have two different threads 
blocking on the two mutexes:

    Thread 1
 pthread_mutex_lock(Mutex1);
 M1=TRUE;
 pthread_cond_signal(&requests_consumer);


    Thread 2
 pthread_mutex_lock(Mutex2)
 M2=TRUE;
 pthread_cond_signal(&requests_consumer);

    Thread 3
 while (!M1 || !M2)
   pthread_cond_wait(&requests_consumer, &requests_lock);



  I think this looks sort of ugly.  More likely you'll find a better way 
to structure your code.



=================================TOP===============================
 Q83: Shouldn't pthread_mutex_trylock() work even if it's NOT PTHREAD_PROCESS_SHARED?  

Curt,

  I infer you're trying to get around the lack of shared memory SVs
in some of the libraries by only using trylock?  I can't say that I
approve, but it ought to work...

  In the code example below I hacked up an example which does seem to
do the job.  I can't tell you what you were seeing in your tests.
Hmmm...  Just because this particular hack works on one OS does not
mean that it will necessarily work on another.  (Let's say I wouldn't
stake MY job on it!) 

  What about using shared semaphores?  Maybe SysV semaphores?

-Bil

> HI,
> 
> I'm having a problem with pthread_mutex_unlock () on Solaris 2.5 for a
> pthread_mutex_t inited in a shared memory structure between 2 tasks.
> 
> I get pthread_mutex_trylock (lockp) to return zero, and both tasks
> agree the mutex is locked.
> 
> When the owning task calls pthread_mutex_unlock (lockp), it returns
> zero, but the other task's pthread_mutex_trylock (lockp) still believes
> the mutex is locked??
> 
> FAQ location or help?  Thanks.
> 
> Here's how I initted the pthread_mutex_t struct:
> 
> In a shared memory struct:
> 
> pthread_mutex_t         mutex_lock =  PTHREAD_MUTEX_INITIALIZER;
> 
> Then either task may call:
> 
> pthread_mutex_trylock (&mutex_lock)
> ...work...
> pthread_mutex_unlock (&mutex_lock)
> 
> I've had little luck with  pthread_mutexattr_setpshared () to init for
> the "shared" scenario (core dumped) and especially that this is a Sun'ism
> that doesn't exist in DEC Unix 4.0b, which is another requirement, that
> the solution be portable to all/most Unix'es with threads.
> 
> Thanks.
> 
> Curt Smith
> [email protected]



       sleep(4-i/2);    /* wait a second to make it more interesting*/
      if (!err) {pthread_mutex_unlock(&buf->lock2);
          printf("Unlocked by parent\n");
        }
   }

  printf("Parent PID(%d): exiting...\n", getpid());
  exit(0);
}


main(int argc, char *argv[])
{int fd;
 pthread_mutexattr_t mutex_attr;

 /* open a file to use in a memory mapping */
 fd = open("/dev/zero", O_RDWR);

 /* Create a shared memory map with the open file for the data 
    structure which will be shared between processes. */

 buf=(buf_t *)mmap(NULL, sizeof(buf_t), PROT_READ|PROT_WRITE,
           MAP_SHARED, fd, 0);

 /* Initialize the counter and SVs.  PTHREAD_PROCESS_SHARED makes
    them visible to both processes. */


 pthread_mutexattr_init(&mutex_attr);
/* pthread_mutexattr_setpshared(&mutex_attr, PTHREAD_PROCESS_SHARED); */

 pthread_mutex_init(&buf->lock2, &mutex_attr);

 if (fork() == 0)
   my_child();
 else
   my_parent();
}
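
To illustrate the "shared semaphores" suggestion above, here is a sketch of a
POSIX semaphore placed in the mmap'd region and initialized with the pshared
flag set, which avoids relying on trylock tricks.  It assumes the platform
actually implements process-shared semaphores; the names shared_t and
make_shared_buf are invented for illustration.

#include <fcntl.h>
#include <semaphore.h>
#include <stdlib.h>
#include <sys/mman.h>

typedef struct {sem_t lock; int count;} shared_t;

shared_t *make_shared_buf(void)
{
    int       fd  = open("/dev/zero", O_RDWR);
    shared_t *buf = (shared_t *)mmap(NULL, sizeof(shared_t),
                                     PROT_READ|PROT_WRITE,
                                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) exit(1);

    /* Second argument != 0 means the semaphore is shared between processes. */
    sem_init(&buf->lock, 1, 1);
    return buf;
}

/* After fork(), either process may do:
     sem_wait(&buf->lock);  ... touch buf->count ...  sem_post(&buf->lock);  */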
=================================TOP===============================
 Q84: What about having a NULL thread ID?  

This is part of an on-going discussion.  The POSIX committee decided not to
do this.  It is, of course, possible to implement non-portable versions
yourself.  You would have to have a DIFFERENT version for different OSs.
-1L works fine for Solaris 2.5, and IRIX 6.2, but not HP-UX 10.30, which
requires (I think!) {-1, -1, -1}.  BE CAREFUL!!!
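
One portable way to code around the omission is to carry your own validity
flag next to the pthread_t instead of relying on a magic "null" value; a
minimal sketch (the type and helper names are invented here):

#include <pthread.h>

/* A pthread_t plus our own "is this slot populated?" flag, instead of a
   nonportable "null" thread ID like -1L. */
typedef struct {
    pthread_t tid;
    int       valid;
} maybe_tid_t;

static void tid_clear(maybe_tid_t *m)            { m->valid = 0; }
static void tid_set(maybe_tid_t *m, pthread_t t) { m->tid = t; m->valid = 1; }

/* Nonzero only if the slot is populated and holds thread t. */
static int tid_is(const maybe_tid_t *m, pthread_t t)
{
    return m->valid && pthread_equal(m->tid, t);
}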

Now for the discussion...

Ben Self wrote:
> 
> Ian Emmons wrote:
> >
> > So why not support a portable "null" value?  This
> > could be done via a macro that can be used to initialize a pthread_t (just
> > like the macro that initializes a mutex), or it could be done via a couple
> > of functions to set and test for the null value.  Or, the POSIX folks could
> > do as they are used to doing, and make us code around their omissions with
> > YAC (Yet Another Class).
> >
> > Ian
> >
> As I stated before I think that it is a very natural thing to want to
> do.  In my experience POSIX's omissions are usually more interesting
> than simple oversights.  Often an existing industry code base stands in
> the way or it was deemed too 'trivial' a matter of user code to merit
> imposing any restriction on implementation.  In any case, for all
> intents and purposes, it is a done deal.
> 
> As Dave Butenhof candidly exposes a few posts down:
> 
> "due to
> overwhelming agreement that it was a bad (and unnecessary) complication.
> The Aspen committee added suspend/resume to UNIX98 -- but the functions
> were later removed with no significant objections.
> 
> There simply is no significant industry consensus supporting these
> functions. (And that for many very good technical reasons as well as
> some silly political reasons.)"
> 
> that is exactly how POSIX works.  gotta love 'em
> 
> --Ben
=================================TOP===============================
 Q85: Explain Traps under Solaris  

Jim Moore - Senior Engineer, SunSoft wrote:
Email : [email protected]               |   DNRC: The Day Cometh
SunSoft Technical Support (Europe)         |   "adb is your friend"

    ---------------------------------------------------------------

                        SPARC traps under SunOS (Solaris)
                        -=-=-=-=-=-=-=-=-=-=-=-
                By: Jim Moore, SunSoft, Sun Microsystems Inc
                      Email: [email protected]

        CONTENTS

        1       Introduction
        1.1     Who should read this document?

        2       What is a trap?
        2.1     How traps are caused
        2.1.1   Precise Traps
        2.1.2   Deferred Traps
        2.1.3   Disrupt/Interrupt Traps
        2.2     How traps are dispatched to the kernel
        2.2.1   SPARC v7/v8
        2.2.2   SPARC v9
        2.2.2.1 Processor states, normal and special traps
        2.2.2.2 Normal Trap (Processor in Execute State)
        2.2.2.3 Special Trap (Processor in RED State)

        3       Traps - How SunOS Handles Them
        3.1     Generic Trap Handling
        3.2     Register Windows
        3.3     Interrupts

                        -=-=-=-=-=-=-=-=-=-=-=-=-

1       INTRODUCTION

        This document describes what traps are, how they work and how they
        are handled by the SunOS kernel.   We will look at traps for SPARC
        versions v7, v8 and v9 (v7 and v8 traps are essentially identical).

        In places, we will have to differentiate between v8 and v9
        quite extensively as there are significant differences between the
        two.

        I assume that the readers are familiar with SPARC registers but I
        will give some expansion on the more obscure ones as I go :-)

        Finally, I have made every effort to make this accurate as well as
        informative, but at the same time without overly lengthy descriptions.
        Even so, in parts it may be heavy going and plain ascii doesn't
        leave much scope for clear diagrams.  Feel free to post or email
        questions and comments!

1.1     Who should read this document?

        Anyone who wants to know more about traps in detail  on  the SPARC
        architecture.  I strongly recommend that you refer to one of these
        two books for more information:

        The SPARC Architecture Manual, Version 8        ISBN 0-13-099227-5
        The SPARC Architecture Manual, Version 9        ISBN 0-13-825001-4

        as they contain more detail on some of these topics.

2       WHAT IS A TRAP?

        The design of SPARC as a RISC processor means that a lot of the
        functionality that is normally controlled by complex instructions
        has to be done by supervisor (kernel) software.  Examples of these
        could be memory exception handling or interrupt handling.  When
        a situation arises that requires special handling in this way,
        a trap occurs to cause the situation to be handled.  We'll look
        at this mechanism in more detail in this section.

2.1     How Traps are Caused

        Traps can be generated for a number of reasons and you can see
        a list of traps under /usr/include/sys/v7/machtrap.h or under
        /usr/include/sys/v9/machtrap.h for SPARC v7/v8 and v9 respectively.

        A trap can be caused either by an exception brought about by the
        execution of an instruction or by the occurrence of some external
        interrupt request not directly related to the instruction.  When
        the IU (Integer Unit, the part of the CPU that contains the
        general purpose registers, does integer math and executes the
        instructions) is about to execute an instruction it first checks
        to see if there are any exception or interrupt conditions pending
        and, if so, it selects the highest priority one and causes a trap.

        Traps are also used to signal hardware faults and malfunctions,
        for example a level 15 asynchronous memory fault.  In some fatal
        conditions execution cannot continue and the machine will halt
        or the supervisor software will handle the trap by panicking.

        Next, we'll take a generic look at the different trap categories
        but we'll go into the version specific details later on.

2.1.1   Precise Traps

        A precise trap is brought about by an exception directly caused by
        the executing instruction.  This trap occurs before there is any
        tangible change in the program state of the program that contained
        the trapped instruction.

2.1.2   Deferred Traps

        A deferred trap is similar to a precise trap but in this case the
        program-visible state may have changed by the time the trap occurs.
        Such a trap may in theory occur one or more instructions after the
        trap inducing instruction has executed but it must occur before
        any subsequent instruction attempts to use any modified register
        or resource that the trap inducing instruction used.

        Did you get that?  Hmm.  Okay, here's an example.  Imagine that
        a floating point operation is being executed.  This does not
        happen synchronously with IU instructions and so it is possible
        that a floating point exception could occur as a deferred trap.

2.1.3   Disrupt/Interrupt Traps

        An interrupt trap, as you have probably guessed, is basically the
        assertion of an interrupt, either generated externally (from
        hardware) or internally (via software).  The delivery of interrupts
        is controlled by the PIL (Processor Interrupt Level) field of the
        PSR (Processor State Register), which specifies the minimum
        interrupt level to allow, and also by the mask of asserted bits in
        the IE (Interrupt Enable register...architecture specific).

        Under SPARC v9, we have a concept of a disrupt trap.  This is very
        similar to a deferred trap in that it could be related to an earlier
        instruction but in this case the trap is an unrecoverable error.

2.2     How Traps are Dispatched to the Kernel

        In this section we will look at the flow of execution into the
        kernel when a trap occurs.  This is different for SPARC v7/v8
        and v9 so we will split this section into two.

2.2.1   SPARC v7/v8

        When a trap occurs, the flow of execution jumps to an address which
        is calculated from the TBR (Trap Base Register) and the Trap Type,
        hereafter referred to as TT.  The sequence is as follows:

        1.  An exception/interrupt has been detected as pending by the IU

        2.  The IU multiplies the TT by 16 (TT << 0x4) as there are 4
            instructions per trap table entry.

        3.  The IU loads the address of the trap table (from the TBR) and
            adds the offset calculated in (2).

        4.  The CWP (Current Window Pointer) is decremented, so that we are
            in a new register window.

        5.  The trapped instruction (%pc) and the next instruction to be
            executed (%npc) are written into local registers %l1 and %l2

        6.  Traps are disabled and the current processor mode is set to
            "supervisor".  This is done by setting the ET bit to zero and
            the supervisor mode bit to one in the PSR (refer to the PSR
            description in /usr/include/v7/sys/psr.h).

        7.  Execution resumes at [TBR + (TT<<4)], as calculated in (3)

        Part of the SunOS kernel code is a trap table, which contains 256
        4-instruction entries, each entry corresponding to a trap type from
        0 to 0xff.  This structure is defined by the SPARC implementation.

        Each trap table entry basically contains a branch to a trap handling
        routine and may also load the PSR into a local register for use
        later.  Here's an example of a trap table entry:

                sethi   %hi(trap_handler), %l3          ! Load trap handler
                jmp     [%l3 + %lo(trap_handler)]       ! address and jump
                mov     %psr, %l0                       ! Delay: load %psr
                nop
 
2.2.2   SPARC v9

        The SPARC v9 case is quite different from the v7/v8 case mainly
        due to the concept of processor states and trap nesting.

        We still use a trap table concept under v9 but the destination
        address for the transfer of execution is calculated differently.
        Also, trap table entries for v9 are 8 instructions in size,
        except for spill/fill traps, in which case the entries are 32
        instructions in size.  Also, in a special state called the RED
        state (more on that later) we actually use a different trap table!

        Pretty different, huh?

        The trap table is divided into three parts.  The first half of the
        table is used for machine generated traps.  The next quarter is
        reserved for software initiated traps and the final quarter is
        reserved for future use.  The displacement into the trap table is
        defined by Trap Level (TL) and the Trap Type (TT) together.

        Let's take a look at this in some more detail.  I strongly advise
        that you obtain a copy of the version 9 SPARC architecture manual
        if you want to follow this in detail.

        When a trap occurs, the action taken depends on the TT, the current
        level of trap nesting (contained in the TL) and the processor state
        at that time.  Let's look at processor states and what we mean by
        normal and special traps so that the rest of this section has more
        chance of making sense!

2.2.2.1 Processor States, Normal and Special Traps

        The SPARC v9 processor is always in one of three states and these
        are:

        1.  Execute state.  This is the normal execution state.

        2.  RED state.  RED = Reset, Error and Debug.  This is a state that
            is reserved for handling traps when we are at the penultimate
            level of trap nesting or for handling hardware or software
            interrupts.

        3.  Error state.  This is a state that is entered when we have a
            trap occur at a point in time when we are at our maximum level
            of trap nesting (MAXTL) or an unrecoverable fatal error has
            occurred.

        Normal traps are traps that are processed when we are in the nice
        cosy execute state.  If we trap in RED state, then this is a special
        trap.  There is an implementation dependent address called RSTVaddr
        which contains the vector to the RED state trap table.  This vector
        could be set to overlay the same one in the TBR.  For a given trap
        in RED state we vector as follows:

        TT      Vector          Reason

        0       RSTVaddr|0x0    SPARC v8 style reset
        1       RSTVaddr|0x20   Power On Reset (POR)
        2       RSTVaddr|0x40   Watchdog Reset (WDR)
        3       RSTVaddr|0x60   Externally Initiated Reset (XIR)
        4       RSTVaddr|0x80   Software Initiated Reset (SIR)
        Others  RSTVaddr|0xa0   All other traps in RED state

        A fatal exception that causes us to drop into error state
        will cause the processor to note the exception and either halt,
        reset or watchdog reset.  After the reset, the processor enters
        RED state with a TL appropriate to the type of reset (usually
        maximum).  Also, the TT is set to the value of the original trap
        that caused the reset and NOT the TT value for the reset itself
        (ie. WDR - Watchdog reset or XIR - Externally Indicated Reset).

        Now that we have a concept of the different traps and processor
        states, let's look at the sequence of execution when a trap occurs
        to deliver the trap to the supervisor (kernel).

2.2.2.2 Normal Trap (Processor in Execute State)

        1.  If TL = MAXTL-1, the processor enters RED state (Goto 2.2.2.3).

        2.  TL = TL + 1

        3.  Processor state, %pc, %npc, CWP, CCR (Condition Code Register),
            TT and ASI (Address Space Identifier register) are saved.

        4.  The PSTATE (Processor State) register is updated as follows:

                a) RED field set to zero
                b) AM (Address Masking) disabled
                c) PRIV (Privileged Mode) enabled
                d) IE cleared, disabling interrupts
                e) AG set (Alternate Global Registers enabled)
                f) Endian mode set for traps (TLE)

            Refer to the architecture manual for a description of PSTATE

        5.  If TT is a register window trap, CWP is set to point to the
            register window to be accessed by the trap handler code.
            Possibilities are:

                a) TT = 0x24 (Clean Window), CWP = CWP + 1
                b) TT >= 0x80 AND TT <= 0xbf (Window Spill),
                   CWP = CWP + CANSAVE + 2.  CANSAVE is a register that
                   contains the number of register windows following the
                   CWP that are NOT in use.
                c) TT >= 0xc0 AND TT <= 0xff (Window fill),
                   CWP = CWP - 1

            For non-register window traps, CWP remains unchanged.

        6.  Control is transferred to the trap table at an address calculated
            as follows:

                New %pc =  TBA | (TL>0 ? 1 : 0) | TT
                New %npc = TBA | (TL>0 ? 1 : 0) | TT | 0x4

            Remember, TBA = Traptable Base Address, similar to the TBR in v8
            Execution then resumes at the new %pc and %npc

2.2.2.3 Special Trap (Processor in RED State)

        1.  TL = MAXTL

        2.  The existing state is preserved as in 2.2.2.2, step 3.

        3.  The PSTATE is modified as per 2.2.2.2, step 4 except that the
            RED field is asserted.

        4.  If TT is a register window trap, CWP processing occurs as in
            2.2.2.2, step 5.

        5.  Implementation-specific state changes may occur.  For example,
            the MMU may be disabled.

        6.  Control is transferred to the RED state trap table subject to
            the trap type.  Look back to 2.2.2.1 for the RSTVaddr
            information to see how this vector is made.

        This may seem rather complicated but once you have the picture
        built clearly it will all fall into place.  Post or email if you
        need clarification.

3       TRAPS - HOW SUNOS HANDLES THEM

        In this section we will look at how SunOS handles traps and look
        at some of the alternatives which were available.  Despite all the
        differences between SPARC v8 and v9 traps I'll do a fairly generic
        description here as it really isn't necessary to describe in detail
        what SunOS does for v9 traps as you can see from the previous
        section what the differences in trap processing are.  Suffice to
        say that the SunOS kernel adheres to those rules.  Instead, we'll
        concentrate on the principles used by the kernel when handling
        various traps.

3.1     Generic Trap Handling

        We'll look at some specifics in a moment but first we'll cover the
        generic trap handling algorithm.

        When traps are handled, the typical procedure is as follows:

        1.  Check CWP.  If we need to handle the trap by jumping to
            'C' (which would use save and restore instructions between
            function calls) then we must make sure we won't cause
            an overflow when we dive into 'C'.  If we do detect that
            this would be a problem we do the overflow processing now.

        2.  Is this an interrupt?  If so, jump to the interrupt handler.
            Refer to section 3.3 on interrupts.

        3.  Enable traps and dive into the standard trap handler.  We
            enable traps so that we can catch any exceptions brought
            about by handling *this* trap without causing a watchdog reset.

        4.  On return from the trap handler, we check the CWP with the
            CWP we came in with at the start to see if we have to undo
            the overflow processing we might have done before, so that
            we don't get an underflow when we return to the trapped
            instruction (or worse, execution continues in the WRONG window).

        5.  Before we actually return to the trapped instruction, we check
            to see if kprunrun is asserted (ie. a higher priority lightweight
            process is waiting to run).  If so, we allow preemption to
            occur.

        Traps are used by SunOS for system calls as well as for machine
        generated exceptions.  The parameters to the system call are
        placed in the output registers, the number of the system call
        required (see /usr/include/sys/syscall.h) is placed in %g1 and
        then it executes a "ta 0x8" instruction.  This appears in the kernel
        as a trap with TT = 0x88 and the system trap handler determines
        this to be a system call and calls the relevant function as per
        the system call number in %g1.
 
        Occasionally, a process will attempt to execute from a page of
        VM that is not mapped in (ie. it is marked invalid in the MMU)
        and this will cause a text fault trap.  The kernel will then
        attempt to map in the required text page and resume execution.
        However, if the process does not have the correct permissions
        or the mapping cannot be satisfied then the kernel will mark
        a pending SIGSEGV segmentation violation against that process
        and then resume execution of the process.  A similar scenario
        applies to data faults; a process attempts to read or write to
        an address in a page marked invalid in the MMU and the kernel
        will attempt to map in the corresponding page for this address
        if possible (ie. maybe the page has been swapped out or this
        is the first attempt to read from that page and so we demand-page
        it in).  I'll explain all this in detail in another text on
        process address spaces, paging and swapping which I plan to do
        as soon as I get time.

        A "bad trap" is simply a trap that cannot be handled (or isn't
        supported).  Usually under SunOS a bad trap has a type of 9 or
        2, for data or text fault respectively (maybe 7 for alignment in
        some cases).

3.2     Register Windows
=================================TOP===============================
 Q86: Is there anything similar to posix conditions variables in Win32 API ?  

Ian Emmons wrote:
> 
> Dave Butenhof wrote:
> >
> > kumari wrote:
> > >
> > > Is there anything similar to posix conditions variables in Win32 API ?
> > > Thanks in advance for any help.
> > > -Kumari
> >
> > The answer depends very much upon what you mean by the question. Win32
> > has "events", which can be used to accomplish similar things, so the
> > answer is clearly "yes". Win32 events, however, behave, in detail, very
> > differently, and are used differently, so the answer is clearly "no".
> > Which answer do you prefer? ;-)
> 
> Good answer, Dave.  This is one of the most frustrating things about the
> Win32 threading API.  CV's are incredibly powerful and fairly easy to use,
> but Win32 unfortunately omitted them.
> 
> In WinNT 4.0, there is a new API called SignalObjectAndWait which can be used
> to implement a CV pretty easily.  There are two problems:
> 
> (1) This API is not available on WinNT 3.51 or Win95.  Hopefully it will show
> up in Win97, but I don't know for sure.
> 
> (2) Using this API with a mutex and an auto-reset event, you can create a
> CV-lookalike where PulseEvent will behave like pthread_cond_signal, but there
> is no way to imitate pthread_cond_broadcast.  If you use a mutex and a
> manual event, PulseEvent will behave like pthread_cond_broadcast, but there
> is no way to imitate pthread_cond_signal.  (Sigh ...)
> 
> I know ACE has a Win32 CV that works in general, but I seem to recall Doug
> Schmidt saying that it's very complex and not very efficient.
> 
> Ian
=================================TOP===============================
 Q87: What if a cond_timedwait() times out AND the condition is TRUE?  

  [This comment is phrased in terms of the JAVA API, but the issues are the
same. -Bil]

> After thinking about this further even your simple example can fail.
> Consider this situation. A number of threads are waiting on a condition,
> some indefinitely, some with timeouts. Another thread changes the
> condition, sets your state variable and does a notify(). One of the waiting
> threads is removed from the wait-set and now vies for the object lock. At
> about the same time a timeout expires on one of the other waiting threads
> and it too is removed from the wait-set and vies for the lock - it gets the
> lock! This timed-out thread now checks the state variable and wrongly
> concludes that it received a notification.
> 
> In more complex situations where there are multiple conditions and usage of
> notifyAll() even more things can go wrong.

  You are correct with everything you say, right up until the very last word.
The behavior is NOT wrong.  It may not be what you *expected*, but it's not
wrong.  This is a point that's a bit difficult to get sometimes, and it drives
the real-time crowd to distraction (as well it should), but for us time-shared
folks, it's cool.

  When you get a time-out, you've got a choice to make.  Depending upon what
you want from your program, you may choose to say "Timed-out! Signal error!"
or you may choose to check the condition and ignore the time out should it
be true.  You're the programmer.

  An important detail here...  Everything works CORRECTLY.  In particular, if
a thread receives a wakeup, it is removed from the wait queue at that point
and CANNOT subsequently receive a timeout.  (Hence it may take another hour
before it obtains the mutex, but that's OK.)  A thread which times out will
also be removed from the sleep queue and a subsequent wakeup
(pthread_cond_signal()) will be delivered to the next sleeping thread (if
any).
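
  In code, the "check the condition and ignore the time-out" choice looks
something like this sketch (the predicate and function names are invented):

#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Wait until abstime for *ready to become true.  Returns 1 if the predicate
   is true on exit -- even if the last wait timed out -- and 0 otherwise. */
int wait_until_ready(pthread_mutex_t *m, pthread_cond_t *cv,
                     int *ready, const struct timespec *abstime)
{
    int rc = 0, result;

    pthread_mutex_lock(m);
    while (!*ready && rc != ETIMEDOUT)
        rc = pthread_cond_timedwait(cv, m, abstime);

    result = *ready;          /* the predicate, not the timeout, decides */
    pthread_mutex_unlock(m);
    return result;
}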

=================================TOP===============================
 Q88: How can I recover from a dying thread?  


[OK.  So that's not *exactly* the question being addressed here, but that's
the most important issue.  -Bil]

David Preisler wrote:
> 
> I wish to create an *efficient* and *reliable* multi process/multi threaded
> algorithm that will allow many simultaneous readers (for efficiency) to access
> a block of shared memory,  but allows one and only one writer.
> 
> How could a read counter be guaranteed to be always correct even if your read
> thread or process dies???

Sorry to disillusion you, but this is impossible. Remember that you're
talking about SHARED MEMORY, and assuming that SOMETHING HAS GONE WRONG
with some party that has access to this shared memory. Therefore, there
is no possibility of any guarantees about the shared memory -- including
the state of the read-write lock.

You can approximate the guarantees you want by having some third party
record the identity of each party accessing the lock, and periodically
validate their continued existence. You could then, (assuming you'd
coded your own read-write lock that allowed this sort of manipulation by
a third party), "unlock" on behalf of the deceased read lock holder.
Just remember that the fact that the party had DECLARED read-only
intent, by locking for read access, doesn't guarantee that, in the
throes of death, it couldn't have somehow unintentionally modified the
shared data. A read-lock really is nothing more than a statement of
intent, after all. And how far do you wish to trust that statement from
a thread or process that's (presumably equally unintentionally) blowing
its cookies?

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698       http://www.awl.com/cp/butenhof/posix.html |
\-----------------[ Better Living Through Concurrency ]----------------/

=================================TOP===============================
 Q89: How to implement POSIX Condition variables in Win32?  

Subject: Re: How to implement POSIX Condition variables in Win32 (LONG)
Douglas C. Schmidt wrote:

The following function creates and initializes a condition variable:

int
pthread_cond_init (pthread_cond_t *cv, const pthread_condattr_t *)
{
  cv->waiters_ = 0;
  cv->generation_count_ = 0;
  cv->release_count_ = 0;

  // Create a manual-reset Event.
  cv->event_ =
    ::CreateEvent (NULL,  /* no security */
                   TRUE,  /* manual reset */
                   FALSE,  /* non-signalled */
                   NULL); /* unnamed */
}

The following pthread_cond_wait function waits for a condition
and atomically releases the associated external_mutex:

int
pthread_cond_wait (pthread_cond_t *cv, pthread_mutex_t *external_mutex)
{
  // Remember which generation we started waiting in.
  int c = cv->generation_count_;

  cv->waiters_++;

  for (;;)
  {
    ::LeaveCriticalSection (external_mutex);

    // Wait until the event is signaled.
    ::WaitForSingleObject (cv->event_, INFINITE);

    ::EnterCriticalSection (external_mutex);

    // Exit the loop when the event_
    // is signaled and there are still waiting
    // threads from this generation that haven't
    // been released from this wait yet.
    if (cv->release_count_ > 0
        && cv->generation_count_ != c)
      break;
  }

  --cv->waiters_;

  // If we're the last waiter to be notified
  // then reset the manual event.

  cv->release_count_--;

  if (cv->release_count_ == 0)
    ::ResetEvent (cv->event_);
}

This function loops until the event_ HANDLE is signaled and at least
one thread from this ``generation'' hasn't been released from the wait
yet.  The generation_count_ field is incremented every time the
event_ is signaled via pthread_cond_broadcast or pthread_cond_signal.
It tries to eliminate the fairness problems with Solution 1, so that
we don't respond to notifications that have occurred in a previous
``generation,'' i.e., before the current group of threads started
waiting.

The following function notifies a single thread waiting on a Condition
Variable:

int
pthread_cond_signal (pthread_cond_t *cv)
{
  if (cv->waiters_ > cv->release_count_)
    {
      ::SetEvent (cv->event_);
      cv->release_count_++;
      cv->generation_count_++;
    }
}

Note that we only signal the Event if there are more waiters than
threads currently being released.

Finally, the following function notifies all threads waiting on a
Condition Variable:

int
pthread_cond_broadcast (pthread_cond_t *cv)
{
  if (cv->waiters_ > 0)
    {
      ::SetEvent (cv->event_);
      cv->release_count_ = cv->waiters_;
      cv->generation_count_++;
    }
}

Unfortunately, this implementation has the following drawbacks:
 
1. Busy-waiting -- This solution can result in busy-waiting if the
waiting thread has highest priority.  The problem is that once
pthread_cond_broadcast signals the manual reset event_ it remains
signaled.  Therefore, the highest priority thread may cycle endlessly
through the for loop in pthread_cond_wait.

2. Unfairness -- The for loop in pthread_cond_wait leaves the critical
section before calling WaitForSingleObject.  Thus, it's possible that
another thread can acquire the external_mutex and call
pthread_cond_signal or pthread_cond_broadcast again during this
unprotected region.  Thus, the generation_count_ will increase, which
may fool the waiting thread into breaking out of the loop prematurely
and stealing a release that was intended for another thread.

3. Potential for race conditions -- This code is only correct provided
that pthread_cond_signal and pthread_cond_broadcast only ever called
by a thread that holds the external_mutex.  That is, code that uses
the classic "condition variable signal" idiom shown above will work.

Doug
--
Dr. Douglas C. Schmidt                  ([email protected])
Department of Computer Science, Washington University
St. Louis, MO 63130. Work #: (314) 935-4215; FAX #: (314) 935-7302
http://www.cs.wustl.edu/~schmidt/
=================================TOP===============================
 Q90: Linux pthreads and X11  

Date: Mon, 16 Feb 1998 17:47:03 +0000
Organization: Visix Software

Steve Cusack wrote:

> I've just started using Linux pthreads and have immediately run into
> the fact that the package appears to be incompatible with X.  I've
> read (via DejaNews) that X is "threads unsafe" and that others have
> had similar problems (X refusing to start).  Does anyone have X11 and
> pthreads working together on a Linux system?  If so, what did you have
> to do?

I ported Vibe (a multi-threaded Java IDE) to Linux, so I have made X11 and
linuxThreads work together.  It's not easy, unless you can target a glibc2
platform.

You need to do one of two things: either get/create a recompiled
-D_REENTRANT version of the X libraries, or patch linuxThreads to use the
extern int errno as the errno for the initial thread.  I chose to create
thread aware Xlibs.  (That was not fun.  For some reason the build process
would get into an infinite loop.  I don't remember how I got it to work)

You should be able to search Deja-news for pointers to patching
LinuxThreads.

Doug
-- [email protected]
Replace "xyzzy" with Victor India Sierra India X-Ray to email me.

        ================

Get yourself a copy of Redhat 5.0. It comes with pthreads and thread
safe X libs.

    -Arun

        ================

For X libraries compiled this way for ix86-libc5, see

  ftp://ftp.dent.med.uni-muenchen.de/pub/wmglo/XFree86-3.3-libs.tar.gz

Strange, for me it was very easy.  Just adding -D_REENTRANT next to
-D_POSIX_SOURCE in linux.cf did the trick.

But remember:  These libs are still not thread-safe (you can make only
one X11 call at a time -- I don't think this is too bad).

The better option at this stage really is glibc2 with X11 libs
compiled with XTHREADS.

Regards,
Wolfram.
=================================TOP===============================
 Q91: One thread runs too much, then the next thread runs too much!  

[I've seen variations on this concern often.  Johann describes the problem
very well (even if he can't find the "shift" key).  -Bil]

===================  The Problem  ================================
Johann Leichtl wrote:
> hi,
>
> i have a a problem using pthreads in c++.
>
> basically what i want to do is have class that manages a ring buffer and
> say 2 threads, where one adds entries to the buffer and one that removes
> them.
> i have a global object that represents the buffer and the member
> functions make sure that adding and removing entries from the buffer
> work ok.
>
> the 2 functions:
>
> inline void ringbuf::put(int req)
> {
>   pthread_mutex_lock(&mBufLock);
>   while(elemNum == size)
>     pthread_cond_wait(&cNotFull, &mBufLock);
>   buf[next] = req;
>   next = (next + 1) % size;
>   elemNum++;
>   pthread_cond_signal(&cNotEmpty);
>   pthread_mutex_unlock(&mBufLock);
> }
>
> inline void ringbuf::get(int& req)
> {
>   pthread_mutex_lock(&mBufLock);
>   while(elemNum == 0)
>     pthread_cond_wait(&cNotEmpty, &mBufLock);
>   req = buf[last];
>   last = (last + 1) % size;
>   elemNum--;
>   pthread_cond_signal(&cNotFull);
>   pthread_mutex_unlock(&mBufLock);
> }
>
> now my problem is that my consumer thread only wakes up after the buffer
> is full. i tried different buffer sizes and simulated work in both
> producer and consumer.
> when i use a sleep() function in the producer (and or consumer) i can
> get the thing to look at the buffer earlier.
>
> i was wondering if anybody would have some input on what the problem
> could be here. i've done something similar with UI threads and not C++
> and it works fine.
>
> thanks a lot.
>
> hans
> [email protected]

===================  The Solutions  ================================

Use either threads w/ system scope or call thr_setconcurrency to
increase the concurrency level...

Here's a handy thing to stick in a *Solaris* pthreads program:

#include <thread.h>     /* thr_setconcurrency */
#include <unistd.h>     /* sysconf */

        thr_setconcurrency(sysconf(_SC_NPROCESSORS_ONLN)+1);

This will give you as much actual concurrency as there are processors
on-line plus one.  It's a starting point rather than a fix-all, but
will cure some of the more obvious problems...

- Bart

--
Bart Smaalders                  Solaris Clustering      SunSoft
[email protected]         (415) 786-5335          MS UMPK17-201
http://playground.sun.com/~barts                        901 San Antonio Road
                                                        Palo Alto, CA
94303

        ================
Johann,

  No, actually you *don't* have a problem.

  Your program works correctly; it just happens to work in a slightly
unexpected fashion.  The functions that call put & get are probably
unrealistically simple and you are using local scheduling.  First the
buffer fills up 100%, then it completely empties, then it fills up
again, etc.

  Make your threads system scoped and you'll get what you expect.
[You'll notice Bart suggests a different method for obtaining the
same results (ie. more LWPs).  I like this method because I think
it's a clearer statement of intention AND PROCTOOL will give me
LWP statistics, but not thread statistics.]


  (You can look on the web page below for exactly this program, written
in C, one_queue_solution.c.)

-Bil
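
[A minimal sketch of creating the producer and consumer as system-scoped
(bound) threads, as suggested above; producer and consumer stand in for
functions that loop over ringbuf::put and ringbuf::get:]

#include <pthread.h>

extern void *producer(void *);   /* loops calling ringbuf::put */
extern void *consumer(void *);   /* loops calling ringbuf::get */

void start_threads(void *buf)
{
    pthread_attr_t attr;
    pthread_t prod, cons;

    pthread_attr_init(&attr);
    /* Bound (system contention scope) threads each get their own LWP,
       so the consumer can run as soon as it is signaled. */
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

    pthread_create(&prod, &attr, producer, buf);
    pthread_create(&cons, &attr, consumer, buf);
}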
=================================TOP===============================
 Q92: How do priority levels work?  

Kamal Kapila wrote:
> 
> Hi there,
> 
> I'm working on an internal package to provide platform-independent
> thread services (the initial platforms are DECUNIX 4.0 and Windows NT).
> The problem I'm having is understanding the thread scheduling on
> DECUNIX.
> 
> It would seem to me logical that the threads of a process would have the
> same priority and policy of their associated process by default.
> However, when I check the process priority/policy I get completely
> different values from when I check the individual thread priorities and
> policies.  In fact, the priority values do not seem to even follow the
> same scale (I can set process priorities from 0-63, while thread
> priorities go only from 0-31).  In addition, setting the process
> priority does not seem to affect the thread priorities at all (!).

Basically, there are "interaction issues" in implementing a 2-level
scheduling model (as in Digital UNIX 4.0) that POSIX didn't attempt to
nail down. We deferred dealing with these issues until some form of
industry consensus emerged. That industry consensus has since not
merely "emerged", but has become a mandatory standard in the new Single
UNIX Specification, Version 2 (UNIX98).

With 2-level scheduling, it really doesn't make much sense to "inherit"
scheduling attributes from the process -- because those attributes MEAN
entirely different things. Digital UNIX, by the way, doesn't really have
a "process" -- it has (modified) Mach tasks and threads. (There are a
set of structures layered over tasks to let the traditional UNIX kernel
code deal with "processes" in a more or less familiar way, but a process
is really sheer illusion.) Since tasks aren't scheduled, they really
have no scheduling attributes -- threads do. Since a non-threaded
process has a task and a single thread, the 1003.1b (realtime)
scheduling functions operate, in general, on the "initial thread" of the
specified "process".

The kernel thread scheduling attributes control scheduling between
various kernel threads. But a POSIX thread is really a user object, that
we map onto one or more kernel threads (which we call "virtual
processors"). Pretending to set the scheduling attributes of this thread
to the "process" attributes makes no sense, because the scheduling
domain is different. POSIX threads are scheduled only against other
threads within the process -- not against kernel threads in other
processes.

POSIX provides a way to create threads that you really want to be
scheduled against other kernel threads -- essentially, forcing the POSIX
thread to be "bound" to a kernel thread itself, at the expense of (often
substantially) higher scheduling costs. This is called "system
contention scope". Digital UNIX 4.0 didn't support system contention
scope (which is an optional feature of POSIX), but we've added it for
the next version (4.0D).

Each system contention scope (SCS) thread has its own scheduling
attributes, independent of the process. While it might make some
intuitive sense to inherit the process priority, POSIX doesn't provide
any such semantics. A newly created thread either has explicit
scheduling attributes, or inherits the attributes of the thread that
created it. Of course, since setting the "process" attributes affects
the initial thread, threads that IT creates will inherit the "process"
attributes by default. But changing the "process" attributes won't (and
shouldn't) affect any SCS threads in the process.

The ambiguity (and the only relevant question for the implementation
you're using, which doesn't support SCS threads) is: what happens to
the virtual processors that are used to execute POSIX threads, when the
"process" scheduling attributes are changed? And with what attributes
should they run initially? UNIX98 removes the (intentional) POSIX
ambiguity by saying that setting the 1003.1b scheduling attributes of
the "process" WILL affect all "kernel entities" (our virtual processors,
Sun's LWPs) used to execute process contention scope (PCS, the opposite
of SCS) threads. By extension, the virtual processors should initially
run with the existing process scheduling attributes.

This will be true of any UNIX98 branded system -- but until then,
there are no portable rules.

The fact that the POSIX thread interfaces don't use the same priority
range as the system is a stupid oversight -- I just didn't think about
it when we converted from DCE threads to POSIX threads for 4.0. This has
been fixed for 4.0D, though it's a bit too substantial a change (and
with some potential risk of binary incompatibilities) for a patch.

> (BTW, I am using sched_getparam() and sched_getscheduler() to get the
> process-related values and pthread_getschedparam() to get the thread-related
> values).

Right.

> Specifically, I have the following questions :
> 
> - What is the relationship between the process priority/policy and the
> thread priority and policy  ?

There's very little relationship. Each POSIX thread (SCS or PCS) has its
own scheduling attributes (priority and policy) that are completely
independent of "process" attributes. UNIX98, however, says that the
"kernel entities" used to execute PCS POSIX threads WILL be affected by
changes to the "process" scheduling attributes -- but SCS threads will
not (and should not) be affected by such changes. (Nor will the
scheduling attributes of PCS threads, even though their "system
scheduling attributes" effectively come from the virtual processor,
which is affected.)

> - Does the scheduler schedule individual threads independently, or are
> processes scheduled, with a process's threads then sharing the process
> CPU time?

As I said, there's no such thing as a process, and the closest analog,
the Mach task, isn't a schedulable entity. All threads are scheduled
independently -- each has its own scheduling attributes, its own time
slice quantum, etc. On Digital UNIX 4.0, with only PCS threads, the
kernel schedules the virtual processor threads of all the processes
(plus the single kernel threads associated with all non-threaded
processes). Threaded processes also contain a user-mode scheduler, which
assigns PCS threads to the various virtual processors, based on the PCS
thread scheduling attributes. (A process has one virtual processor for
each available physical processor on the system.)

On Digital UNIX 4.0D, with SCS thread support added, each process may
also have any number of SCS threads, which map directly to individual
and independent kernel threads. SCS threads are scheduled the same as
virtual processors -- each has its own scheduling attributes, time slice
quantum, etc.

(It might seem that managing CPU time by kernel threads rather than by
processes allows users to monopolize the system by creating lots of
kernel threads. But they could do that by creating lots of processes,
too... and a kernel thread is cheaper for the system than a process,
which is really a thread plus a task. The ability to create new kernel
threads, as well as processes, is limited both by user and system
quotas. And of course, in 4.0, users can't actually create new kernel
threads -- only POSIX threads, which are mapped to the process' existing
virtual processors.)

So each process presents a set of runnable kernel threads to the kernel:
A mix of SCS threads and the various PCS threads currently mapped on to
one or more virtual processors. The kernel then determines which kernel
threads to schedule on each processor. (That's why it's called "2-level
scheduling".)

> - Is the thread's overall priority a combination of the process priority
> and the individual thread priority ? If so, how is this determined ?

Currently, "process" priority is irrelevant for a threaded process.
Virtual processors don't inherit the process priority. (Actually, they
sort of do, and the first virtual processor is the initial process
thread, which can be changed using the 1003.1b functions -- but the
kernel generates "replacement" virtual processors at various times, and
these currently are always set to the default scheduling attributes
[timeshare policy and priority 19].)

POSIX thread priority determines which threads the user-mode scheduler
assigns to the various virtual processors. Because the virtual processor
priority doesn't change (the whole point of 2-level scheduling is to
avoid expensive kernel calls), the POSIX thread priority has no effect
on the kernel scheduling. That's OK, except in rare cases where
applications comprising multiple PROCESSES have threads (in different
processes) that really need to directly preempt each other based on
priority.

> I have read through all of the Digital documentation that I have but I
> have not been able to find any clear answers to my questions.

A description of the behavior (though in less technical/internal detail
than the one in this posting) can be found in Appendix A (section A.3)
of the Digital UNIX "Guide to DECthreads" manual.

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698       http://www.awl.com/cp/butenhof/posix.html |
\-----------------[ Better Living Through Concurrency ]----------------/
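
[A minimal sketch of creating a system contention scope thread with
explicit scheduling attributes, as described above; the SCHED_RR policy,
the priority value, and the worker function are illustrative only:]

#include <pthread.h>
#include <sched.h>

extern void *worker(void *);

int make_scs_thread(pthread_t *tid)
{
    pthread_attr_t attr;
    struct sched_param param;

    pthread_attr_init(&attr);
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    /* Don't inherit the creator's attributes; use the ones set here. */
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_RR);
    param.sched_priority = sched_get_priority_min(SCHED_RR) + 1;
    pthread_attr_setschedparam(&attr, &param);

    return pthread_create(tid, &attr, worker, NULL);
}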
=================================TOP===============================
 Q93: C++ member function as the startup routine for pthread_create().  



Hi
You need a static member function
    static void *function_name(void *);
in the class declaration.

Usually you then pass the object's address as the parameter so the
static function can access the non-static members of that particular
instance.
- Robert
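
[A minimal sketch of that static-member trampoline; the class and member
names are illustrative only:]

#include <pthread.h>

class foo {
public:
    void *go(void *arg) { /* real work for this instance */ return arg; }

    // Trampoline: matches the signature pthread_create expects.
    static void *go_trampoline(void *self) {
        return static_cast<foo *>(self)->go(0);
    }
};

int start(foo *bar)
{
    pthread_t tid;
    // Pass the object's address so the static function can reach the
    // non-static members of this particular instance.
    return pthread_create(&tid, 0, &foo::go_trampoline, bar);
}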

On 19 Sep 1997 03:15:22 GMT, [email protected] (Phil Romig) wrote:
>
>
>I know I should be able to figure this out, but I'm missing something.
>
>I want to pass a member function as the startup routine for pthread_create().
>That is, I want to create an instance of a class and then pass one of
>the public member functions as the start routine to pthread_create().
>
>I believe the question comes down to how to describe the address of the
>member.  Simple qualification (class::member) will not work because I
>need the address of the function that goes with the particular
>instance of the class.
>
>For the record I'm working on an HPUX 10.01 system so I'm using pthreads
>draft 4, rather than the current standard.  
>
>Any advice, pointers and suggestions are welcome.
>thanks
>Phil
>[email protected]
>
>
>A quick example of what I want to try:
>
>  class foo {
>  public:
>   foo(int i);
>   void *go(void *arg);
>  };
>
> main() {
>  foo *bar = new foo(1);
>
>  pthread_create(...,&(bar->go),....);
>}


=================================TOP===============================
 Q94: Spurious wakeups, absolute time, and pthread_cond_timedwait() 

[Bil: The summary is "Retest conditions with CVs." and "The time-out is an
absolute time because it is."  (NB: Deltas are a proposed extension to POSIX.)
This is a nice exposition.]

Brian Silver wrote:
> 
> Ben Self wrote:
> >
> [Snip]
> > The standard specifies that pthread_cond_wait() and
> > pthread_cond_timedwait() may have spurious wakeups.  The reason for this
> > is that a completely reliable once-and-only-once wakeup protocol can be
> > excessively expensive for some asymmetric multiprocessor systems.
> 
> Well, maybe I'm being a bit anal about this, but this
> really isn't the case. If it was, then you'd have the
> same issue for mutexes as well, and the standard does
> not allow for spurious wakes on mutexes.

[Bil: Actually, you DO get spurious (define "spurious"!) wakeups with mutexes,
YOU just never see them.]
 
> The "while(predicate)/wait" construct is very common
> in concurrent environments (regardless of their symmetry).
> The reason is that since the environment is highly
> unpredictable, while you were coming back out of the
> wait, the state of the thing that you were waiting for
> may have changed.
> 
> This construct is used to implement mutexes as well;
> it's just that you don't see it since the predicate
> is known; it is the state of the mutex lock. CVs force
> the construct into the user code because the predicate
> is not known to the implementor of the cv itself. The
> warnings about spurious wakes are taken seriously when
> mutexes are implemented, and are accounted for in the
> exact same "while(predicate)/wait" construct.
> 
> Wake-only-once doesn't really help. It will remove the
> addition of spurious wakes, but it won't account for
> "valid wake, but the predicate changed". Implimenting
> wake-only-once is expensive when you consider that this
> solution solves both problems.
> 
> Also note that the mutex lock around the predicate
> doesn't solve this problem either. There is a race
> that starts once you see the wake and ends once you
> reacquire the mutex. In that time, another thread can
> get the mutex and change the data (believe me, it
> happens - more often than you'd expect). When you
> reacquire the mutex, and exit the wait, the predicate
> has changed and you'll need to go back to waiting.
> 
> Now, a wake-only-once,-and-gimme-that-mutex atomic
> operation might be nice.
> 
> Brian.

I am not reposting to be defensive or argumentative.  Upon reflection,
however, I have come to the conclusion that neither I nor subsequent
posters have really dealt with the original poster's question, let alone
the new topics that we have thrown about.

Since this is a response largely to Brian Silver's post, a person I have
a good deal of respect for, I have chosen to include some quotes from
Dave Butenhof's book, Programming with POSIX Threads, because we both
know it and have a mutual admiration for his work.

First off, and most importantly, the original question I attempted to
answer was:

Fred A. Kulack wrote:
> A side question for the rest of the group...
> All the applications I've seen use a delta time for the wait and must
> calculate the delta each time a cond_timedwait is done. What's the rationale
> for the Posix functions using an ABSOLUTE time?

I was hoping for an uncomplicated answer and the spurious wakeup issue
seemed to fit the bill.  My writing and thinking however was too
simplistic to provide any meaningful insight.  So I will try again. 
Please realize that some of what you will read below is purely personal
supposition.  Chime in if I have misinformed.

1)  Spurious wakeup is a reason for passing an absolute value to
cond_timedwait.  It is not the reason or even a particularly important
reason.  The standard (POSIX 1003.1c-1995) specifically states that a
compliant implementation of pthread_cond_timedwait() may suffer from
spurious wakeups.  It therefore is reasonable to use an absolute
timeout value instead of a delta to simplify the act of retrying.

2)  More importantly it is also very likely a performance issue.  Most
systems when scheduling a software interrupt use an absolute value that
reflects an offset into the OS's epoch.  To constantly be re-evaluating
a delta in user code is excessively expensive especially if most systems
really want an absolute value anyway.

3)  Also, there is the reality that the structure timespec is the
high-resolution time value of choice in POSIX.  And timespec happens to
represent its time as absolute time.  Add to that the needs of the
powerful realtime group that had a great impact on the shape of POSIX
1003.1c.  What integral unit would we use for a delta anyway? And would
it be in nanoseconds?  Eeak!

4)  Most importantly, one would hope that the interface was constructed
to promote good coding techniques.  As Brian Silver stated, the
"while(predicate)/wait" idiom is an important technique for far more
reasons than just spurious wakeups.  Using an absolute timeout value
rather than a delta directly supports this idiom by making the retry
easier to write.

When I originally brought up the "while(predicate)/wait" idiom it was
because spurious wakeups would necessitate retrying the predicate.  I
did not intend to state that this was the only or even a particularly
important reason for the pattern.  The "while(predicate)/wait"
idiom, or an equivalent, is essential to programming with condition
variables.

1)  Most important is the reason Brian Silver stated: "There is a race
that starts once you see the wake and ends once you reacquire the
mutex."  It would be difficult and detrimental to concurrency to
construct through synchronization a situation that did not require
re-testing of the predicate after a wakeup.  This is why Brian's magic
bullet "wake-only-once,-and-gimme-that-mutex atomic" does not exist. 
Although it would be nice.

2)  Spurious wakeups do exist. Be consoled by the fact that "The race
condition that causes spurious wakeups should be considered rare.
[Butenhof]"

3)  Also, it enables a powerful technique that I have been using for
several years with great success, which Dave Butenhof refers to as "loose
predicates".  "For a lot of reasons it is often easy and convenient to
use approximations of actual state.  For example, 'there may be work'
instead of 'there is work'."  I will go one step beyond that: in my
experience coding distributed web servers, there are situations when
the notification mechanism cannot know with certainty that there is work
without actually having performed the entirety of the task itself.  Often
the best a distributed component has to work with is trends and
potentialities.

Lastly (whew ;), I believe that I have overstated the significance of
the performance implications of only-once wakeups.  "Excessively
expensive" is a bit strong without further qualification.  If it were
such a paramount issue, Brian Silver is right: mutexes would suffer from
the same restrictions, and they absolutely do not.

There is a performance issue that I have run across many times and have
seen cited in many references, including: "Spurious wakeups may sound
strange but on some multiprocessor systems, making condition wakeup
completely predictable might substantially slow all condition variable
operations. [Butenhof]"  Nevertheless, the fact remains that making
wakeup completely predictable does not get you that much.  You still
need to retest your predicate.  In the end, the retest is such an easy
and cheap thing when taken in the context of the overhead of the
synchronization and the latency of the wait.

--ben


-----------
Ben R. Self
[email protected]

www.opentext.com
Open Text Corporation -- Home of Livelink Intranet
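
[A minimal sketch of the "while(predicate)/wait" idiom with an absolute
timeout, as discussed above; the predicate and the 5-second delta are
illustrative only:]

#include <pthread.h>
#include <sys/time.h>
#include <errno.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
int there_may_be_work = 0;

int wait_for_work(void)
{
    struct timeval now;
    struct timespec abstime;
    int got_it;

    gettimeofday(&now, NULL);
    abstime.tv_sec  = now.tv_sec + 5;        /* delta computed once...     */
    abstime.tv_nsec = now.tv_usec * 1000;    /* ...into an absolute time   */

    pthread_mutex_lock(&lock);
    while (!there_may_be_work) {             /* retest on every wakeup     */
        int err = pthread_cond_timedwait(&cv, &lock, &abstime);
        if (err == ETIMEDOUT)
            break;                           /* same abstime on each retry */
    }
    got_it = there_may_be_work;
    there_may_be_work = 0;
    pthread_mutex_unlock(&lock);
    return got_it;
}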

        More on spurious wakeups

It is so because implementations can sometimes not avoid inserting
these spurious wakeups; it might be costly to prevent them.

Perhaps more importantly, your own program's logic can introduce spurious
wakeups which cannot be eliminated. This can start happening as soon as there
are more than two threads.

You see, a condition waiting thread which has been signaled may have to compete
with another thread in order to re-acquire the mutex.  If that other thread
gets the mutex first, it can change the predicate, so that when finally the
original thread acquires it, the predicate is false.

This is also a spurious wakeup, for all purposes.  To make this form of
spurious wakeup go away, the semantics of condition variables would have to
change in troublesome ways, back to the original monitors and conditions
concept introduced by Quicksort father C. A. R. Hoare. Under Hoare's monitors
and conditions signaling a condition would atomically transfer the monitor to
the first task waiting on the condition, so that woken task could just assume
that the predicate is true:  if (!predicate()) wait(&condition); /* okay */

The very useful broadcast operation does not quite fit into Hoare's model, for
obvious reasons; the signaler can choose only one task to become the
next monitor owner.  

Also, such atomic transfers of lock ownership are wasteful, especially on a
multiprocessor; the ownership transfer spans an entire context switch from one
task to another, during which that lock is not available to other tasks.
The switch can take thousands of cycles, inflating the length of a small
critical region hundreds of times!

Lastly, a problem with Hoare's approach is that a ``clique'' of tasks can form
which bounce ownership of the monitor among themselves, not allowing any other
task entry into the monitor.  No reliable provision can be made for
priority-based entry into the monitor, because the signal operation implicitly
ignores such priority; at best it can choose the highest priority thread that
is waiting on the condition, which ignores tasks that are waiting to get into
the monitor.  In the POSIX model, a condition variable signal merely wakes up a
thread, making it runnable. The scheduling policy will effectively decide
fairness, by selecting who gets to run from among runnable threads. Waking up
of threads waiting on monitors and conditions is done in priority order also,
depending on the scheduling policy.
  
> You know, I wonder if the designers of pthreads used logic like this:
> users of condition variables have to check the condition on exit anyway,
> so we will not be placing any additional burden on them if we allow
> spurious wakeups; and since it is conceivable that allowing spurious
> wakeups could make an implementation faster, it can only help if we
> allow them.
>
> They may not have had any particular implementation in mind.

You're actually not far off at all, except you didn't push it far enough.

The intent was to force correct/robust code by requiring predicate loops. This was
driven by the provably correct academic contingent among the "core threadies" in
the working group, though I don't think anyone really disagreed with the intent
once they understood what it meant.

We followed that intent with several levels of justification. The first was that
"religiously" using a loop protects the application against its own imperfect
coding practices. The second was that it wasn't difficult to abstractly imagine
machines and implementation code that could exploit this requirement to improve
the performance of average condition wait operations through optimizing the
synchronization mechanisms.

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/

=================================TOP===============================
 Q95: Conformance with POSIX 1003.1c vs. POSIX 1003.4a? 

christof ameye we41 xxxx wrote:

> Some pthread libraries talk about conformance with POSIX 1003.1c and
> others of conformance with POSIX 1003.4a. What are the
> differences/similarities ?
> I don't really need the answer, but it might be interesting to know ...

First off, "conformance" to 1003.4a is a completely meaningless
statement. There's no such thing, because 1003.4a was never a standard.
Furthermore, I strongly doubt that there ever *were* any implementations
that could conform even if there was a way to apply meaning to the
statement.

1003.4a was the original name of the thread standard -- named for the
fact that it was developed by the realtime group (POSIX working group
designation 1003.4). However, it, like the original realtime standard,
was really an amendment to the "base standard", 1003.1. Eventually,
POSIX decided to resolve the confusion by creating more -- and renaming
all O/S API standards into the 1003.1 space. Thus, 1003.4 became
1003.1b, 1003.4a became 1003.1c, 1003.4b became 1003.1d, and so forth.

There were various draft versions of the standard while it was still
named 1003.4a, but all are substantially different from the actual
standard, and to none of them, technically, can any implementation
"conform". The most common draft is 4, which was the (loose) basis for
the "DCE thread" api, part of The Open Group's DCE suite. There's at
least one freeware implementation that claimed to be roughly draft 6.
IBM's AIX operating system provides a draft 7 implementation. The
"pthread" interface on Solaris 2.4 was draft 8. There is also at least
one implementation which I've seen claiming to be "draft 10". Draft 10
was the final draft, which was accepted by the IEEE standards board
and by ISO/IEC with only "minor" editorial changes. Nevertheless, draft
10 is NOT the standard, and, technically, one cannot "conform" to it.
"Draft 10" and "1003.1c-1995" are NOT interchangeable terms.

Finally, because 1003.1c-1995 was never published as a separate
document, the official reference is the 1003.1-1996 standard, which
includes 1003.1b-1993 (realtime), 1003.1c-1995 (threads), and
1003.1i-1995 (corrections to the realtime amendment).

In terms of someone writing programs, a lot of that is irrelevant. But
you need to be aware that there's no real definition of "conformance"
for any drafts, so one vendor's "draft 4" is not necessarily the same as
another's "draft 4", and, while that might be inconvenient for you,
there's nothing "wrong" with it. (Although, from the POSIX standards
point of view, it was foolish and irresponsible of "them" [by which I
really mean "us", since I wrote the most common draft 4 implementation,
the original DCE threads reference library ;-) ] to use the "pthread"
prefix at all.)

There are MANY differences between "draft 4" and standard POSIX threads.
There are many (though slightly fewer) differences between draft 7 or 8
and the standard. There are even some differences between draft 10 and
the standard. Look at a move between any two drafts, or between any
draft and the standard, as a PORT to an entirely new threading library
that has some similarities. Be very careful of the details, especially
where things "appear to be the same". And, if you're stuck with a draft
implementation, lobby the vendor to provide a full conforming POSIX
implementation!

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698       http://www.awl.com/cp/butenhof/posix.html |
\-----------------[ Better Living Through Concurrency ]----------------/

=================================TOP===============================
 Q96: Cleaning up when kill signal is sent to the thread.? 


> I'm writing a multi-threaded daemon, which requires some cleanup if a
> kill signal is sent to the thread.  I want just the thread that received
> the signal to exit.
> 
> The platform is Linux 2.0, libc 5.4.23, linuxthreads 0.6 (99% POSIX
> threads).
> 
> The docs indicate that threads share signal functions, but can
> individually block or accept certain signals.  This is workable -- but
> how do I get the thread id of the thread that received the signal?
> 
> And my next question, how portable are thread cleanup routines?
> 
> Thanks,
> 
> Jeff Garzik                                        Quality news feeds
> News Administrator                     INN Technical info, Consulting
> Spinne, Inc.                            http://www.spinne.com/usenet/

Jeff,

  From the sounds of what you say, the answer is "No."  :-)

  Meaning, don't do that.  There's a better method, cancellation.
If you really want the thread to exit asynchronously, that's the
way to do it.

  Now, it is even more likely that a simple polling routine will
do the job, and that would be even easier to write.

-Bil
(There's a nice little cancellation example on the web page below.)
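
[A minimal sketch of the cancellation approach, with a cleanup handler
doing the per-thread cleanup; the malloc'd buffer stands in for whatever
the daemon thread really needs to release:]

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void cleanup(void *arg)
{
    free(arg);                          /* whatever cleanup the thread needs */
    printf("worker cleaned up\n");
}

void *worker(void *arg)
{
    void *buf = malloc(1024);
    pthread_cleanup_push(cleanup, buf);
    for (;;) {
        /* do the service work; sleep() is a cancellation point */
        sleep(1);
    }
    pthread_cleanup_pop(1);             /* never reached, but must pair up */
    return NULL;
}

/* Instead of sending the thread a kill signal, the controlling code does: */
void stop(pthread_t tid)
{
    pthread_cancel(tid);
    pthread_join(tid, NULL);
}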

=================================TOP===============================
 Q97: C++ new/delete replacement that is thread safe and fast? 

> [email protected] (Bob Pearson) writes:
> 
> > > Our platform is Solaris 2.5.1 and I am looking for a commercial, freeware
> > or shareware C++ new/delete replacement that is thread safe and uses more
> > than a single mutex.  We have a multi-threaded application that is using a
> > tremendous amount of new operators and is huge (>200MB) and is constantly
> > running into very high mutex contention due to the single mutex for new in
> > libC.a from:
> >
> >       SUNWSpro "CC: SC4.0 18 Oct 1995 C++ 4.1".
> 
> You might want to check out ptmalloc:
> 
>    ftp://ftp.dent.med.uni-muenchen.de/pub/wmglo/ptmalloc.tar.gz
> 
> I would hope that operator new somehow invokes malloc at a lower
> level.  If not, you would have to write a small wrapper -- there
> should be one coming with gcc that you could use.
> 
> Hope this helps,
> Wolfram.
Wolfram Gloger 

=================================TOP===============================
 Q98: beginthread() vs. endthread() vs. CreateThread? (Win32) 

[Bil: Look at the description in "Multithreading Applications in Win32" (see
books.html)]

Mark A. Crampton wrote:
> 
> Juanra wrote:
> >
> > I'm a Windows 95 programmer and I'm developing a multithreaded
> > server-side application. I use the CreateThread API to create a new
> > thread whenever a connection request comes. I've read that it's better
> > to use beginthread() and endthread() instead of CreateThread because
> > they initialize the run time libraries. What happens with win32
> > CreateThread function?. Doesn't it work properly?. If not, I can't use
> > beinthread because I can't create my thread in a suspended mode and
> > release it after.
> >
> > Does the function beginthreadNT() work under win95?
> 
> No
> 
> >
> > Thanks in advance.
> > Juan Ra.
> 
> Answer to beginthread - use _beginthreadex, which uses the same args as
> CreateThread (you can create suspended).  _beginthreadex works on 95 &
> NT but not Win32S.  The privilege flags are ignored under 95.
> 
> CreateThread _works_ OK - it just doesn't free memory allocated on the C
> run-time library stack when the thread exits.  So you can attempt to
> clean up the runtime library stack, use _beginthreadex, or not use any C
> run time library calls.
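
[A minimal sketch of the _beginthreadex approach with a suspended start,
which is what the original poster wanted from CreateThread; the thread
routine and its argument are illustrative only:]

#include <windows.h>
#include <process.h>

unsigned __stdcall worker(void *arg)
{
    /* ... C run-time calls are safe in here ... */
    return 0;
}

HANDLE start_worker_suspended(void *arg)
{
    unsigned tid;
    /* Same arguments as CreateThread, but the CRT's per-thread state
       is set up for you and released when the thread ends. */
    HANDLE h = (HANDLE)_beginthreadex(NULL, 0, worker, arg,
                                      CREATE_SUSPENDED, &tid);
    if (h != NULL)
        ResumeThread(h);                /* release it when you're ready */
    return h;
}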

=================================TOP===============================
 Q99: Using pthread_yield()? 

Johann Leichtl wrote:
> 
> if i have some code like:
> 
>         ..
>         pthread_yield()
>         something(e.g. lock mutex)
>         ..
> 
> is it guaranteed that the thread will give up the cpu before getting the
> lock or not.

First off, to clarify, (you probably already know this, given the set of
names in your subject line), "pthread_yield" is an obsolete DCE thread
interface, not part of POSIX threads. As such, it is not covered by any
formal standard, and has no real portability guarantees. The way it
works on your particular DCE thread system is probably the way the
developers wanted it to work on that system, and if you disagree there's
no "higher authority" to which you might appeal.

POSIX specifies the behavior of sched_yield, (or, in fact, any
scheduling operation), only with respect to the defined realtime
scheduling policies, SCHED_FIFO and SCHED_RR. Threads running under one
of these policies that call sched_yield will release the CPU to any
thread (in SCHED_FIFO or SCHED_RR) running at the same priority. (There
cannot be any at a higher priority, since they would have preempted the
current thread immediately.)

Is that the same thing as "guaranteed [to] give up the cpu"? For one
thing, sched_yield won't do anything at all if there are no other
threads that are ready to run at the calling thread's priority; it'll
just return.

If you have threads with non-standard scheduling policies, such as
SCHED_OTHER, or a hypothetical SCHED_TIMESHARE, POSIX says nothing about
the behavior of sched_yield. Most likely (and at least in Digital's
implementation), the function will do the same thing. It doesn't really
worry about scheduling POLICY, only PRIORITY. Note that, because
SCHED_OTHER doesn't necessarily imply preemptive scheduling, you might
actually have a thread "ready to run" at a higher priority than the
current thread's priority. Also, because non-realtime policies aren't
necessarily "strictly priority ordered", and the system generally wants
to simulate some sort of fairness in timeshare scheduling, it is
possible (at least, "not ruled out by the standard") that a call to
sched_yield from a non-realtime thread might yield to a thread with
lower priority -- especially if that other thread is realtime.

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698       http://www.awl.com/cp/butenhof/posix.html |
\-----------------[ Better Living Through Concurrency ]----------------/
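
[For portable code the POSIX call is sched_yield(); a minimal sketch of
yielding from a compute loop, which only has pinned-down semantics when
the thread runs under SCHED_FIFO or SCHED_RR:]

#include <pthread.h>
#include <sched.h>

void *spinner(void *arg)
{
    for (;;) {
        /* ... do a short unit of work ... */
        sched_yield();   /* let equal-priority SCHED_FIFO/RR threads run */
    }
    return arg;          /* not reached */
}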

=================================TOP===============================
 Q100: Why does pthread_cond_wait() reacquire the mutex prior to being cancelled? 

Firstly, thanks to all of you who responded to my post in
comp.programming.threads.  As I alluded to in my posting, I felt quite sure
the problem I was experiencing was one due to misunderstanding.  I
immediately suspected this when my program exhibited the same behaviour
under HP-UX *and* Solaris.

Everyone told me the same thing: use a cleanup handler, pushed onto the
cleanup handler stack for the active thread because pthread_cond_wait
*reacquires* the mutex when it is cancelled.  I can see how this causes
other threads waiting on the same condition variable to fail to be
cancelled, but for me, the $64,000 question is:

        Why does pthread_cond_wait reacquire the mutex prior to being cancelled?

This seems like madness to me.  We're _cancelling_ the thread, so we're no
longer interested in the value of the data we're testing.  Why acquire the
lock and immediately require us to use a cleanup handler?

There must be something more to this ;-)

Thanks,
Ben
---
Ben Elliston                    .====    E-mail: [email protected]
Compucat Research Pty Limited  /  ====.     Web: http://www.compucat.com.au
Canberra ACT Australia         .====  /
                                 ====.
Ben,

  You're not thinking hard enough!  It *has* to be like this.

>         Why does pthread_cond_wait reacquire the mutex prior to being cancelled?
> 
> This seems like madness to me.  We're _cancelling_ the thread, so we're no
> longer interested in the value of the data we're testing.  Why acquire the
> lock and immediately require us to use a cleanup handler?

A:  pthread_cond_wait(c, m);
B:  do some work...
C:  pthread_mutex_unlock(m);

  If cancellation happened while sleeping (at A) or while running (at B),
the same cleanup handler would run.  If the state of the mutex was DIFFERENT
at those locations, you'd be up the creek.  Right?

-Bil
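
[A minimal sketch of the cleanup-handler pattern this implies, so the
mutex gets released whether the thread is cancelled while sleeping at A
or while working at B; the predicate is illustrative only:]

#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
int predicate = 0;

void unlock_it(void *arg)
{
    pthread_mutex_unlock((pthread_mutex_t *)arg);
}

void *waiter(void *arg)
{
    pthread_mutex_lock(&m);
    pthread_cleanup_push(unlock_it, &m);
    while (!predicate)
        pthread_cond_wait(&c, &m);      /* A: cancellation point */
    /* B: do some work while still holding m */
    pthread_cleanup_pop(1);             /* C: pops the handler and unlocks m */
    return NULL;
}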
=================================TOP===============================
 Q101: HP-UX 10.30 and threads? 
Bryan Althaus wrote:
> 
> Jim Thomas ([email protected]) wrote:
> : In article <[email protected]> [email protected] (Bryan Althaus) writes:
> :
> : Bryan> This is actually becoming a really bad joke.  No HP people seem to want
> : Bryan> to talk about 10.30 though they will post on what compiler flags to
> : Bryan> use to compile an app using pthreads under 10.30!
> :
> : Bryan> And apparently once it does come out you must ask for it.  10.20 will be
> : Bryan> the shipping OS until HP-UX 11.0 comes out.
> :
> : Bryan> If anyone knows when 10.30 is shipping, please email me.  I'll respect
> : Bryan> your privacy and not repost.  We have a current need for threading,
> : Bryan> but since they must be spread out over multiple CPU's, kernel thread
> : Bryan> support is needed - hence HP-UX 10.30.
> :
> : I received an e-mail from "SOFTWARE UPDATE MANAGER" Saturday that says the
> : following.  Note especially the part about not for workstations :-(
> :
> : Jim
> :
> Thanks for the info Jim.  I received email from a kind soul at HP who
> basically explained the deal on 10.30 (it's meant for people/ISVs
> transitioning to HP-UX 11.0), but he was not sure when it would be out.
> That was all I needed to know. Based on this we will use 10.20 for
> our product roll-out, and when HP-UX 11.0 comes out maybe I'll revisit
> replacing the forking() code with threads.  As it turned out, on
> a two CPU machine, the forking() code actually worked nicely and
> basically was written as if I were using pthreads and wasn't really
> the big hack I thought it was going to be.  Of course each fork()
> costs an additional 70MB of memory! :)
> 
> Now if someone could let us know roughly the due date for HP-UX 11.0
> and maybe what we can look for in HP-UX 11.0.  Obviously it will have
> kernel threads with pthreads API, NFS PV3, Streams TCP/IP, and support
> both 32 and 64 bit environments. Will HP-UX 11.0 ship more Net friendly
> with a Java Virtual Machine?  Will the JVM be threaded? Will JRE be
> on all HP-UX 11.0 systems?  WebNFS support? WebServer? Browser?  Current
> OS's now come with all these goodies standard: http://www.sun.com/solaris/new

-- 
=================================TOP===============================
 Q102: Signals and threads are not suited to work together? 

Keith Smith wrote:
> 
> This is a question I posed to comp.realtime, but noticed that you have a
> discussion going on here....  Can you offer me any assistance?
> Here's the excerpt:
> 
> Shashi:
> 
> Based on your previous email (below), I have a couple of questions:
> 
> 1. If signals and threads are not suited to work together, what
> mechanism can/should be used to implement timing within a thread.  If I
> have two threads that performed autonomous time-based functions, I want
> to be able to have a per-thread timing mechanism.
> 
> 2. If the approach is to "block all signals on all threads and send various
> signals to the process - putting the onus on each thread to unblock the
> appropriate signals", how do we deal with other threads which may be
> interrupted by a blocked signal (e.g. a read() call that returns EINTR
> even when its thread blocks the offending signal)?  Isn't this a flaw?
> This requires a signal handler (wasteful) with the RESTART
> option specified.
> 
> It seems like a per-thread mechanism is needed... how does NT
> accomplish this?
> 
> ** I know I shouldn't be relying on a per-LWP signal, but how else can I
> accomplish what I am trying to do?
> 
> In message " timer's interrupting system calls... HELP", [email protected]
> writes:
> 
> >Hi,
> >Signals and threads are not suited to work together. Personally, I feel that
> >UNIX has a serious flaw in the sense that most blocking systems calls (e.g.
> >read, semop, msgrcv etc) do not take a timeout parameter. This forces
> >programmers to use alarms and signals to interrupt system calls. I have
> >worked with operating systems such as Mach, NT and other which do not suffer
> >from this problem. This makes porting single threaded applications from UNIX
> >(which rely on signals) to a multithreaded process architecture difficult.
> >
> >Even though I come from an UNIX background (Bell Labs in late 80's) I have
> >learnt the hard way that signals make program much more error prone. I have
> >worked extensively on Mach and NT and never saw a reason to use threads. As
> >far as POSIX.1c is concerned I think they did a favor to the users of threads
> >on UNIX by mandating that signals be a per-process resource. You have to
> >understand that LWP is more of a System R4 concept (same on Solaris) and not
> >a POSIX concept. Two level scheduling is not common on UNIX systems (those
> >who implement have yet to show a clear advantage of two level scheduling).
> >
> >I am sure that Dave Butenhof (frequent visitor to this newsgroup) would have
> >more insight as to why POSIX did not choose to implement signals on a
> >per-thread basis (or LWP as you say). I would advise that you should
> >rearchitect your application not to depend on per-thread (LWP) signals. I
> >feel you will be better off in the long run. Take care.
> >
> >Sincerely,
> >Shashi
> >
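
[A minimal sketch of the usual alternative to per-thread signals: block
the signals in every thread and dedicate one thread to sigwait(); the
choice of SIGALRM and the printf are illustrative only:]

#include <pthread.h>
#include <signal.h>
#include <stdio.h>

void *signal_catcher(void *arg)
{
    sigset_t *set = (sigset_t *)arg;
    int sig;

    for (;;) {
        sigwait(set, &sig);              /* synchronous: no handler needed */
        printf("got signal %d\n", sig);
        /* set a flag, post a semaphore, signal a condition variable, ... */
    }
    return NULL;
}

int main(void)
{
    sigset_t set;
    pthread_t tid;

    sigemptyset(&set);
    sigaddset(&set, SIGALRM);
    /* Block before creating any threads so every thread inherits the mask
       and the workers' read() calls never see EINTR from these signals. */
    pthread_sigmask(SIG_BLOCK, &set, NULL);

    pthread_create(&tid, NULL, signal_catcher, &set);
    /* ... create worker threads and do the real work here ... */
    pthread_join(tid, NULL);
    return 0;
}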

=================================TOP===============================
 Q102: Patches in IRIX 6.2 for pthreads support? 

Jeff A. Harrell wrote:
> 
> radha subramanian wrote:
> >
> > I heard that a set of patches have to be applied in IRIX 6.2
> > for pthreads support.  Could someone tell me which are these
> > patches ?
> 
>  1404 Irix 6.2 Posix 1003.1b man pages          List    123Kb   07/01/97
>  1645 IRIX 6.2 & 6.3 POSIX header file updates  List    41Kb    07/01/97
>  2000 Irix 6.2 Posix 1003.1b support modules    List    164Kb   07/01/97
>  2161 Pthread library fixes                     List    481Kb   07/01/97
> 
> The whole set is downloadable from:
> 
> 
> 
> A SurfZone password is required.


=================================TOP===============================
 Q104: Windows NT Fibers? 

Ramesh Shankar wrote:
> 
> Found some info. on Windows NT Fibers in "Advanced Windows." Just
> wanted to verify whether my understanding is correct.
> 
> - Is a (primitive) "many to one" thread scheduling model.
> - Fibre corresponds to Solaris "threads" (NT threads then correspond
> to Solaris LWP).
> - If a fibre blocks, the whole thread (LWP for us) blocks.
> - Not as sophisticated as Solaris threads.

  Kinda-sorta.  Certainly close enough.  My understanding is that fibers
were built especially for a couple of big clients and then snuck their way 
out.  As such, I would avoid using them like the plague.  I've read the
APIs and they scare me.

-Bil
            ------------------------

Jeffrey Richter, Advanced Windows, 3rd Ed., p.971 states that

"The fiber functions were added to the Win32 API to help companies
quickly port their existing UNIX server applications to Windows NT."

(blah)

The following sentences say that fibers are targeted to the
proprietary user level thread-like quirks some companies did for
whatever reason (ease of programming, performance).

To answer your question: fibers are not an integral part of any MS
application, and I can't imagine that they use them internally anywhere,
and thus they won't achieve the same stability. Does this argument weigh a bit
against their use in a new program?

Joerg

PS: Have you noticed that I managed to keep from flaming :-)

            -----------------
>> Fibers are BAD because they comprise a SECOND method of doing threading.
>> If you want threads, use threads. (All that co-routine stuff was
>> great. We don't need them any more.)
>
>There are two reasons for "threads" and things similar to threads.
>First, they're smaller than full blown processes and with faster
>context switching than with processes.  Second, they allow more fine
>grained concurrency.

I don't think you hit it quite on the head.  Threads allow a computation
to be decomposed into separately scheduled tasks, which has these advantages:
- the tasks can be run on separate processors.
- the tasks can be prioritized, so that a less important computation
  can, in response to an external event, be suspended to process that event.
- computation can occur in one thread, while waiting for an event, such as
  the completion of I/O
So it's all about improving performance parameters like overall run time,
or average response time, or real time response, and maximizing the
utilization of real resources like processors and peripherals.

>Originally, coprocesses (and tasks and light-weight-processes and
>threads) solved both goals quite well.  Then in the last decade or
>more, thread-like things started getting bigger and slower; ie,
>letting the kernel handle the context switching, making them work well
>with OS calls and standard libraries, signal handling, asynchronous
>I/O, etc.
>
>Fibers seem like just a return to the efficient/small type of task.
>The drawback to them seems just that they're only on Windows NT, so
>that even if you have a valid need for them the code won't even be
>portable to other Windows boxes.

If you take a thread, and then hack it into smaller units that the
operating system doesn't know about, these smaller units do not
realize the advantages I listed above. They are not scheduled on
separate processors, they cannot be dispatched in response to
real-time inputs, they cannot wait for I/O while computation occurs.

I did not list, as one of the advantages, the ability to change the
logical structure of the program by decomposing it into threads, like
eliminating the implementation of state machines by offloading some state
information into individual program counters and stacks.  To me, that is
a purely internal program design matter that doesn't make any externally
visible difference to parameters like the running time, throughput,
real-time response or average response.

It's also a programming language matter as well; a language with
continuations (e.g. Scheme) would have no need for these types of
sub-threads. In Scheme, a function return is done using a function
call to a previously saved continuation.  It's permitted to jump
into the continuation of a function that has already terminated;
the environment captured by the continuation is still available.
(Such captured environments are garbage collected when they become
unreachable).  To me, things like fibers seem like low-level hacks to
provide platform-specific coroutines or continuations to the C language,
whereas threads are a language-independent operating system feature.

>If Fibers are unnecessary because Threads exist, then why not say that
>Threads are unnecessary because Processes exist?
>  (Threading comprises
>a SECOND method of splitting work up into separate units of control)

This argument assumes that threads are to processes what processes
are to the system. However, according to one popular system model,
processes just become collections of resources that simply *have* one
or more independently scheduled control units. In this model, threads
are the only separate unit of control. A process that has one unit of
control is said to be single-threaded, rather than non-threaded.  Or,
under an alternative model exemplified by Linux, threads are just
collections of tasks that share resources in a certain way.  Two tasks
that don't share an address space, file table, etc are by convention
said to be in different processes. Again, there is just one method of
splitting work into units of control: the task.

=================================TOP===============================
 Q105: LWP migrating from one CPU to another in Solaris 2.5.1? 

Hi Magnus!

  Perhaps...
 
> Hi!
> 
> I've got a question about threads in Solaris 2.5.1, that I hope You can
> answer for me!
> 
> Short version:
> How does the algorithm work that causes an LWP to migrate from one CPU to
> another CPU in Solaris 2.5.1?

  The LWP gets context-switched off CPU 0.  When a different CPU becomes
available, the scheduler looks to see how many ticks have passed.  Solaris 2.5:
if less than 4, some other LWP (or none at all!) gets the CPU.  If > 3, then
just put the LWP on the new CPU.
 
> Longer version:
> I'm doing some research about a tool that I hope could be used by multi-thread
> programmers in order to find and possibly correct performance bottlenecks.
> Basically the tool works in three phases:
> 1) By running the multi-threaded program on a single processor we create a
>    trace which represents the behaviour of the program.
> 2) By simulating (or re-scheduling) the trace on a multi-processor we can tell
>    whether the program has the desired speed-up or not.
> 3) The simulated "execution" is displayed graphically in order to show where
>    the performance bottlenecks are.

  This sounds good...
 
> I've got a problem when simulating a program that hits a barrier.
> Assume that we, for instance, have 8 bound threads hitting the same barrier
> on a multiprocessor with 7 processors. Here the migration for an LWP from
> one CPU to another is very important. If we have no migration at all the speed
> up will be 4 compared to a single processor.
> On the other hand, if we have full migration, the speed up will be (almost) 7
> if we neglect the impact of cache-misses.

  Of course said $ misses are a BIG deal.

  None-the-less...  I *think* this will happen on Solaris 2.5:

  The first 7 wake up and run for 1 tick (10ms).  The 7 drop 10 points of
priority.  T8 then gets CPU 7, while T1 - T6 run another tick.  They drop 10
points.  T7 wants CPU 7 and will get it from T8.  Now the time slice increases
because we're near the bottom of the priority scale.  Everybody runs for 10
ticks.  From here on out, one thread will migrate around while the others 
keep their CPUs.  I think.

  Of course you'd avoid writing a program that put 8 CPU-bound threads on 7
CPUs...

-Bil
=================================TOP===============================
 Q106: What conditions would cause that thread to disappear? 

William,

> I have a service thread which enters a never-exiting service loop via
> a while(1).  What conditions would cause that thread to disappear?

  You tell it to.  Either return(), pthread_exit(), or pthread_cancel().
That's the only way out.

> It can't be just returning off the end because of the while(1).  Past
> experience has indicated to me that if a single thread causes a
> exception such as a SEGV that the entire process is killed.  Are there
> known conditions which cause just the thread to exit without
> interfering with the rest of the process?

  You're right.  SEGV etc. kill the process (unless you replace the
signal handler).


> I suspect there's stack corruption in this thread, but I would have
> expected such corruption to take the form of a SEGV or something
> similar.  I'm very surprised that just the thread exited leaving
> everything else (seemingly) intact.

  So...  you have a problem.  I *expect* that you'll find the place
where the thread's exiting and it'll be something you wrote.  (The
other option is a library bug.  Always possible (if unlikely).)

  I'm disappointed to see that a breakpoint in pthread_exit() doesn't
get called in the Sun debugger.  Moreover, you don't even get to 
see the stack from the cleanup handlers!  (I'm making this a bug
report.)  I notice that from TSD destructors you at least get to
see a bit of the call stack.

  So...  I'd suggest this:  Declare some TSD, put a breakpoint in
the destructor, and see what happens when your thread exits.  Try
out the bit of code below.

  How does this work on other platforms?

/*
cc -o tmp1 tmp1.c  -g -lpthread
*/

#define _POSIX_C_SOURCE 199506L
#include <pthread.h>
#define NULL 0

pthread_attr_t  attr;
pthread_t   thread;
pthread_key_t   key;

void destroyer(void *arg)
{pthread_t tid = pthread_self();
 printf("T@%d in TSD destructor.\n", tid);
}


void cleanup(void *arg)
{pthread_t tid = pthread_self();
 printf("T@%d in cleanup handler.\n", tid);
}

void search_sub2()
{ 
 pthread_exit(NULL);        /* Surprise exit -- the one you forgot about */
}

void search_sub1()
{ 
 search_sub2();     /* do work */
}


void *search(void *arg)
{
  pthread_setspecific(key, (void *) 1234);   /* NEED A VALUE! */
 pthread_cleanup_push(cleanup, NULL);
 search_sub1();     /* do work */
 pthread_cleanup_pop(1);
 pthread_exit(NULL);
}


main()
{
 pthread_key_create(&key, destroyer);
 pthread_attr_init(&attr);
 pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
 pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE); /* Also default */

 pthread_create(&thread, &attr, search, NULL);

 pthread_exit(NULL);
}



=================================TOP===============================
 Q107: What parts, if any, of the STL are thread-safe? 


Matt Austern wrote:
> 
> Boris Goldberg  writes:
> 
> > > >I'm finding a memory leak in the string deallocate() (on the call to
> > > >impl_->deallocate()) under heavy thread load, and it brings up a
> > > >frightening question:
> > >
> > > >What parts, if any, of the STL are thread-safe?
> > >
> > 
> > STL thread safety is implementation-dependent. Check with
> > your vendor. Many implementations are not thread-safe.
> 
> One other important point: "thread safety" means different things to
> different people.  Programming with threads always involves some
> cooperation between the language/library and the programmer; the
> crucial question is exactly what the programmer has to do in order to
> get well-defined behavior.
> 
> See http://www.sgi.com/Technology/STL/thread_safety.html for an
> example of an STL thread-safety policy.  It's not the only conceivable
> threading policy, but, as the document says, it is "what we believe to
> be the most useful form of thread-safety."

=================================TOP===============================
 Q108: Do pthreads libraries support cooperative threads? 


Paul Bandler wrote:
> 
> Bryan O'Sullivan wrote:
> >
> > p> Thanks for those who have sent some interesting replies (although
> > p> no-one seems to think it's a good idea to not go all the way with
> > p> pre-emptive pthreads).
> >
> > This is because you can't go halfway.  Either you use pthreads in a
> > fully safe manner, or your code breaks horribly at some point on some
> > platform.
> 
> OK, so you would disagree with the postings below from Frank Mueller and
> David Butenhof in July that indicate it is possible (if inadvisable)?
> 
> Frank Mueller wrote:
> >
> >[email protected] (Schumacher Raphael, GD-FE64) >writes:
> >[deleted...]
> > > 1) Do pthreads libraries support cooperative threads?
> >
> > In a way, somewhat. Use FIFO_SCHED and create all threads at the same priority level,
> > and a thread will only give up control on a blocking operation, e.g. yield, cond_wait,
> > mutex_lock and (if thread-blocking is supported) maybe on blocking I/O (read, write, accept...)
> >
> > This may be close enough to what you want. Short of this, you probably need your coop_yield, yes.
> 
> On HP-UX, at least until 10.30 (which introduces kernel thread support),
> the SCHED_FIFO [note, not "FIFO_SCHED"] scheduling policy workaround
> might work for you, because your threads won't face multiprocessor
> scheduling. I wouldn't recommend it, though -- and of course it won't
> come even close to working on any multiprocessor system that supports
> SMP threads (Solaris, Digital UNIX, IRIX, or even the AIX draft 7
> threads). If you're interested in thread-safety, go for thread-safety.
> While it might be nice to give yourself the early illusion that your
> known unsafe code is running, that illusion could be dangerous later! If
> you've got a real need to run the software sooner than you can convert
> it, you're likely to run into other problems (such as the order in which
> threads run?) If you don't have an immediate need, why look for
> shortcuts that you know are only temporary?
> 
> If you really want to build a "cooperative scheduling" package for your
> threads, (and again, I don't recommend it!), build your own. It's not
> that hard. Inactive threads just block themselves on a condition
> variable until requested to run by some other thread (which signals the
> condition variable and then blocks itself on the same, or another,
> condition variable).
> 
> The "1)" in the original mail implies the first item of a list, but my
> news server has chosen, in its infinitesimal wisdom, not to reveal the
> original post to me. So perhaps I'll have more to say should it repent!
>  
> /---------------------------[ Dave Butenhof ]--------------------------\
> | Digital Equipment Corporation                   [email protected] |
> | 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
> | Nashua NH 03062-2698       http://www.awl.com/cp/butenhof/posix.html |
> \-----------------[ Better Living Through Concurrency ]----------------/
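
[A minimal sketch of the hand-rolled "cooperative scheduling" Dave describes
above. The baton-passing scheme and all names are illustrative, not from the
original post: each thread blocks on a condition variable until another
thread hands it the baton.]

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 3

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  turn = PTHREAD_COND_INITIALIZER;
static int baton = 0;                    /* which thread may run now */

void *worker(void *arg)
{
    int me = (int)(long)arg;
    int i;

    for (i = 0; i < 5; i++) {
        pthread_mutex_lock(&lock);
        while (baton != me)              /* block until it's our turn */
            pthread_cond_wait(&turn, &lock);
        printf("thread %d running\n", me);
        baton = (me + 1) % NTHREADS;     /* hand the baton to the next thread */
        pthread_cond_broadcast(&turn);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(long)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}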


=================================TOP===============================
 Q109: Can I avoid mutexes by using globals? 

> > j> Now, I have implemented this without using a synchronization
> > j> mechanism for the integer. Since I have only one writer, and
> > j> multiple readers, and the datum is a simple integer, I believe I
> > j> can get away with this.

> But on the other hand why not do it correctly with locks? Locks
> will make the code easier to maintain because it will be somewhat
> self documenting (the call to rwlock() should give most programmers
> a clue) and it will be more robust. In my experience threaded
> programs are more fragile and more difficult to debug than single
> threaded programs. It is a good idea to keep thread synchronization
> as controlled as you can; this will make debugging simpler.

Remember that sign in "The Wizard of Oz"?

  "I'd go back if I were you."




When you port your program to the next OS or platform and a new
bug appears...  Could it be caused by your hack?  Won't it be
fun trying to guess with each new bug?  How will you prove to yourself
that the bug is elsewhere?

And that guy who maintains your code, won't he have fun?




That's three opinions...

"Just the place for a snark! I have said it three times.
And what I say thrice is true!"

-Bil
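
[For the record, a small sketch of the "do it correctly with locks" approach
recommended above, using a UNIX98 reader/writer lock for the one-writer,
many-readers case. The function and variable names are made up for
illustration.]

#include <pthread.h>

static pthread_rwlock_t value_lock;      /* UNIX98 reader/writer lock */
static int shared_value = 0;

void value_init(void)
{
    pthread_rwlock_init(&value_lock, NULL);
}

void set_value(int v)                    /* the single writer */
{
    pthread_rwlock_wrlock(&value_lock);
    shared_value = v;
    pthread_rwlock_unlock(&value_lock);
}

int get_value(void)                      /* any number of readers */
{
    int v;

    pthread_rwlock_rdlock(&value_lock);
    v = shared_value;
    pthread_rwlock_unlock(&value_lock);
    return v;
}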

=================================TOP===============================
 Q110: Aborting an MT Sybase SQL? 


Bryan Althaus wrote:
> 
> Jim Phillips  wrote:
> : We are using Sybase on Solaris for a database application.
> 
> : We are trying to abort a query that turns out to be long running by
> : using POSIX pthread() in a manner which allows a calling X-Windows
> : program to interrupt a fetch loop and cancel the query.
> 
> : At runtime, it works OK sometimes and sometimes it doesn't.  If we
> : attempt to start a new query too soon after canceling the long running
> : query, we get the remains of the old query's result set.  If we wait a
> : couple of minutes before starting the new query,  then it works fine and
> : the new query's expected result set is returned.
> 
> : We are using ESQL and Solaris 2.5 C compiler to build the Sybase SQL
> : Server 11.0.2 program interface.
> 
> : I have heard a rumor that you cannot use pthread with some X-11
> : versions?
> 
> : Anybody out there have any ideas, thoughts, comments or critique(s).
> 
> Having come from a Sybase OpenServer class, I can tell you that you can't
> "cancel" a query.  Once it is sent to Sybase, the query will run till
> completion.  There is no way to stop Sybase from executing the entire
> query.
> 
> I'm not saying that's your problem, just that I notice you say if you
> wait a couple of minutes before starting the new query all is fine.
> Obviously by then the old query has finished.
> 
> I haven't used ESQL in years (use C++/DBtools.h++) so I don't know how your
> connection to Sybase is done. When you "cancel" do you close the old connection,
> and then open up a new connection when you start a new query?
> 
> Just sounds like you may be using the same connection and getting back the
> first query to finish, which would be the "cancelled" one.
> 
> You might try comp.databases.sybase if this theory is at all likely.
> 
> In any case I'd be interested in what the problem turns out to be.
> 
> Good luck,
> Bryan Althaus
> 


=================================TOP===============================
 Q111: Other MT tools? 


The Etch group at UW and Harvard is distributing some program
development and analysis tools based on the Etch Binary
Rewriting Engine for NT/x86 boxes.

The first of these tools is a call graph profiler that helps 
programmers understand where time is going in their program.
The Etch Call Graph Profiler works on Win32 apps built by most
compilers, does not require source, works with multithreaded
programs, understands program and system DLLs, and is designed
to run reasonably fast even if your machine only has an 'average'
configuration. We've used it on serious programs ranging from
Microsoft's SQL server to Lotus Wordpro, and even on Direct Draw
games like Monster Truck Madness.

If you'd like to give our profiler a try, you can download it from:

    http://etch.cs.washington.edu

Follow the link to Download Etch Tools.

Part of our motivation for distributing these tools is to get
some feedback about what we've built and what we should be
building. So please give our tools a try, and send mail to
[email protected] with your comments and
suggestions.

Enjoy!

The Etch Group

    Ted Romer
    Wayne Wong
    Alec Wolman
    Geoff Voelker
    Dennis Lee
    Ville Aikas
    Brad Chen
    Hank Levy
    Brian Bershad
                    

=================================TOP===============================
 Q112: That's not a book. That's a pamphlet! 

Brian Silver wrote:
> Ben Self wrote:
> > 
> [Snip]
> > Since this is a response largely to Brian Silver's post, a person I have
> > a good deal of respect for, I have chosen to include some quotes form
> > Dave Butenhof's book, Programming with POSIX Threads, because we both
> > know it and have a mutual admiration for his work.
> 
> Flattery will get you everywhere!
> 
> (But what makes you think I admire Dave's work!? His code sure is
> formatted nice, though!  He also puts useful comments in
> his code. Except for the thd_suspend implementation in his book.)

Brian, after all, likes to tell me that I've published a "pamphlet",
because Addison-Wesley chose to use a soft cover, and "books" have hard
covers. For those who may not have figured all this out, by the way,
Brian's office is diagonally across from mine -- he's currently a
contractor working on Digital's threads library.

As for the suspend example in the book... 'tis true, it is not well
documented. Of course, Brian should have been a little more cautious,
since, as a footnote documents for posterity, the example is Brian's
code, essentially as he wrote it. (And, as Brian just admitted, he
doesn't tend to comment as well as I do. ;-) )


Dave Butenhof 

=================================TOP===============================
 Q113: Using recursive mutexes and condition variables? 


I have a question regarding recursive mutexes and condition variables.

Given a mutex created with one of the following attributes:

DCE threads
    pthread_mutexattr_setkind_np( &attr, MUTEX_RECURSIVE_NP );

X/Open XSH5 (UNIX98)
   pthread_mutexattr_settype( &attr, PTHREAD_MUTEX_RECURSIVE );

What exactly is the behavior of a pthread_cond_wait() and what effect do
"nested" locks have on this behavior?

Does mixing recursive locks and condition variables make any sense?

This is largely academic.  However, I (like everyone else in known
space/time) maintain an OO abstraction of a portable subset of Pthreads
and would like to know the appropriate semantics.  

Since the advent of the framework (about 3 years ago) I have managed to
avoid using recursive mutexes.  Unfortunately, my back may be against
the wall on a few new projects and I may be forced to use them.  They
seem to be a real pain.

And yes I promise this concludes my postings on cvs for the foreseeable
future.


thanks,


--ben
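
[For reference, a minimal sketch of initializing a recursive mutex via the
XSH5 (UNIX98) attribute mentioned above; error checking is omitted and the
names are illustrative. As Dave notes in Q119, waiting on a condition
variable with a recursively locked mutex simply won't work.]

#include <pthread.h>

pthread_mutex_t recursive_lock;

void init_recursive_lock(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&recursive_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}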

=================================TOP===============================
 Q114: How to cleanup TSD in Win32? 

>I am forced to use TSD in multithreading existing code. I just ran
>into the problem that, while the destructor function argument to
>pthread_key_create() and thr_keycreate() appears ideal, there is no
>such facility with the NT TlsAlloc() and that related stuff.

It's pretty easy to write the code to provide this facility. Basically
you have to wrap TlsAlloc() and TlsFree() in some additional logic.
This logic maintains a list of all the keys currently allocated for
your process. For each allocated key it stores the address of the
destructor routine for that key (supplied as an argument to the
wrapped version of TlsAlloc()). When a thread exits it iterates
through these records; for every key that references valid data (i.e.
TlsGetValue() returns non-NULL) call the relevant destructor routine,
supplying the address of the thread specific data as an argument.

The tricky part to all this is figuring out when to call the routine
that invokes the destructors. If you have complete control over all
your threads then you can make sure that it happens in the right
place. If, on the other hand, you are writing a DLL and you do not
have control over thread creation/termination this whole approach gets
rather messy. You can do most of the right stuff in the
DLL_THREAD_DETACH section of DllMain() but the thread that attached
your DLL will not take this route, and trying to clean up TLS from
DLL_PROCESS_DETACH is dangerous at best.

Good luck.


Gilbert W. Pilz Jr.
Systems Software Consultant
[email protected]
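
[A rough sketch of the wrapper Gilbert describes: a table mapping each TLS
index to its destructor, plus a routine to run the destructors as a thread
exits. All names are illustrative, locking of the table is omitted for
brevity, and the DllMain()/DLL_PROCESS_DETACH caveats above still apply.]

#include <windows.h>

#define MAX_TLS_KEYS 64

typedef void (*tls_destructor_t)(void *);

static struct {
    DWORD            index;
    tls_destructor_t destructor;
    BOOL             in_use;
} tls_keys[MAX_TLS_KEYS];    /* a real version would guard this table */

/* Wrapped TlsAlloc(): also records the destructor for the new slot. */
DWORD my_tls_alloc(tls_destructor_t destructor)
{
    DWORD index = TlsAlloc();
    int i;

    if (index != TLS_OUT_OF_INDEXES) {
        for (i = 0; i < MAX_TLS_KEYS; i++) {
            if (!tls_keys[i].in_use) {
                tls_keys[i].index = index;
                tls_keys[i].destructor = destructor;
                tls_keys[i].in_use = TRUE;
                break;
            }
        }
    }
    return index;
}

/* Call as each thread exits -- e.g., from DLL_THREAD_DETACH in DllMain(),
   or explicitly at the end of the thread routine. */
void my_tls_run_destructors(void)
{
    int i;

    for (i = 0; i < MAX_TLS_KEYS; i++) {
        if (tls_keys[i].in_use) {
            void *value = TlsGetValue(tls_keys[i].index);
            if (value != NULL && tls_keys[i].destructor != NULL)
                tls_keys[i].destructor(value);
        }
    }
}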

=================================TOP===============================
 Q115: Onyx1 architecture has one problem 
Hi there,

I made some parallel measurements on SGI's.

It seemed that the Onyx1 architecture has one problem:
the system bus, which introduces a communication bottleneck.
Memory access and floating-point calculation introduce traffic
on that bus.

Measurements on the new Onyx2 crossbar-based architecture
suggested that these problems would be solved. However, some early
measurements suggested two thoughts:
1. Floating-point calculation scales better on the Onyx2 architecture,
which suggests that this problem was really communication related
(-> crossbar). Going beyond 4 processors (more than one crossbar),
the scaling goes down.

2. Memory allocation:
Memory allocation (basically a sequential operation) is *really*
slow. Most time is spent at the locking mechanism. This surprises
me, because the pthread mutexes I'm using in the code are called
at least as much as the memory allocation, but they are much faster.

Does anybody at SGI have some hints to explain this behaviour?

Thanks,

Dirk
-- 

======== 8< ======= 8< ======= 8< ======= 8< ======= 
Dirk Bartz                  University of Tuebingen

Dirk Bartz  writes:

> 2. Memory allocation:
> Memory allocation (basically a sequential operation) is *really*
> slow. Most time is spent at the locking mechanism.

I have noticed this as well, albeit with the `old' sproc-threads on
Irix-5.3.  ptmalloc seems to be an order of magnitude faster in the
presence of multiple threads on that platform:

  ftp://ftp.dent.med.uni-muenchen.de/pub/wmglo/ptmalloc.tar.gz

However, for Irix-6 with pthreads, you have to use a modified
ptmalloc/thread-m.h file, as I've recently discovered.  I will send
you that file by mail if you're interested; it will also be in the
next ptmalloc release, due out RSN.

Regards,
Wolfram.

=================================TOP===============================
 Q116: LinuxThreads linked with X11 seg faults. 

Unfortunately the X11 libraries are not compiled with -D_REENTRANT, hence
the problems. You can get the source for the X11 libraries and rebuild them
with the -D_REENTRANT flag and that should help.

If you are using Motif you are out of luck. I spoke to the folks who supply
motif for RedHat Linux. They refused to give me a version recompiled with
the -D_REENTRANT flag. They gave me a load of crap about having to test
it and so forth.

I tried using LessTif, but it seemed to be missing too much.

Neil

=================================TOP===============================
 Q117: Comments about Linux and Threads and X11 

> LinuxThreads linked with X11 by g++ causes calls to the X11 library to seg
> fault.

You can either use Proven's pthread package or LinuxThreads.

Proven's is a giant replacement for the standard libraries that
does user level threads in a single process. 

LinuxThreads uses the operating system clone() call to implement
threads as separate processes that share the same memory space.
LinuxThreads seems to be tricky to install as it requires new
versions of the standard libraries in addition to a 2.x kernel and
the pthread library. However, if you get the latest version of RedHat,
you're all set.

I've found Proven's implementation to be much faster, though somewhat
messier to compile and a bit incomplete in its system call
implementation (remember, it has to provide a substitute for almost
every system call). Unfortunately I had to switch to LinuxThreads
because the signal handling under Proven's threads was not working
properly.

In particular, disk performance seems to suffer under LinuxThreads.
As far as I can tell, the OS level disk caching scheme gets confused
by all the thread/processes that are created.

It's also a bit unnerving typing "ps" and seeing forty copies
of your application running!

...




=================================TOP===============================
 Q118: Memory barriers for synchonization 

Joe Seigh wrote:
> 
> So there are memory barriers in mutexes, contrary to what has been stated
> before in this newsgroup.  Furthermore, it appears from what you are saying
> that the mutex lock acts as a fetch memory barrier and the mutex unlock
> acts as a store memory barrier, much like Java's mutex definitions.
> Which is not surprising.  Java appears to have carried over quite a bit of the
> POSIX thread semantics.

This is not QUITE correct. First off, the semantic of locking or unlocking a
mutex makes no distinction regarding read or write. In an architecture that
allows reordering reads and writes, neither reads nor writes may be allowed
to migrate beyond the scope of the mutex lock, in either direction. That is,
if the architecture supports both "fetch" and "store" barriers, you must
apply the behavior of both to locking AND unlocking a mutex.

The Alpha, for example, uses MB to prevent reordering of both reads and
writes across the "barrier". There's also a WMB that allows read reordering,
but prevents write reordering. WMB, while tempting and faster, CANNOT be
used to implement (either lock or unlock of) a POSIX mutex, because it
doesn't provide the necessary level of protection against reordering.

Finally, let's be sure we're speaking the same language, ("Gibberishese").
People use "memory barrier" to mean various things. For some, it means a
full cache flush that ensures total main memory coherency with respect to
the invoking processor. That's fine, but it's stronger than required for a
mutex, and it's not what *I* mean. The actual required semantic (and that of
the Alpha) is that a "memory barrier" controls how memory accesses by the
processor may be reordered before reaching main memory. There's no "flush",
nor is such a thing necessary. Instead, you ensure that data (reads and
writes) cannot be reordered past the lock, in either of the processors
involved in some transaction.

An MB preceding the unlock of a mutex guarantees that all data visible to
the unlocking processor is consistent as of the unlock operation. An MB
following the lock of the mutex guarantees that the data visible to the
locking processor is consistent as of the lock operation. Thus, unlocking a
mutex in one thread does not guarantee consistent memory visibility to
another thread that doesn't lock a mutex. Coherent memory visibility, in the
POSIX model, for both readers and writers, is guaranteed only by calling
specific POSIX functions; the most common of which are locking and unlocking
a mutex. A "memory barrier", of any sort, is merely one possible hardware
mechanism to implement the POSIX rules.

/---------------------------[ Dave Butenhof ]--------------------------\
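
[To make the visibility rule concrete, a small sketch: data written before
unlocking a mutex is guaranteed visible to another thread only after that
thread locks the same mutex. Names are illustrative.]

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int data;
static int ready = 0;

void producer(void)
{
    pthread_mutex_lock(&lock);
    data = 42;                      /* both stores become visible... */
    ready = 1;
    pthread_mutex_unlock(&lock);    /* ...to whoever locks the mutex next */
}

int consumer(void)
{
    int result = -1;

    pthread_mutex_lock(&lock);      /* the lock guarantees we see the stores */
    if (ready)
        result = data;
    pthread_mutex_unlock(&lock);
    return result;
}

/* Reading `ready' or `data' without locking the mutex carries no such
   guarantee, no matter what the producer does. */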

=================================TOP===============================
 Q119: Recursive mutex debate 

Robert White wrote:
> 
> I STRONGLY DISAGREE with the idea that recursive mutexes "are a bad idea".
> 
> I have made and use a recursive mutex class in several key C++ endeavors.  As a
> low-level tool recursive mutexes are "bad" in that they tend to lead the sloppy
> down dangerous roads.  Conversly, in experienced hands an recursive mutex is a
> tool of simple elegance.  The core thing, as always, is "knowing what you are
> doing".

Hey, look, recursive mutexes aren't illegal, they're not "morally
perverse", and with XSH5 (UNIX98) they're even standard and portable.
So, fine -- if you like them, you use them. Use them as much as you
like, and in any way you like.

But remember that they're ALWAYS more expensive than "normal" mutexes
(unless your normal mutexes are more expensive than they need to be for
the platform!). And remember that WAITING on a condition variable using
a recursively locked mutex simply won't work. So, if you're using a
condition variable to manage your queue states, you need to at least
analyze your lock usage sufficiently to ensure that the wait will work.
And, once you've done that, it's a simple step to dropping back to a
normal mutex.

There are definitely cases where the expense is acceptable, especially
when modifying existing code -- for example, to create a thread-safe
stdio package. The performance isn't "extremely critical", and you don't
need to worry about condition wait deadlocks (there's no reason to use
them in stdio). Sorting out all of the interactions between the parts of
the package is difficult, and requires a lot of new coding and
reorganization -- and implementing some of the correct semantics gets
really tricky.

Don't waste time optimizing code that's not on the critical path. If
you've got code that's on your critical path, and uses recursive
mutexes, then it's NOT optimized. If you care, you should remove the
recursive mutexes. If you don't care, fine. If the use of recursive
mutexes in non-critical-path code doesn't put it on the critical path,
there's no reason to worry about them.

Still, I, personally, would use a recursive mutex in new code only with
extreme reluctance and substantial consideration of the alternatives.

/---------------------------[ Dave Butenhof ]--------------------------\

[I echo Dave's "extreme reluctance and substantial consideration of the 
 alternatives" -Bil]
=================================TOP===============================
 Q120: Calling fork() from a thread 

> Can I fork from within a thread ?

Absolutely.

> If that is not explicitly forbidden, then what happens to the other threads in
> the child process ?

There ARE no other threads in the child process. Just the one that
forked. If your application/library has background threads that need to
exist in a forked child, then you should set up an "atfork" child
handler (by calling pthread_atfork) to recreate them. And if you use
mutexes, and want your application/library to be "fork safe" at all, you
also need to supply an atfork handler set to pre-lock all your mutexes
in the parent, then release them in the parent and child handlers.
Otherwise, ANOTHER thread might have a mutex locked when one thread
forks -- and because the owning thread doesn't exist in the child, the
mutex could never be released. (And, worse, whatever data is protected
by the mutex is in an unknown and inconsistent state.)

One draft of the POSIX standard had included the UI thread notion of
"forkall", where all threads were replicated in the child process. Some
consider this model preferable. Unfortunately, there are a lot of
problems with that, too, and they're harder to manage, because there's
no reasonable way for the threads to know that they've been cloned. (UI
threads allows that blocking kernel functions MAY fail with EINTR in the
child... but that's not a very good basis for a recovery mechanism.)
After much discussion and gnashing of teeth and tearing of hair, the
following draft removed the option of forkall.

> Is there a restriction saying that it's OK provided the child immediately does
> an exec ?

Actually, this is the ONLY way it's really safe, unless every "facility"
in the process has proper and correct forkall handling to protect all of
the process state across the fork.

In fact, despite the addition of forkall handlers in POSIX 1003.1c, the
standard specifically says that the child process is allowed to call
only async signal safe functions prior to exec. So, while the only real
purpose of forkall is to protect the user-mode state of the process,
you're really not guaranteed that you can make any use of that state in
the child.

> What if I do this on a multiprocessor machine ?

No real difference. You're more likely to have "stranded" mutexes and
predicates, of course, in a non-fork-safe process that forks, because
other threads were doing things simultaneously. But given timeslicing
and preemption and other factors, you can have "other threads" with
locked mutexes and inconsistent predicates even on a uniprocessor.

Just remember, that, in a threaded process, it's not polite to say "fork
you" ;-)

/---------------------------[ Dave Butenhof ]--------------------------\
> David Butenhof wrote:
>> 
>> The "UI thread" version of fork() copies ALL threads in the child. The
>> more standard and reasonable POSIX version creates a child process with a
>> single thread -- a copy of the one that called fork().
>> 
> Sorry to ask...what do you mean by `the "UI thread" version of fork()'?
> I'm a little confused here.

Alright, if you're only "a little confused", then we haven't done our jobs. 
We'll try for "very confused", OK? Let me know when we're there. ;-)

First, the reference to "UI threads" may have seemed to come out of the 
blue if you're new to this newsgroup and threads; so let's get that out of 
the way. "UI" was a committee that for a time controlled the direction and 
architecture of the System V UNIX specification. (UNIX International.) The 
thread interfaces and behavior they defined (which was essentially what Sun 
had devised for Solaris, modified somewhat along POSIX lines in places) are 
commonly known as "UI threads". (Or sometimes "Solaris threads" since they 
originated on Solaris and aren't widely available otherwise.)

The UI thread definition of fork() is that all threads exist, and continue 
execution, in the child process. Threads that are blocked, at the time of 
the fork(), in a function capable of returning EINTR *may* do so (but need 
not). The problem with this is that fork() in a process where threads work 
with external resources may corrupt those resources (e.g., writing 
duplicate records to a file) because neither thread may know that the 
fork() has occurred. UI threads also has fork1(), which creates a child 
containing only a copy of the calling thread. This is equivalent to the 
POSIX fork() function, which provides a more controlled environment. (You 
can always use pthread_atfork() handlers to create daemon threads, or 
whatever else you want, in the child.)
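
[A minimal sketch of the pthread_atfork() arrangement Dave describes: lock
the library's mutex in the prepare handler, unlock it in both the parent and
child handlers, and recreate any daemon thread in the child. All names are
illustrative.]

#include <pthread.h>

static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;

static void *background_thread(void *arg)
{
    /* hypothetical daemon work loop */
    return NULL;
}

static void atfork_prepare(void) { pthread_mutex_lock(&lib_lock); }
static void atfork_parent(void)  { pthread_mutex_unlock(&lib_lock); }

static void atfork_child(void)
{
    pthread_t tid;

    pthread_mutex_unlock(&lib_lock);
    /* The child contains only the forking thread; recreate the daemon. */
    pthread_create(&tid, NULL, background_thread, NULL);
}

void lib_init(void)
{
    pthread_atfork(atfork_prepare, atfork_parent, atfork_child);
}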
 
=================================TOP===============================
 Q121: Behavior of [pthread_yield()] sched_yield() 

> > I have a question regarding POSIX threads on Linux and Solaris. The
> > program below compiles and links well on both systems, but instead of the
> > expected "100000, " it always prints out
> > "100000, 0", so the thread is not really ever started.
> 
> 
> 
> Well, both sets of output are legal and correct for the code you supplied.

Yes, this is correct.

> First, you see [p]thread_yield does not say "give control to another thread";
> it says "if there is another thread that can be run, now might be a good time
> to do that".  The library is under no obligation to actually yield.  (there
> is a good explanation of this elsewhere in this group, but it has to do with
> the fact that you are running under SCHED_OTHER semantics which are
> completely unspecified semantics, go figure.)

Just for clarity...

    pthread_yield is an artifact of the obsolete and crufty old
    DCE thread implementation (loose interpretation of the 1990
    draft 4 of the POSIX thread standard). It doesn't exist in
    POSIX threads.

    thr_yield is an artifact of the UI threads interface, which
    is, (effectively though not truly), Solaris proprietary.

    sched_yield is the equivalent POSIX function.

As Robert said, POSIX assigns no particular semantics to the SCHED_OTHER
scheduling policy. It's just a convenient name. In the lexicon that we
developed during the course of developing the realtime and thread POSIX
standards, it is "a portable way to be nonportable". When you use
SCHED_OTHER, which is the default scheduling policy, all bets are off.
POSIX says nothing about the scheduling behavior of the thread.
(Although it does require a conforming implementation to DOCUMENT what
the behavior will be.)

Because there's no definition of the behavior of SCHED_OTHER, it would
be rather hard to provide any guarantees about the operation of the
sched_yield function, wouldn't it?

If you want portable and guaranteed POSIX scheduling, you must use the
SCHED_FIFO or SCHED_RR scheduling policies (exclusively). And, of
course, you need to run on a system that supports them.

> Next, the number of threads in a (POSIX) program does not necessarily say
> anything about the number of actual lightweight processes that will be used to
> execute the program.  In your example there is nothing that "forcably" causes
> the main thread to give up the processor (you are 100% CPU related) so your
> first thread runs through to completion.  An identically arranged ADA program
> (which wouldn't quite be possible 8-) would have equally unstable results.
> (I've seen students write essentially this exact program to "play with" tasks
> and threads in ADA and C, but the program is not valid in any predictable
> way.)

POSIX doesn't even say that there's any such thing as a "light weight
process". It refers only obliquely to the hypothetical concept of a
"kernel execution entity", which might be used as one possible
implementation mechanism for Process Contention Scope thread scheduling.

> Finally, POSIX only says that there will be "enough" LWPs at any moment to
> ensure that the program as a whole "continues to make progress".

That's not strictly true. All POSIX says is that a thread that blocks
must not indefinitely prevent other threads from making progress. It
says nothing about LWPs, nor places any requirements upon how many there
must be.

> When you do the SIGINT from the keyboard you are essentially causing the
> "current" thread to do a pthread_exit/abort. Now there is only one thread
> left, the "second" one, so to keep the program progressing that one get's the
> LWP from the main thread.  That is why you see the second start up when you
> do a "^C"...

SIGINT shouldn't "do" anything to a thread, on a POSIX thread system. IF
it is not handled by a sigaction or a sigwait somewhere in the process,
the default signal action will be to terminate the process (NOT the
thread).

It's not clear from the original posting exactly where the described
results were seen: Linux or Solaris? My guess is that this is Linux,
with the LinuxThreads package. Your threads are really cloned PROCESSES,
and I believe that LinuxThreads still does nothing to properly implement
the POSIX signal model among the threads that compose the "process".
That may mean that, under some circumstances, (and in contradiction to
the POSIX standard), a signal may affect only one thread in the process.
The LinuxThreads FaQ says that SIGSTOP/SIGCONT will affect only the
targeted thread, for example. Although it also says that threads "dying"
of a signal will replicate the signal to the other threads, that might
not apply to SIGINT, or there might be a timing window or an outright
hole where that's not happening in this case.

LinuxThreads is, after all, a freeware thread package that's from all
reports done an excellent job of attacking a fairly ambitious goal. A
few restrictions and nonconformancies are inevitable and apparently
acceptable to those who use it (although it's gotta be a portability
nightmare for those who use signals a lot, you're always best off
avoiding signals in threaded programs anyway -- a little extra
"incentive" isn't a bad thing). If you see this behavior on Solaris,
however, it's a serious BUG that you should report to Sun.

> The very same program with a single valid "operational yield" (say reading a
> character from the input device right after the pthread_create()) will run at
> 100% CPU forever because it will never switch *OUT* of the second thread.

At least, that's true on Solaris, where user threads aren't timesliced.
To get multiple threads to operate concurrently, you need to either
manually create additional LWPs (thr_setconcurrency), or create the
threads using system contention scope (pthread_attr_setscope) so that
each has its own dedicated LWP. Solaris will timeslice the LWPs so that
multiple compute-bound threads/processes can share a single processor.
LinuxThreads directly maps each "POSIX thread" to a "kernel thread"
(cloned process), and should NOT suffer from the same problem. The
kernel will timeslice the "POSIX threads" just as it timeslices all
other processes in the system. On Digital UNIX, the 2-level scheduler
timeslices the user ("process contention scope") threads, so, if a
compute-bound SCHED_OTHER thread runs for its full quantum, another
thread will be given a chance to run.

> In essence there is no good "Hello World" program for (POSIX) threads (Which
> is essentially what you must have been trying to write 8-).  If the threads
> don't interact with the real world, or at least eachother, the overall
> program will not really run.  The spec is written to be very responsive to
> real-world demands.  That responsiveness in the spec has this example as a
> clear degenerate case.

That's not true. "Hello world" is easy. If the thread just printed
"Hello world" and exited, and main either joined with it, or called
pthread_exit to terminate without trashing the process, you'd see
exactly the output you ought to expect, on any conforming POSIX
implementation. The problem is that the program in question is trying to
execute two compute-bound threads concurrently in SCHED_OTHER policy:
and the behavior of that case is simply "out of scope" for the standard.
The translation of which is that there's no reasonable assumption of a
portable behavior.

/---------------------------[ Dave Butenhof ]--------------------------\
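
[For completeness, the "easy" hello-world Dave describes: the thread prints
and exits, and main() joins with it, so the output is well defined on any
conforming implementation -- no yielding games required.]

#include <pthread.h>
#include <stdio.h>

void *hello(void *arg)
{
    printf("Hello world\n");
    return NULL;
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, hello, NULL);
    pthread_join(tid, NULL);    /* or call pthread_exit(NULL) from main */
    return 0;
}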


=================================TOP===============================
 Q122: Behavior of pthread_setspecific() 

> Can you explain the discrepancy between your suggestion and the
> following warning, which I found in the SunOS 5.5.1 man page for
> "pthread_setspecific".
> 
> ******************************************************************
> WARNINGS
>      pthread_setspecific(),                pthread_getspecific(),
>      thr_setspecific(),  and  thr_getspecific(),  may  be  called
>      either explicitly, or implicitly from a thread-specific data
>      destructor function.  However, calling pthread_setspecific()
>      or thr_setspecific() from a destructor may  result  in  lost
>      storage or infinite loops.
> 
> SunOS 5.5.1         Last change: 30 Jun 1995                    4
> ******************************************************************
> 
> I'm not sure how an infinite loop might occur, while using
> "pthread_setspecific" in a destructor.  Do you know the answer?

We're talking about two different things.

1) What the standard says, which is that the destructor is called, and
   may be called repeatedly (until a fixed, implementation specified
   limit, or forever), until the thread-specific data values for the
   thread become NULL. Because the standard doesn't say that the
   implementation is required to clear the value for each key as the
   destructor is called, that requirement is, implicitly, placed on the
   application. (This oversight will be corrected in a future update
   to the standard.)

   In order to set the value to NULL, you clearly must call the function
   pthread_setspecific() within the destructor. Note that setting the
   value to NULL within the destructor will work either with the current
   standard (and the current LinuxThreads literal implementation) AND
   with the fixed standard (and most other implementations, which have
   already implemented the correct semantics, figuring that an infinite
   loop usually is not desirable behavior).

2) The correct POSIX semantics, which are implemented by Solaris and
   Digital UNIX. (Probably also by IRIX, HP-UX, and AIX, although I
   haven't been able to verify that.) The Solaris manpage warning is
   imprecise, however. There's no problem with a destructor explicitly
   setting the value to NULL. The warning SHOULD say that setting a
   thread-specific data value to any non-NULL value within a destructor
   could lead to an infinite loop. Or, alternately, to a memory leak, if
   the new value represents allocated heap storage, and the system has
   a limit to the number of times it will retry thread-specific data
   destruction.

/---------------------------[ Dave Butenhof ]--------------------------\
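
[A small sketch of the portable destructor idiom described above: free the
storage and explicitly set the value back to NULL. Names are illustrative.]

#include <pthread.h>
#include <stdlib.h>

static pthread_key_t key;

static void tsd_destructor(void *value)
{
    free(value);
    pthread_setspecific(key, NULL);   /* never set a non-NULL value here */
}

void tsd_init(void)
{
    pthread_key_create(&key, tsd_destructor);
}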

=================================TOP===============================
 Q123: Linking under OSF1 3.2: flags and library order  

Joerg Faschingbauer wrote:

> Hi,
>
> recently I posted a question about the correct linking order under
> Solaris 2.4. Got some valuable hints, thanks.
>
> I have a similar problem now, this time under OSF1 3.2. Can anybody
> tell me if the following is correct? I could not find any hints on
> that topic in the man pages.
>
> gcc ... -ldnet_stub -lm -lpthreads -lc_r -lmach
>
> Does pthreads need stuff from c_r, or the other way around? Do I need
> mach at all? Do I need dnet_stub at all?

In a threaded program prior to Digital UNIX 4.0, EVERYTHING needs
libc_r, because libc is not thread-safe. Yes, the thread library
requires libmach, and, because of bizarre symbol preemption requirements
(which, for trivia junkies, were at one time required by OSF for "OSF/1"
branding), if you don't include libmach explicitly things might not work
out right. You must specify libmach BEFORE libc_r. You don't need
-ldnet_stub unless YOU need it (or some other library you're including).
We certainly don't use it.

The best way to build a threaded program on 3.2 is to use "cc -threads".
If you're going to use gcc, or an older cxx that doesn't support
"-threads", or if you need to use ld to link, then the proper expansion
of "-threads" is:

     for compilation:
          -D_REENTRANT
     for linkage:
          -lpthreads -lmach -lc_r

The linkage switches must be the LAST libraries, exclusive of libc. That
is, if you were using ld to link, ...

     ld <.o files...> -lpthreads -lmach -lc_r -lc crt0.o

I don't believe the position of -lm with respect to the thread libraries
will matter much, since it's pretty much independent. If you use -lm
-threads, however, libm will precede the thread libraries, and that's a
good standard to follow.

A side effect of "-threads" is that ld will automatically look for a
reentrant variant of any library that you specify. That is, if you
specify "-lfoo", and there's a "libfoo_r", ld will automatically use
libfoo_r. If you don't use -threads, you'll need to check /usr/shlib (or
/usr/lib if you're building non-shared) for reentrant variants.

Note that, to compile a DCE thread (draft 4) threaded program once you
move to Digital UNIX 4.0 or higher, the compilation expansion of
-threads will need to be changed to "-D_REENTRANT -D_PTHREAD_USE_D4",
and the list of libraries should be "-lpthreads -lpthread -lmach -lexc".
There's no libc_r on 4.0 (libc is fully thread-safe), and you need
libexc since we've integrated with the standard O/S exception mechanism.
Note the distinction between libpthread (the "core" library implementing
POSIX threads), and libpthreads (the "legacy" library containing DCE
thread and CMA wrapper functions on top of POSIX thread functions).

Minor additional notes: as of Digital UNIX 4.0D we've dropped the final
dependencies on the mach interfaces, so libmach is no longer required
(you'll get smaller binaries and faster activation by omitting it once
you no longer need to support earlier versions). And, of course, once
you've moved to 4.0 or later, you should port to POSIX threads, in which
case you can drop -lpthreads and -D_PTHREAD_USE_D4.

/---------------------------[ Dave Butenhof ]--------------------------\
=================================TOP===============================
 Q124: What is the TID during initialization?  

Lee Sailer wrote:

> In a program I am "maintaining", there is a
>
>    foo = RWThreadID();
>
> call at global scope.  Conceptually, this gets called before main().
> Does this seem OK?  Can this code rely on the "thread" being used to do
> initial construction to be the same as the "main thread"?

[For example, the .init sections of libraries run before main() starts. -Bil]

While the assumption will likely be true, most of the time, it strikes me
as an extremely dangerous and pointless assumption. There are a lot of
reasons why it might NOT be true, sometimes, on some platforms, under some
circumstances. There's no standard or rule of etiquette forbidding a
difference. Even if the "thread" is the same, the "thread ID" might change
as things get initialized.

I recommend avoiding any such assumptions.

/---------------------------[ Dave Butenhof ]--------------------------\

=================================TOP===============================
 Q125: TSD destructors run at exit time... and if it crashes?  

Sebastien Marc wrote:

> On Solaris you can associate a function (called destructor) that will be
> called at the termination of the thread, even if it crashes.

Almost. Both POSIX and UI threads interfaces include thread-specific data.
When you create a thread-specific data (TSD) key, you can specify a
destructor function that will be run when any thread with a non-NULL value
for that key terminates due to cancellation [POSIX only] or voluntary thread
exit (return from the thread's start routine, or a thread exit call --
pthread_exit or thr_exit).

Yes, you can use that as a sort of "atexit" for threads, if you make sure
that each thread uses pthread_setspecific/thr_setspecific to SET a non-NULL
value for the TSD key. (The default value is NULL, and only the thread itself
can set a value.)

However, that doesn't help. There is simply no way that a thread can "crash"
without taking the process with it. An unhandled signal will never terminate a
thread -- either the signal is ignored, or it does something to the process
(stop, continue, terminate). TSD destructors are NOT run:

   * on the child side of a fork
   * in a call to exec
   * in process termination, regardless of whether that termination is
     voluntary (e.g., a call to exit) or involuntary (an unhandled signal).

In all those cases, threads quietly "evaporate", leaving no trace of their
existence. No TSD destructors, no cleanup handlers, nothing. Gone. Poof.

/---------------------------[ Dave Butenhof ]--------------------------\

=================================TOP===============================
 Q126: Cancellation and condition variables  

Marcel Bastiaans wrote:

> Anyone:
>
> I appear to be missing something in my understanding of how condition
> variables work.  I am trying to write a multithreaded program which is
> portable to various platforms.  I am unable to cancel a thread if it is
> waiting on a condition variable which another thread is waiting on also.
> The problem can easily be reproduced on both Solaris 2.5 and HP-UX 10.10.  A
> simple program which demonstrates my problem is shown below.  This program
> sample uses the HP-UX pthreads library but the problem also appears when
> using Solaris threads on Solaris 2.5.

In any case... yes, you are missing something. The program, as written, will
hang on any conforming (or even reasonably correct) implementation of either
DCE threads or POSIX threads. (To put it another way, any implementation on
which it succeeds is completely broken.)

> Is there a problem in this program which I don't understand?  I cannot use
> cleanup handlers because not all platforms support them.  Any help would be
> greatly appreciated.

If you can use cancellation, you can use cleanup handlers. Both are part of
both DCE threads (what you're using on HP-UX 10.10) and POSIX threads (what you
probably are, and, at least, should be, using on Solaris 2.5.) If you've got
cancellation, and you don't have cleanup handlers, you've got an awesomely
broken implementation and you should immediately chuck it.

When you wait on a condition variable, and the thread may be cancelled, you
MUST use a cleanup handler. The thread will wake from the condition wait with
the associated mutex locked -- even if it was cancelled. If the thread doesn't
then unlock the mutex before terminating, that mutex cannot be used again by
the program... it will remain locked by the cancelled thread.

> #include <pthread.h>
>
> pthread_cond_t cond;
> pthread_mutex_t mutex;
>
> void * func(void *)
> {
>    // Allow this thread to be cancelled at any time
>    pthread_setcancel(CANCEL_ON);
>    pthread_setasynccancel(CANCEL_ON);

Serious, SERIOUS bug alert!! DELETE the preceding line before proceeding with
this or any other program. Never, ever, enable async cancelation except on
small sections of straight-line code that does not make any external calls.
Better yet, never use async cancel at all.

In any case, you absolutely CANNOT call any POSIX (or DCE) thread function with
async cancellation enabled except the ones that DISABLE async cancel. (For
bizarre and absolutely unjustifiable reasons [because they're wrong], POSIX
threads also allows you to call pthread_cancel -- but don't do it!)

>    // Wait forever on the condition var
>    pthread_mutex_lock(&mutex);
>    for(;;) {
>       pthread_cond_wait(&cond, &mutex);
>    }
>    pthread_mutex_unlock(&mutex);
>    return 0;
> }

I suspect your problem is in cancelling the second thread. As I said,
cancellation terminates the condition wait with the associated mutex locked.
You're just letting the thread terminate with the mutex still locked. That
means, cancelled or not, the second thread can never awaken from the condition
wait. (At a lower level, you could say that it HAS awakened from the condition
wait, but is now waiting on the mutex... and a mutex wait isn't cancellable.)

The answer is... if you use cancellation, you must also use cleanup handlers.
(Or other, non-portable equivalent mechanisms, such as exception handlers or
C++ object destructors... on platforms where they're implemented to
interoperate with cancellation. [Both Solaris and Digital UNIX, for example,
run C++ destructors on cancellation.])

/---------------------------[ Dave Butenhof ]--------------------------\
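
[A corrected sketch of the wait loop above, along the lines Dave suggests: no
async cancellation, and a cleanup handler that unlocks the mutex if the
thread is cancelled while waiting. POSIX interfaces; the handler name is
illustrative.]

#include <pthread.h>

static pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void unlock_mutex(void *arg)
{
    pthread_mutex_unlock((pthread_mutex_t *)arg);
}

void *func(void *arg)
{
    pthread_mutex_lock(&mutex);
    pthread_cleanup_push(unlock_mutex, &mutex);
    for (;;) {
        /* pthread_cond_wait() is a cancellation point; if the thread is
           cancelled here it wakes with the mutex locked, and the cleanup
           handler releases it on the way out. */
        pthread_cond_wait(&cond, &mutex);
    }
    pthread_cleanup_pop(1);    /* never reached, but must pair with the push */
    return NULL;
}
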
=================================TOP===============================
 Q127: RedHat 4.2 and LinuxThreads?  

> > The Linux kernel has supported multithreading for a very long time.
> 
> thank you for the info, Bill.  the man page for clone that ships with
> Red Hat 4.2 states that clone does not work.  here are my questions.
> they all relate to Red Hat 4.2:
> 
> 1. does clone work for all defined parameter values?
> 2. where can i find a list of the c library api's that are not
> reentrant?
> 3. does RedHat 4.2 install LinuxThreads if "everything" is selected?
> 
> 
> > Until recently, the API definition for POSIX thread support was
> > contained in the LinuxThreads package, but that's just a wrapper
> > around the kernel's built-in functioning.  With the release of libc6
> > (GNU libc) the LinuxThreads functionality is more tightly integrated
> > into the basic C library,
> 
> do you mean that the POSIX thread api's are now in libc so that
> LinuxThreads is obsolete?

With the glibc2 (2.0.5 c is current I think) LinuxThreads is
obsolete. However, you have to get yourself the additional
glibc-linuxthreads package, but that's a detail.

AFAIK glibc2 is still in beta, but it works quite
well. Moreover, it is recommended to use glibc2 for multithreading
rather than libc5. As H.J.Lu, libc5's maintainer, once stated: "I'm
surprised it works at all" (or so).

You can install a "beta" of the libc6 (aka glibc2) as a secondary C
library against which you link your program, and keep the good old
libc5 as the primary library which the system related programs use.

Take a look at 

http://www.imaxx.net/~thrytis/glibc/

for HOWTOs etc.

Joerg
----------------------------------------------------------------------------
Joerg Faschingbauer                                     [email protected]
Voice: ++43/316/820918-31                            Fax: ++43/316/820918-99
----------------------------------------------------------------------------

=================================TOP===============================
 Q128: How do I measure thread timings?  
Andy Sunny wrote:

> I'm conducting some research to measure the following things about
> pthreads using a Multikron II Hardware Instrumentation Board from NIST
> 1) thread creation time (time to put thread on queue)
> 2) thread waiting time (time that thread waits on queue)
> 3) thread execution time (time that thread actually executes)
>
> Are there any decent papers that explain the pthreads run time system
> and scheduling policy in DETAIL? I have read Frank Mueller's (FSU) paper
> and am trying to obtain the standard from IEEE. What is the latest
> version of the standard and will it help me find the proper libraries
> and functions need to measure the above items?

The standard is unlikely to be of any help to you. It says nothing at all
about implementation. POSIX specifies SOURCE-LEVEL interfaces, and
describes the required portable semantics of those interfaces.
Implementation details are (deliberately, properly, and necessarily) left
entirely to the creator of each implementation. For example, there's no
mention of libraries -- an embedded system, for example, might include all
interfaces in an integrated kernel; and that's fine.

What you need is a document describing the internal implementation details
of the particular system you're using. If the vendor can't supply that,
you'll need to create it yourself -- either by reading source, if you can
get it, or by flailing around blindly in the dark and charting the walls
you hit.

/---------------------------[ Dave Butenhof ]--------------------------\
=================================TOP===============================
 Q129: Contrasting Win32 and POSIX thread designs  
Arun Sharma wrote:

> On Mon, 24 Nov 1997 18:10:13 GMT, Christophe Beauregard wrote:
>
>         c>  while thread context is a Windows concept.
>
> How so ? pthreads don't have contexts ?

This looks like an interesting discussion, of which I've missed the
beginning. (Perhaps only the followup was cross-posted to
comp.programming.threads?) Anyway, some comments:

Anything has "context".
A thread is an abstraction of the executable state traditionally
attributed to a process. The "process" retains the non-executable state,
including files and address space. Why would anyone contend that "thread
context is a Windows concept"? I can't imagine. Maybe it's buried in the
unquoted portion of the original message. And then again, some people
think Microsoft invented the world.

>         c> Generally, you'll find that pthreads gives you less control
>         c> over how a thread runs.  There are very good reasons for
>         c> this (one being portability, another being safety).
>
> In other words, it has to be the least common denominator in the
> fragmented UNIX world. No wonder people love NT and Win32 threads.

POSIX threads gives you far more real control over threads than Win32
(for example, far superior realtime scheduling control). What it doesn't
give you is suspend/resume and uncontrolled termination. Those aren't
"control over how a thread runs". They are extraordinarily poor
programming mechanisms that can almost never be used correctly. Yes, to
some people the key is "almost never", and one may argue that they should
be provided anyway for that 0.001% of applications that "need" it. (But
those of us who actually support threaded interfaces might also point out
that these extremely dangerous functions are for some reason particularly
tempting to beginners who don't know what they're doing -- resulting in
very high maintenance costs, which mostly involves helping them debug
problems in their code.)

This isn't an example of "fragmented UNIX" -- it's UNIX unity, with a
wide variety of different "UNIX camps" reaching a consensus on what's
necessary and useful.

While the Win32 interface comprises whatever the heck a few designers
felt like tossing in, POSIX was carefully designed and reviewed by a
large number of people, many of whom knew what they were doing. Omitting
these functions was a carefully considered, extensively discussed, and
quite deliberate decision. The Aspen committee that designed the thread
extensions to POSIX for the Single UNIX Specification, Version 2,
proposed suspend/resume -- they were later retracted by the original
proposer (with no objections). A POSIX draft standard currently under
development, 1003.1j, had proposed a mechanism for uncontrolled
termination, with the explicit recognition that it could be used (and
then only with extreme care) only in carefully constructed embedded
systems. It, too, was later withdrawn as the complications became more
obvious. (The notion that you can regain control of a process when you've
lost control of any one thread in the process is faulty, because all
threads depend completely on shared resources. If you've lost control of
a thread, you don't know the state of the process -- how can you expect
it to continue?)

>         c> Basically, using signals for dealing with threads is a Bad
>         c> Thing and people who try generally get screwed.
>
> It doesn't have to be so. That's an implementation problem.

Yes, it does have to be so, because signals are a bad idea to begin with.
Although there were enormous complications even before threads, the
concept becomes all but unsupportable with the addition of full
asynchronous execution contexts to the traditional process.

The "synchronous" signals, including SIGSEGV, should be language
exceptions. The other "asynchronous" signals should be handled
synchronously in independent contexts (threads). If you think about it,
that's what signals were attempting to do; the condition exists as a
separate execution context (the signal handler). Unfortunately, a signal
preempts the hardware context of the main execution context,
asynchronously. That's a really, really bad idea. Although people have
always casually done things like calling printf in signal handlers, too
few people realize that's always been incorrect and dangerous -- only a
small list of UNIX functions are "async-signal safe". The addition of
threads, however, allowing the process to have multiple contexts at any
time, increases the chances that some thread will be doing something that
will conflict with improper use of non async-signal safe functions at
signal level.

> Portability doesn't necessarily have to cripple the API.

And, in fact, it doesn't. It results in a well-designed and robust
interface that can be efficiently implemented everywhere. I'm not arguing
that the POSIX interface is perfect. There is room for additions, and the
Single UNIX Specification, Version 2, makes a good start. Other areas to
consider for future standardization would include debugging and analysis
interfaces. There are POSIX standards in progress to improve support for
"hard realtime" environments (for example, putting timeouts on all
blocking functions to control latency and help diagnose failures).

/---------------------------[ Dave Butenhof ]--------------------------\
=================================TOP===============================
 Q130: What does POSIX say about putting stubs in libc?  

Patrick TJ McPhee wrote:

> I'd like to know what Posix has to say about putting stubs in libc.
> Is it permitted? Is it required? What return code can we expect to
> receive from such a stub, and how can we portably ignore it?

POSIX doesn't have ANYTHING to say. POSIX 1003.1 doesn't really recognize the
existence of the concept of a "library". It defines a set of SOURCE LEVEL
interfaces that shall be provided by implementations and that may be used by
applications to achieve certain portable semantics. Now, 1003.2 says a little
about libraries. That is, there's something with a ".a" file suffix, and
there's a utility called "ar" to create them, and a utility called "c89" with
a "-l" switch that may read from a "lib.a" (and may also read
additional file suffixes, e.g., .so). 1003.2 doesn't say anything about which
symbols may or should be resolved from which libraries, though, and hasn't
been updated since 1003.1c-1995 (threads), in any case.

So, if your system's <unistd.h> provides a definition for _POSIX_THREADS, then
1003.1c-1995 tells you that you can potentially call pthread_create. It does
not tell you which libraries you need to link against. UNIX98 does specify
that, for c89, the proper incantation is "-lpthread". But even that's not the
same as a requirement that the symbols must resolve from, and only from, a
libpthread library: only that you're not allowed to build a portable threaded
application without REQUESTING libpthread. (And if you use cc instead of c89,
UNIX98 doesn't help you any more than POSIX, aside from the gentle SUGGESTION
that an implementation provide the thread implementation in a libpthread --
which had, in any case, already become the defacto industry standard.)

So, yes, it's "permitted", and, no, it's not "required".

If you're building an APPLICATION using threads, there's no confusion or
problem. You build according to the rules of the platform, and you've got
threads, wheresoever they might reside. If you try to use threads without
building properly, all bets are off, because you blew it. If you're getting
the interfaces accidentally from somewhere else, that's nobody's fault but
your own.

If you're trying to build thread-safe code that doesn't use threads, you've
got a portability problem. No standard will help you accomplish this. That's
too bad. Requiring libc "stubs" would be one way out -- but as I've already
said, (and as I'll reiterate in the next paragraph!), the Solaris
implementation has some serious limitations of which I don't approve. I would
not consider that an acceptable standard. I'm not entirely happy with our own
solution (a separate set of "tis" interfaces), either, because extra
interfaces are nobody's friend. One might say that there is room here for
innovation. ;-)

If you're trying to build a library that uses threads, regardless of whether
the main program uses threads -- well, you're in trouble again. You SHOULD be
able to simply build it as if you were building a threaded application, and it
should work. Unfortunately it won't work (portably) unless the main program is
linked against the thread library(s), whether or not it needs them. Symbol
preemption will work against you if there are "stubs" for any functions in a
library that will be searched by direct dependencies of the main program.
(Even if your library is searched first, ITS direct dependencies will go at
the end of the list.) That's the problem with the Solaris libc stubs. (I'd
like to say that Digital UNIX avoids this, and that's certainly the intent;
but unfortunately it's not yet entirely true. Although there are no stubs
conflicting with libpthread, we depend on the libexc exception library, which
has a conflicting stub in libc. Luckily, this affects relatively few
operations -- but, technically, it still means it doesn't work.)

On the other hand, your final question is easy. There's no need to "portably
ignore" the errors that a stub might generate. Look, if you try to create a
thread, it either succeeds or it fails. You get back 0, and it worked.
Anything else, and it failed. If the failure is EAGAIN, you might choose to
try again later. Otherwise... hey, you're just not going to create that
thread, so deal with it. The only question is: can you live with that? If you
don't NEED to create a thread, go on with life, single threaded. If you NEED
to create the thread, then you're done. (Whether you return a failure to your
caller, or abort the process, probably depends on what you're trying to do,
and how your interface is designed.) It really doesn't matter whether you got
activated against libc stubs or a real thread library that for some reason
refuses to create the thread. You're not going to do what you wanted to do,
and that's that.
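A minimal sketch of the pattern described above (not from the original post;
the worker function and the retry policy are invented for illustration):

#include <errno.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical worker function, just for illustration. */
static void *worker(void *arg)
{
    return arg;
}

/* Try to create a thread.  Retry briefly on EAGAIN; for anything else,
 * just report the failure.  Returns 0 on success, otherwise the error
 * code returned by pthread_create(). */
static int create_worker(pthread_t *tid)
{
    int tries;

    for (tries = 0; tries < 3; tries++) {
        int rc = pthread_create(tid, NULL, worker, NULL);
        if (rc == 0)
            return 0;           /* it worked */
        if (rc != EAGAIN)
            return rc;          /* hard failure: you're not getting a thread */
        sleep(1);               /* resource shortage: wait and try again */
    }
    return EAGAIN;              /* still couldn't create it; deal with it */
}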

/---------------------------[ Dave Butenhof ]--------------------------\
=================================TOP===============================
 Q131: MT GC Issues  

[See Geodesic Systems (www.geodesic.com)            -Bil]

Sanjay Ghemawat wrote:

> All collectors I have known of and read about control both the allocation
> and the deallocation of objects.  So it is fairly easy for them to grab
> all of the locks required before suspending threads.  The only problem
> here might be locks held within the operating system on behalf of a thread
> that is about to be suspended.  Even here, if one is using a thread interface
> like the one provided by Mach,  a call to "thread_abort" will pop the
> thread out of the OS.

There is no general or portable mechanism equivalent to
thread_abort, and it is pretty limited even on Mach. (First,
you have to know it's in a blocking Mach call.)

> >Furthermore, suspend/resume may possibly be necessary for concurrent garbage
> >collection (I haven't yet been convinced of that -- but I haven't found a good
> >alternative, either), but it's definitely far from desirable. It's an ugly and
> >stupendously inefficient kludge. Remember, you have COMPLETELY STOPPED the
> >application while garbage collecting. That's GOOD? Parallel applications want
>
> First of all, most incremental/concurrent collectors only stop the
> application while they find all of the pointers sitting in thread
> stacks/registers/etc.  The collectors that provide better real-time
> guarantees tend to make other operations (such as storing a pointer
> onto a stack) expensive.  I think there are two classes of systems
> here: hard real-time and others.  A good performance tradeoff for
> systems that do not require hard real-time bounds is to use an
> incremental/concurrent collector that may introduce pauses, but does
> not slow down the mutator with lots of book-keeping work.  So I think
> the argument that suspend/resume are bad only applies to some systems,
> not all.  Probably not even the vast majority of systems that run on
> desktops and commercial servers.

Look, if you want concurrency, you DON'T want pauses. I
already acknowledged that there might not be an alternative,
and that we certainly don't know of any now.  Maybe it's the
best tradeoff we can get. Stopping all the threads is still
bad.  Necessity does not equate to desirability. If you
really think that it is BENEFICIAL to stop the execution of
a concurrent process, argue on. Otherwise, drop it.

> >application while garbage collecting. That's GOOD? Parallel applications want
> >concurrency... not to be stopped dead at various unpredictable intervals so
> >the maintenance staff can check the trash cans. There has gotta be a better
> >way.
>
> So we should all wait and not use garbage-collection in multi-threaded
> programs until that better way is found?  I don't see why you are so
> vehemently set against suspend/resume.  It solves real problems for
> people implementing garbage collection.  Yes there are tradeoffs
> here: suspend/resume have their downside, but that doesn't mean we
> should ignore them.

Because suspend and resume are poor interfaces into the
wrong part of a scheduler for external use. They are
currently used (more or less) effectively for a small class
of specialized applications (e.g., garbage collection). They
are absolutely inappropriate for general use. The fact that
a bad function "can be used effectively" doesn't mean it
should be standardized. Standards do not, and should not,
attempt to solve all possible problems.

> So do that.  I don't think there are an unbounded number of such
> required operations.  In fact I have implemented a collector for a
> multi-threaded environment that requires just three such operations
> suspend, resume, and reading registers from a suspended thread.
> And folding in the register-state extraction into the suspend call
> seems like a fine idea.

Now, are we talking about "garbage collectors", or are we
talking about suspend and resume? All this rationalization
about garbage collection spills over naturally into a
discussion of suspend and resume -- but that's happenstance.
Sure, our concurrent GC systems, and especially Java, use
suspend/resume. But that's "because it was there", and
solved the problem of pinning down threads long enough to
get their state. But the function required of concurrent
garbage collection is not "suspend all threads and collect
their registers, then resume them". The required function is
"acquire a consistent set of live data root pointers within
the process".

Yes, there are a bounded set of operations required for GC
-- and that has nothing at all to do with suspend or
resume. If the argument for standardizing suspend and resume
is to revolve entirely around the needs of today's
semi-concurrent GC, then we should be designing an interface
to support what GC really needs, not standardizing an ugly
and dangerous overly-generalized scheduling function that
can be (mis-)used to implement one small part of what GC
needs.

> >> Since the implementation basically needs to be there for any platform
> >> that supports Java, why not standardize it?  Alternatively, if there is
> >> a really portable solution using signals, I would like to see it
> >> advertised.
> >
> >Any "why not" can be rephrased as a "why". Or a "so what".
>
> Oh come on.  By this argument, you could do away with all standards.
> The reason for standards is so that multiple implementations with
> the same interface can exist and be used without changing the
> clients of the interface.

And by the converse, perhaps we should standardize every
whim that comes into anyone's head? Baloney. We should
standardize interfaces that are "necessary and
sufficient". Determining exactly of what that consists is
not always easy -- but it's important because standards have
far- and long-reaching consequences.

> I apologize if this message has come across as a bit of a rant, but
> I am tired of people assuming that everyone who asks for suspend/resume
> must be an idiot who does not understand the available thread
> synchronization mechanisms.  There are legitimate uses for
> suspend/resume, of course with some performance tradeoffs.  By
> making the decision to not implement them in a thread library,
> you are taking away the ability of the clients of the library
> to decide on the tradeoff according to their need.  That makes
> the thread library less useful.

I guess we're even -- because I'm tired of hearing people
insist that because they want suspend/resume, it must be
universally accepted as "a cool thing" and forced down
everyone's throat. It's not a cool thing. And, by the way, I
have yet to hear of a single truly legitimizing use. The use
of suspend and resume by GC is an expedient hack. It isn't
really accomplishing what GC needs. It's far more
heavyweight than required, (as you pointed out, a GC system
suspends threads to get a consistent set of root pointers,
NOT because it wants to suspend the threads), and it doesn't
provide half the required capabilities (after all, the real
goal is to get the pointers -- the registers).

As for your final dig, I'm tempted to laugh. You know what
makes a thread library less useful? Providing unsupportable
functions that are nearly impossible to use safely and that
therefore result in significant support costs, preventing
the development team from doing work that would provide
useful features and fixing real problems.

/---------------------------[ Dave Butenhof ]--------------------------\

=================================TOP===============================
 Q132: Some details on using CMA threads on Digital UNIX  

[email protected] wrote:

> I'm trying to port code from an HP that used the cma threads package to
> a DEC Alpha with the posix package. I've found that some of the standard
> header files (DEC C++) have conflicting definitions (e.g., sys/types.h
> and pthreads_exc.h). Has anyone encountered this problem, and is there
> some simple conversion utility or a better library to use in such a port?

A number of questions present themselves immediately, including:

  1. What version of Digital UNIX are you using?
  2. Are you trying to compile with CMA, DCE thread (draft 4 POSIX), or true
     POSIX?
  3. What compiler/link options are you specifying?
  4. What headers do you include?
  5. What is actually happening?

A few comments:

  1. Digital UNIX provides both DCE thread (draft 4 POSIX, both "standard"
     and exception-raising) and CMA interfaces, as in any DCE
     implementation. To use these, compile with "cc -threads" or "cxx
     -threads", and link using the same. If you can't, compile with
     "-D_REENTRANT". Link depends on what version you're using --
     specifically 3.2(x) or 4.0(x). (And for link, watch out for _r
     libraries, e.g., libfoo_r.so or libfoo_r.a if linking static --
     "-threads" or "-pthread" will cause the linker to find and use them
     automatically; but if you roll your own you'll need to look for them
     yourself.)
  2. Digital UNIX 4.0 (and higher) also provides true POSIX threads, if
     you're converting. Compile and link using "cc -pthread" or "cxx
     -pthread". If you can't, compile with "-D_REENTRANT" and link with
     "-lpthread -lexc -lc" (at the end of your list of files). (And, again,
     watch out for _r suffix libraries.)
  3. You mentioned "pthreads_exc.h". Well, <pthread_exc.h> is a header used
     to define the exception-raising variant of the DCE thread (draft 4
     POSIX) API. This conflicts with the implication of your statement
     "with the posix package", since DCE threads are NOT the same as POSIX
     threads. You cannot use <pthread_exc.h> with POSIX threads.

/---------------------------[ Dave Butenhof ]--------------------------\


=================================TOP===============================
 Q133: When do you need to know which CPU a thread is on?  

[This is part of an ongoing series of unsolved problems where there
is a lot of "We don't quite know WHY this is happening, but..."  -Bil]

On Sun, 28 Dec 1997, Bil Lewis wrote:

> Jason,
> 
>   That sounds like a very interesting project.  I'm curious about your decision
> to bind threads to CPUs.  You SAY you need to do it, but you don't give any
> proof.  Did you test your system without binding to CPUs?  What kind of results
> did you get when you did?
> 

The threaded version of the system has not been constructed yet; however,
a non-threaded (i.e., forked) version has, and we have found significant
performance differences between allowing the processes to migrate
arbitrarily between processors and locking the processes to dedicated
processors.

So, from that experience, it stands to reason that locking threads to
processors would be preferable if we were to implement a fully threaded
version of the system.

>   I infer from what you say that this is a computationally intensive task.
> Which implies that the threads (or processes) would never migrate to different
> CPUs anyway.  DID they migrate?  I'd very much like to know your experience and
> the performance behavior.

Yes, the graphics processes are computationally intensive. It is a standard
technique on multiprocessor SGIs to lock rendering processes to processors.
If they are not locked, they will migrate.

The ability to lock threads to processors hasn't been fully implemented
by SGI yet. Currently, since threads are bound to their processes, when
the process migrates the thread gets carried along with it.
I'm guessing that pthreads on the SGIs are implemented on top of sproc,
which is a superset of the capabilities of pthreads. Since sprocs
can be locked to processors, I'm hoping that the SGI implementation
of pthreads will soon inherit that capability.

        =================================TOP=
Jason:
> Actually in the work we do (Virtual Reality) we crucially need to know not
> only which processor a thread is running on, but to be able to explicitly
> assign a thread to the processor.

Now I don't see any of that.

You have a set of threads that you want to execute in parallel on an SMP. That's
fine. Lots of people have the same need for all sorts of reasons. That, however,
does NOT mean that you need to know on which processor each thread is running,
much less be able to specify on which processor each thread runs. It just means
you need to be sure that the O/S supports parallel computation.

What you're saying is that you don't trust the O/S scheduling at all, and insist
on controlling it yourself. There are cases where that's valid -- but that's
quite different from saying that your application inherently requires processor
identification or control. It doesn't. In nearly every case requiring
concurrency/parallelism, you'll be best off trusting the O/S to schedule the
processor resources. And if you find that it's not always trustworthy, tell the
developers, and help them fix it! You, and everyone else, will end up with a
better system.

/---------------------------[ Dave Butenhof ]--------------------------\
=================================TOP=============
 Q134: Is any difference between default and static mutex initialization?  

Robert White wrote:

> Venkat Ganti wrote:
>
> > I want to know  whether there is any difference between the following
> > two mutex initializations using pthreads
> >
> > 1.
> >
> > pthread_mutex_t  mp = PTHREAD_MUTEX_INITIALIZER;
> >
> > 2.
> >
> > pthread_mutex_t mp;
> > pthread_mutex_init (&mp, NULL);
> >
> > In this case the allocated memory is zero.
> >
>
> Another way that these two may be different (in addition to the ones
> mentioned by Dave B. in his reply) is that the latter form can have a
> different meaning as the program progresses, because the default mutex
> behavior of a program can be changed with the set-attribute calls (I
> forget the exact call) when the attribute specified in the call is the
> NULL pointer.

You can't change the attribute values of the NULL attributes object. When
you initialize a mutex using NULL, you're asking for default attributes --
those MUST ALWAYS be the same attributes that will be used by a statically
initialized mutex. It doesn't (and can't) matter when the statically
initialized mutex is first used.

> If you use variant 2, you know that the semantics are those in-force at
> the time the statement is executed.  If you use variant 1, it will likely
> have the default semantics in force at the time the mutex is first used.

The only way this could be true is if an implementation provides some
non-portable and non-standard mechanism for modifying the default
attributes. You'd have a hard time convincing me that such an extension
could conform, since pthread_mutex_init specifically requires that the mutex
gain "default" attributes, and the standard requires that the default value
of any attributes (for which the standard doesn't specify a default) must be
specified in the conformance document.

> The manual, if I recall correctly, "strongly suggests" that variant 1
> be used only to initialize statically allocated mutexes.  I suspect
> that the above ambiguity is the reason.

Initializing a mutex on the stack is almost always bogus, and will usually
lead to far more trouble than you ever might have imagined. Doesn't matter
whether the mutex is statically initialized or dynamically initialized,
though, except (as always), a static initialization has no choice but to use
the default attributes.

You can't statically initialize a heap mutex, because the language doesn't
allow you to specify an initial value in that case.
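To make the point concrete, here is a small sketch (mine, not from the original
exchange) of the three cases: a static initializer, pthread_mutex_init with
NULL (which asks for the same default attributes), and pthread_mutex_init with
an explicit attributes object, which is the only way to get non-default
attributes:

#include <pthread.h>

/* Statically initialized: always gets the default attributes. */
static pthread_mutex_t static_lock = PTHREAD_MUTEX_INITIALIZER;

/* Dynamically initialized with NULL: the same default attributes. */
static pthread_mutex_t dynamic_lock;

/* Dynamically initialized with explicit attributes. */
static pthread_mutex_t shared_lock;

int init_locks(void)
{
    pthread_mutexattr_t attr;
    int rc;

    rc = pthread_mutex_init(&dynamic_lock, NULL);   /* default attributes */
    if (rc != 0)
        return rc;

    rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    /* For example, a process-shared mutex (on implementations that support
     * _POSIX_THREAD_PROCESS_SHARED) -- something no static initializer can
     * express. */
    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0)
        rc = pthread_mutex_init(&shared_lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}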

/---------------------------[ Dave Butenhof ]--------------------------\

=================================TOP=============
 Q135: Is there a timer for Multithreaded Programs?  


From: [email protected] (Richard Sullivan)
Subject: Re: Timing Multithreaded Programs (Solaris)

[email protected] (Bradley J. Marker) wrote:

>I'm trying to time my multithreaded programs on Solaris with multiple 
>processors.  I want the real world running time as opposed to the total 
>execution time of the program, because I want to measure speedup versus 
>sequential algorithms and how much faster the parallel program is for the user.

Bradley,

  Here is what I wrote to solve this problem (for Solaris anyway).  To
use it just call iobench_start() after any setup that you don't want
to measure.  When you are done measuring call iobench_end().  When you
want to see the statistics call iobench_report().  The output to
stderr will look like this:

Process info:
  elapsed time  249.995
  CPU time      164.446
  user time     152.095
  system time   12.3507
  trap time     0.661235
  wait time     68.6506
  pfs    major/minor    3379/     0
  blocks input/output      0/     0
 
65.8% CPU usage
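For concreteness, a minimal driver for the routines described above might look
like the following (the spin() workload is invented purely as something to
measure):

#include <stdio.h>
#include "iobench.h"

/* Hypothetical workload, just so there is something to time. */
static double spin(void)
{
    double x = 0.0;
    long i;
    for (i = 0; i < 10000000L; i++)
        x += (double) i;
    return x;
}

int main(void)
{
    /* ... any setup you don't want measured goes here ... */
    iobench_start();
    printf("result = %g\n", spin());    /* the region being timed */
    iobench_end();
    iobench_report();                   /* prints the statistics to stderr */
    return 0;
}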

>>>>>>>>>>>>>>>>>>>>> iobench.h

/*-----------------------------------------------------------------------------
 *
 * Library Name: UTIL
 * Module Name:  iobench
 *
 * Designer:    R. C. Sullivan 
 * Programmer:  R. C. Sullivan
 * Date:        Sep 22, 1995
 *
 * History Of Changes:
 *      Name         Date            Description
 *      ----         ----            -----------
 *      RCS     Jan 17, 1996     Initial release
 *
 * Purpose:
 *   To report resource usage statistics that will be correct for
 * programs using threads on a Solaris system.
 *
 * Notes:
 *
 *-----------------------------------------------------------------------------
 */

extern struct prusage prusagebuf_start, prusagebuf_end;
extern int procfd;
extern double real_time, user_time, system_time, trap_time, wait_time;
extern unsigned long minor_pfs, major_pfs, input_blocks, output_blocks, iochars;

void iobench_start();
void iobench_end();
void iobench_report();

>>>>>>>>>>>>>>>>>>>>> iobench.c

/*-----------------------------------------------------------------------------
 *
 * Library Name: UTIL
 * Module Name:  iobench
 *
 * Designer:    R. C. Sullivan
 * Programmer:  R. C. Sullivan
 * Date:        Sep 22, 1995
 *
 * History Of Changes:
 *      Name         Date            Description
 *      ----         ----            -----------
 *      RCS     Jan 17, 1996     Initial release
 *
 * Purpose:
 *   To report resource usage statistics that will be correct for
 * programs using threads on a Solaris system.
 *
 * Notes:
 *
 *-----------------------------------------------------------------------------
 */

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/procfs.h>

#include "iobench.h"

struct stat statbuf;
struct prusage prusagebuf_start, prusagebuf_end;
int procfd;
double real_time, total_real_time, user_time, system_time, trap_time, wait_time;
unsigned long minor_pfs, major_pfs, input_blocks, output_blocks, iochars;

void iobench_start() {
  char pfile[80];

  sprintf(pfile, "/proc/%ld", getpid());
  procfd = open(pfile, O_RDONLY);

  ioctl(procfd, PIOCUSAGE, &prusagebuf_start);
}

void iobench_end() {
  ioctl(procfd, PIOCUSAGE, &prusagebuf_end);
  close(procfd);

  real_time = (double) prusagebuf_start.pr_tstamp.tv_sec +
        (double) prusagebuf_start.pr_tstamp.tv_nsec / NANOSEC;
  real_time = (double) prusagebuf_end.pr_tstamp.tv_sec +
         (double) prusagebuf_end.pr_tstamp.tv_nsec / NANOSEC - real_time;

  total_real_time = (double) prusagebuf_start.pr_rtime.tv_sec +
         (double) prusagebuf_start.pr_rtime.tv_nsec / NANOSEC;
  total_real_time = (double) prusagebuf_end.pr_rtime.tv_sec +
         (double) prusagebuf_end.pr_rtime.tv_nsec / NANOSEC - total_real_time;

  user_time = (double) prusagebuf_start.pr_utime.tv_sec +
         (double) prusagebuf_start.pr_utime.tv_nsec / NANOSEC;
  user_time = (double) prusagebuf_end.pr_utime.tv_sec +
         (double) prusagebuf_end.pr_utime.tv_nsec / NANOSEC - user_time;

  system_time = (double) prusagebuf_start.pr_stime.tv_sec +
         (double) prusagebuf_start.pr_stime.tv_nsec / NANOSEC;
  system_time = (double) prusagebuf_end.pr_stime.tv_sec +
         (double) prusagebuf_end.pr_stime.tv_nsec / NANOSEC - system_time;

  trap_time = (double) prusagebuf_start.pr_ttime.tv_sec +
         (double) prusagebuf_start.pr_ttime.tv_nsec / NANOSEC;
  trap_time = (double) prusagebuf_end.pr_ttime.tv_sec +
         (double) prusagebuf_end.pr_ttime.tv_nsec / NANOSEC - trap_time;

  wait_time = (double) prusagebuf_start.pr_wtime.tv_sec +
         (double) prusagebuf_start.pr_wtime.tv_nsec / NANOSEC;
  wait_time = (double) prusagebuf_end.pr_wtime.tv_sec +
         (double) prusagebuf_end.pr_wtime.tv_nsec / NANOSEC - wait_time;

  minor_pfs = prusagebuf_end.pr_minf - prusagebuf_start.pr_minf;
  major_pfs = prusagebuf_end.pr_majf - prusagebuf_start.pr_majf;
  input_blocks = prusagebuf_end.pr_inblk - prusagebuf_start.pr_inblk;
  output_blocks = prusagebuf_end.pr_oublk - prusagebuf_start.pr_oublk;
/*  iochars = prusagebuf_end.pr_ioch - prusagebuf_start.pr_ioch;*/
}

void iobench_report() {
  fprintf(stderr, "Process info:\n");
  fprintf(stderr, "  elapsed time  %g\n", real_time);
/*  fprintf(stderr, "  total time    %g\n", total_real_time);*/
  fprintf(stderr, "  CPU time      %g\n", user_time + system_time);
  fprintf(stderr, "  user time     %g\n", user_time);
  fprintf(stderr, "  system time   %g\n", system_time);
  fprintf(stderr, "  trap time     %g\n", trap_time);
  fprintf(stderr, "  wait time     %g\n",  wait_time);
  fprintf(stderr, "  pfs    major/minor  %6lu/%6lu\n", major_pfs, minor_pfs);
  fprintf(stderr, "  blocks input/output %6lu/%6lu\n", input_blocks, output_blocks);
/*  fprintf(stderr, "  char inp/out  %lu\n", iochars);*/
  fprintf(stderr, "\n");

/*  fprintf(stderr, "%2.5g Mbytes/sec (real time)\n", iochars /
         real_time / 1e6);
  fprintf(stderr, "%2.5g Mbytes/sec (CPU time) \n", iochars /
         (user_time + system_time) / 1e6);*/

  fprintf(stderr, "%2.1f%% CPU usage\n", 100 * (user_time + system_time) /
         real_time + .05);
}

=================================TOP=============
 Q136: Roll-your-own Semaphores   

[For systems that don't support the realtime extensions (where POSIX
semaphores are defined -- they're NOT in Pthreads).]


In article , 
[email protected] says...
> [[ PLEASE DON'T SEND ME EMAIL COPIES OF POSTINGS ]]
> 
> [email protected] (Bob Withers) writes:
> 
> >Thanks much for this info.  Unfortunately I need the semaphores for 
> >inter-process mutual exclusion which makes sem_open important.  I'll just 
> >have to stick with SysV semaphores until we can move to 2.6.
> 
> 
> Well, you can mmap a semaphore in a file if you wish.
> 

Well you sure can and, believe it or not, I actually thought of it before 
I read your post.  My code has not been thoroughly tested but I'm posting 
it here in the hopes that it will be of help to someone else.  Either 
that or I'm just a glutton for criticism.  :-)

Casper, thanks much for your help.

Bob


#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <semaphore.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>


sem_t *sem_open(const char *name, int oflag, ...)
{
    auto     int                need_init = 0;
    auto     int                val = 1;
    auto     int                fd;
    auto     sem_t *            sem = (sem_t *) -1;
    auto     struct stat        st;

    /* -----------------2/11/98 2:12PM-------------------
     * open the memory mapped file backing the shared
     * semaphore to see if it exists.
     * --------------------------------------------------*/
    fd = open(name, O_RDWR);
    if (fd >= 0)
    {
        /* -----------------2/11/98 2:13PM-------------------
         * the semaphore already exists, it the caller
         * specified O_CREAT and O_EXCL we need to return
         * an error to advise them of this fact.
         * --------------------------------------------------*/
        if ((oflag & O_CREAT) && (oflag & O_EXCL))
        {
            close(fd);
            errno = EEXIST;
            return(sem);
        }
    }
    else
    {
        auto     int                sem_mode;
        auto     va_list            ap;

        /* -----------------2/11/98 2:14PM-------------------
         * if we get here the semaphore doesn't exist.  if
         * the caller did not request that it be created then
         * we need to return an error.  note that errno has
         * already been set appropriately by open().
         * --------------------------------------------------*/
        if (0 == (oflag & O_CREAT))
            return(sem);

        /* -----------------2/11/98 2:15PM-------------------
         * ok, we're going to create a new semaphore.  the
         * caller should've passed mode and initial value
         * arguments so we need to acquire that data.
         * --------------------------------------------------*/
        va_start(ap, oflag);
        sem_mode = va_arg(ap, int);
        val = va_arg(ap, int);
        va_end(ap);

        /* -----------------2/11/98 2:16PM-------------------
         * create the semaphore memory mapped file.  if this
         * call returns an EEXIST error it means that another
         * process/thread snuck in and created the semaphore
         * since we discovered it doesn't exist above.  we
         * don't handle this condition but rather return an
         * error.
         * --------------------------------------------------*/
        fd = open(name, O_RDWR | O_CREAT | O_EXCL, sem_mode);
        if (fd < 0)
            return(sem);

        /* -----------------2/11/98 2:18PM-------------------
         * set flag to remember that we need to init the
         * semaphore and set the memory mapped file size.
         * --------------------------------------------------*/
        need_init = 1;
        if (ftruncate(fd, sizeof(sem_t)))
        {
            close(fd);
            return(sem);
        }
    }

    /* -----------------2/11/98 2:19PM-------------------
     * map the semaphore file into shared memory.
     * --------------------------------------------------*/
    sem = (sem_t *) mmap(0, sizeof(sem_t), PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    close(fd);
    if (sem != (sem_t *) MAP_FAILED)  /* mmap() returns MAP_FAILED, not NULL, on failure */
    {
        /* -----------------2/11/98 2:19PM-------------------
         * if the mapping worked and we need to init the
         * semaphore, do it now.
         * --------------------------------------------------*/
        if (need_init && sem_init(sem, 1, val))
        {
            munmap((caddr_t) sem, sizeof(sem_t));
            sem = 0;
        }
    }
    else
    {
        sem = (sem_t *) -1;
    }

    return(sem);
}


int sem_close(sem_t *sem)
{
    return(munmap((caddr_t) sem, sizeof(sem_t)));
}


int sem_unlink(const char *name)
{
    return(remove(name));
}
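A small usage sketch for the replacement functions above (the semaphore name is
arbitrary, and the error check follows the posted code's convention of
returning (sem_t *) -1 on failure; sem_wait()/sem_post() from the realtime
library are assumed to be available, since only sem_open() was missing in the
original discussion):

#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <sys/stat.h>

int use_shared_sem(void)
{
    /* Create (or attach to) a binary semaphore backed by a file. */
    sem_t *sem = sem_open("/tmp/mysem", O_CREAT, S_IRUSR | S_IWUSR, 1);
    if (sem == (sem_t *) -1) {
        perror("sem_open");
        return -1;
    }

    sem_wait(sem);              /* enter the inter-process critical section */
    /* ... work protected across processes ... */
    sem_post(sem);              /* leave it */

    return sem_close(sem);
}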

=================================TOP=============
 Q137: Solaris sockets don't like _POSIX_C_SOURCE!  

A little-known requirement in Solaris is that when you define _POSIX_C_SOURCE,
you must also define __EXTENSIONS__ when including sys/socket.h.  Hence,
your file should look like this:


#define _POSIX_C_SOURCE 199506L 
#define __EXTENSIONS__

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
...
        ================

That's because _POSIX_C_SOURCE of 1995 vintage doesn't include
socket calls.

The feature macros are *exclusion* macros, not *inclusion* macros.

By default, you will get everything.

When you define something, you get *only* that something.
(Unless you also define __EXTENSIONS__)

This is slightly different in cases where the behaviour is
modified by the macro as in some socket calls.

Casper
        ======================
From: [email protected] (David Robinson)

The gratuitous use of non-POSIX-conforming typedefs in headers is the
root cause. (They should use ushort_t, not u_short, and uint_t, not u_int.)
When you define _POSIX_C_SOURCE, it says to use only strictly POSIX-conforming
features; typedefs thus can only end in _t.

Good news is that 90+% of the offending headers are fixed in 2.7.

    -David
        ================
% A question... should I use -mt or -D_POSIX_C_SOURCE=199506L to compile
% a pthread program on Solaris 2.6?  If I use the latter even the most simple
% socket program won't compile.  For example,

Well, these do different things. -mt sets up the necessary macro definitions
for multi-threading, and links with the appropriate libraries. _POSIX_C_SOURCE
tells the compiler that your application is supposed to strictly conform
to the POSIX standard, and that the use of any non-POSIX functions or types
that might be available on the system should not be allowed.

The advantage of this is that, when you move to another system which provides
POSIX support, you are assured of your program compiling, however this
requires some work up-front on your part.

So the answer to your question is that you should use -mt to tell the
compiler your application is multi-threaded, and use _POSIX_C_SOURCE only
if your application is intended to conform strictly to POSIX.

HP's compiler is quite frustrating in this regard, since it assumes by
default that your application is K&R C. If you use the -Aa option to
tell it your application is ANSI C, it doesn't allow any functions which
aren't defined by ANSI. I always end up using -Ae to tell the compiler to
get stuffed and just compile my program with whatever's on the system, and
I port to HP last after a big change.

=================================TOP=============
 Q138: The Thread ID changes for my thread!  

    I'm using IRIX 6.4 threads and MPI-SGI, but I'm having
strange problems. To analyse and debug my program I began
to write some very simple programs with "similar behaviors",
and I noticed a strange thing. Can anybody tell me whether I'm
making a mistake or whether this is a problem with IRIX 6.4 systems?

    The problem is: WHEN I CHANGE THE THREAD PRIORITY,
THE THREAD ID IS ALSO CHANGED. As you can imagine, I have a
lot of problems when I try joining the threads.

    - If I use only thread calls (with the MPI calls commented out),
the program works fine even if I link with the MPI library.
The program changes the main thread's priority, and after that
it creates 10 threads with other priorities. Thread IDs
are sequential.

    - If I use threads and MPI calls (only MPI_Init,
Comm_size, Comm_rank and Finalize), the SAME program
changes the main thread's ID after the priority change.

    - Another thing: in the first case, on my execution,
thread IDs began with id=10000 and the others were sequential
after 10000. In the second case, the thread ID began with 10000,
and after the priority change it became id=30000.

    Can ANYBODY explain this to me? TIA.

THE CODE IS:


#include        <pthread.h>
#include        <sched.h>
#include        <stdio.h>
#include        <stdlib.h>

pthread_attr_t attr;
pthread_mutex_t mutex;
pthread_cond_t cond;
int xcond=0;
int size,rank;

void *Slaves(void *arg)
{
    int i,j;

    pthread_mutex_lock(&mutex);
    while(xcond==0)
       pthread_cond_wait(&cond, &mutex);

    pthread_mutex_unlock(&mutex);
    printf("Size: %d Rank: %d    Thread Id %x\n", size, rank,
pthread_self());
    fflush(stdout);
}

int main (int argc, char **argv)
{
    int i,k=10,r;
    pthread_t *ThreadId;  
    struct sched_param params;
    int sched;

    printf("THREAD MAIN BEFORE MPI INIT %x\n", pthread_self());
/*   
 *  This lines are commented to see if MPI calls influence over
 *  system behavior.
 
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("THREAD MAIN AFTER MPI INIT %x\n", pthread_self());

 */

/*
 * If I call MPI, the main thread ID will be changed after
 * these lines. When I just leave out the MPI initialisation, the main
 * thread has the same ID that it had before; the problem arises
 * from this point
 */

    params.sched_priority=20;
    pthread_setschedparam(pthread_self(), SCHED_RR, &params);

    printf("THREAD MAIN AFTER  PRIO CHG %x\n", pthread_self());


    
    if (argc==2)
       k=atoi(argv[1]);

    ThreadId= (pthread_t *) malloc(k*sizeof(pthread_t));

    pthread_attr_init(&attr);
    pthread_mutex_init(&mutex, NULL);
    pthread_cond_init(&cond, NULL);

    printf("\n Creating %d threads - Main thread is %x  \n", k,
pthread_self());

    for(i=0; i != k; i++) {
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED); 
        params.sched_priority=21;
        pthread_attr_setschedparam(&attr, &params);
        r=pthread_create(&ThreadId[i], &attr, Slaves, NULL);
        if (r!=0) {
           printf("Error on thread creation! \n");
           exit(0);
        }
    }

    xcond=1;
    pthread_cond_broadcast(&cond);

/*
 * It was to force threads execution, but this is not necessary
  
    for(;;) sched_yield(); 
 */

    for(i=0; i != k; i++) {
        r=pthread_join(ThreadId[i], NULL);
        if (r!=0) {
           printf("Error on joining threads...\n");
           exit(0);
         }
    }
    printf(" Thead MAIN with id %x terminating...\n", pthread_self());
/*
    MPI_Finalize();
*/
}


=================================TOP=============
 Q139:  Does X11 support multithreading ?  
> > I am developing a multithreaded app under Solaris 2.5.1 (UltraSPARC),
> > using mixed Motif and X11, and i wonder if  someone can help me
> > answering some question:
> >
> > Does X11 support multithreading ?
> 
>   Well, kinda.  But...

Kinda?

The X Consortium releases of R6.x can be built MT-safe.

You can tell if you have R6 when you compile by checking the
XlibSpecificationRelease or XtSpecificationRelease feature test macros.
If they are >5 then your implementation may support threaded
programming. Call XInitThreads and/or XtToolkitInitializeThreads to find
out if your system's Xlib and Toolkit Intrinsics (libXt) really do
support threaded programming.
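A minimal initialization sketch along those lines (mine, not from the original
thread):

#include <stdio.h>
#include <X11/Xlib.h>

/* Call XInitThreads before any other Xlib call if the display will be
 * used from more than one thread. */
int open_display_for_threads(Display **dpy_out)
{
    if (!XInitThreads()) {
        fprintf(stderr, "this Xlib was not built with thread support\n");
        return -1;
    }
    *dpy_out = XOpenDisplay(NULL);      /* the default display */
    if (*dpy_out == NULL) {
        fprintf(stderr, "cannot open display\n");
        return -1;
    }
    return 0;
}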

> 
> > Does Motif do the same ?
> 
>   No.  It's not thread-safe.

Motif 2.1 IS MT-safe.

> 
> > Can different threads open their own window and listen to their own
> > XEvents ? How could they do that ? [XNextEvent() can't specify a window
> > handle !].
> 
>   You don't.  Your main loop listens for any event, and then decides what
> to do with it.  Perhaps it hands off a task to another thread.

You can. Each thread could open a separate Display connection and do
precisely what Daniele asks. Even without separate Display connections,
the first thread to call XNextEvent will lock the Display, and the
second thread's call to XNextEvent will block until the first thread
releases its lock. But you can't guarantee which thread will get a
particular event, so in the trivial case you can't be assured that one
thread will process events solely for one window.

> 
>   Take a look at the FAQ for the threads newsgroup (on the page below).  That
> will help a bit.  You may also want to get "Multithreaded Programming with Pthreads"
> which has a section on exactly this, along with some example code.  (No one else
> talks about this, but I thought it important.)

I recommend reading the Xlib and Xt specifications, which are contained
in each and every X Consortium release -- available at
ftp://ftp.x.org/pub/, or you can get ready-to-print PostScript of just
the Xlib and Xt specs from ftp://ftp.x.org/pub/R6.3/xc/doc/hardcopy.
=================================TOP=============
 Q140: Solaris 2 bizarre behavior with usleep() and poll()  
>Jeff Denham wrote:
>> 
>> Adam Twiss wrote:
>> 
>> > You really really don't want to use usleep() in a threaded environment.
>> > On some platforms it is thread safe, but on Solaris it isn't.  The
>> > effects are best described as "unpredictable", but I've seen a usleep()
>> > call segv because it was in a threaded program on Solaris.
>> >
>> > You want to use nanosleep() instead.
>> >
>> > Adam
>> 
>> I've found that poll(0, 0, msec-timeout-value)
>> works pretty well. Is there significant overhead calling
>> poll in this manner?
>
>It's not uncommon to use poll() or select() for sleeping. Works
>great.
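A small sketch of both approaches: poll() as suggested above, and nanosleep()
as mentioned earlier in the thread (the function names are mine):

#include <poll.h>
#include <time.h>

/* Sleep for roughly msec milliseconds using poll(). */
static void msec_sleep_poll(int msec)
{
    (void) poll(NULL, 0, msec);
}

/* The same thing with nanosleep(), the POSIX.1b interface. */
static void msec_sleep_nanosleep(long msec)
{
    struct timespec ts;
    ts.tv_sec = msec / 1000;
    ts.tv_nsec = (msec % 1000) * 1000000L;
    (void) nanosleep(&ts, NULL);
}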

I've seen an occasional bug in Solaris 2.6 where poll() will fail to
restore a pre-existing SIGALRM handler when it returns.  The sequence
is:

    sigaction(SIGALRM,...);
    alarm();
    ...
    poll(0, 0, timeout);
    ...
    (program exits with "Alarm clock" error)

Looking at the truss output, poll() appears to be the only place
after the initial sigaction where the handler for SIGALRM is changed.
The failure is difficult to reliably reproduce, but I've seen it
happen about 10% of the time.  It only happens on 2.6; this same code
works fine on Solaris 2.5.1, AIX 3.2.5 and 4.2.1, HP-UX 9.04 and
10.10, and SCO OpenServer 5.0.4.

The same thing happens with usleep.  I haven't tried nanosleep.

The program in question is single-threaded.

I haven't had the chance to pursue this problem yet; there may be a
fix for it, or it may be some really subtle application bug.

Michael Wojcik                      [email protected]
AAI Development, Micro Focus Inc.
Department of English, Miami University

Q: What is the derivation and meaning of the name Erwin?
A: It is English from the Anglo-Saxon and means Tariff Act of 1909.
-- Columbus (Ohio) Citizen
=================================TOP=============
 Q141: Why is POSIX.1c different w.r.t. errno usage?  

Bryan O'Sullivan wrote:

> d> It's an issue because that implementation is "klunky" and, more
> d> precisely, inefficient.
>
> I must admit that optimising for uncommon error cases does not make
> much sense to me.

Sure. In my sentence, I would have to say that "klunky" was a more
important consideration than "inefficient".

However, use of errno is NOT strictly in "uncommon error cases". For
example, pthread_mutex_trylock returns EBUSY when the mutex is locked.
That's a normal informational status, not an "uncommon error". Similarly,
pthread_cond_timedwait returns ETIMEDOUT as a normal informational
status, not really an "error". There are plenty of "traditional" UNIX
functions that are similar. It's certainly not universal, but "uncommon"
is an overstatement.
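A short sketch of the point about EBUSY as a normal status rather than an
error (the function and lock are invented for illustration):

#include <errno.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 1 if the work was done, 0 if the lock was busy, -1 on a real
 * error.  Note that EBUSY comes back as the function's return value,
 * not through errno. */
int try_do_work(void)
{
    int rc = pthread_mutex_trylock(&lock);
    if (rc == EBUSY)
        return 0;               /* normal status: someone else holds it */
    if (rc != 0)
        return -1;              /* a genuine error, e.g. EINVAL */

    /* ... do the protected work ... */

    pthread_mutex_unlock(&lock);
    return 1;
}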

> d> Still, why propagate the arcane traditions, just because they
> d> exist?
>
> Because they are traditions.  I think there is some non-trivial value
> in preserving interface consistency - principle of least surprise, and
> all that - and 1003.1c violates this for no particularly good reason.

Let's just say that the working group (a widely diverse and contentious
bunch) and the balloting group (an even larger and more diverse group)
were convinced that the reasons were "good enough". Arguing about it at
this point serves no purpose.

> d> Overloading return values with "-1 means look somewhere else for an
> d> error" is silly.
>
> Sure, it's silly, but it's the standard way for library calls on Unix
> systems to barf, and what 1003.1c does is force programmers to plant
> yet another gratuitous red flag in their brains, with wording similar
> to "hey!  everything else works in such-and-such a way, but *this* is
> *different*!".  I have enough red flags planted in my brain at this
> point that it resembles a pincushion, and I would gladly sacrifice a
> few to ugliness if that ugliness were at least consistent.

UNIX isn't even very consistent about that. Some return -1. Some have
symbolic definitions that "are usually" -1 but needn't be (at least in
terms of guaranteed portability and conformance). Some return NULL. Some
set errno and some don't, requiring that YOU set errno before making the
call if you care WHY it failed (e.g., sysconf).

Hey, even if there was a "C" in "UNIX", it would most definitely NOT
stand for "consistency". Adding threads to UNIX disturbed a lot of
cherished traditions... far more than most people are willing to
acknowledge until they stumble over the shards of the old landscape.
There was good reason for this, though, and the net result is of
substantial benefit to everyone. While the changes to errno may be one of
the first differences people notice, "in the scheme of things", it's
trivial. If it even raises your awareness that "something's different",
maybe it'll save a few people from some bad mistakes, and to that extent
it's valuable even merely as a psychological tool.

Hey... count your mixed blessings, Bryan. I would have reported errors by
raising exceptions, if there'd been any hope at all of getting that into
POSIX. ;-)

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/

=================================TOP=============
 Q142: printf() anywhere AFTER pthread_create() crashes on HPUX 10.x  

>I've spent the last couple of days trying to track down an
>annoying problem when running a multithreaded program on
>HPUX built against DCE Threads...
>
>If I call printf anywhere AFTER pthread_create had been executed to start
>a thread, when my application ends I get thrown out of my rlogin shell.

We experienced a similar problem about a year ago.  It was a csh bug,
and a patch from HP fixed it.

=================================TOP=============
 Q143: Pthreads and Linux  

Wolfram Gloger wrote:


> You should always put `-lpthread' _after_ your source files:
>
> % gcc the_file.cpp -lpthread
>
> Antoni Gonzalez Ciria  writes:
>
> > When I compile this code with gcc ( gcc -lpthread the_file.cpp) the 
> > program executes fine, but doing so with g++( g++ -lpthread the_file.cpp)
> > the program crashes, giving a Segmentation fault error.
> >
> > #include <stdio.h>
> >
> > void main(){
> >     FILE * the_file;
> >     char sBuffer[32];
> >
> >     the_file=fopen("/tmp/dummy","rb");
> >     fread( sBuffer, 12, 1, the_file);
> >     fclose( the_file);
> >
> > }
>
> Using `g++' as the compiler driver always includes the `libg++'
> library implicitly in the link.  libg++ has a lot of problems, and is
> no longer maintained (in particular, it has a global constructor
> interfering with glibc2, if I remember correctly).  If you really need
> it, you must get a specially adapted version for Linux/egcs/gcc-2.8.
>
> If you don't need libg++, please use `c++' or `gcc' as your compiler
> driver, and use the libstdc++ library shipped with egcs or separately
> with gcc-2.8 (`c++' will link libstdc++ in implicitly).
>
> When I just tested your program with egcs-1.0.1 and glibc-2.0.6, it
> crashed at first (it fails to check the fopen() result for being
> NULL), but after creating a /tmp/dummy file it ran perfectly, even
> after compiling it with `c++ the_file.cpp -lpthread'.
>
> Regards,
> Wolfram.

  The -pthread option takes care of everything:
     adds  -D_REENTRANT  during the cpp pass, and
     adds  -lpthread during the link-edit.
  This option has been around for a while. I'm not sure
   it's working for all ports. At least for the x86 AND glibc.
   You may want to take a look at the spec file (gcc -v)

jms.
=================================TOP=============
 Q144: DEC release/patch numbering  

It was after 4.0B, and 4.0C is just 4.0B with new hardware support. (If you
install a "4.0C" kit on any hardware that doesn't need the new support, it
will even announce itself as 4.0B.) Although this is not true of all
components, DECthreads policy has been to keep 4.0 through 4.0C identical --
we have always submitted any patches to the 4.0 patch stream, and propagated
the changes through the other patch streams, releasing "functionally
identical" patches that are simply built in the appropriate stream's
environment. (But note that all future patches for the 4.0 - 4.0C stream
will be available only on 4.0A and later... 4.0 is no longer supported.)

The changes are in source in 4.0D and later.

/---------------------------[ Dave Butenhof ]--------------------------\
=================================TOP=============
 Q145: Pthreads (almost) on AS/400  
Fred A. Kulack wrote:

> Hi All.
> You may or may not know, that new in the v4r2m0 release of OS/400 is support
> for kernel threads. Most notably the support for threads is available via
> native Java, but we've also implemented a Pthreads library that is 'based on'
> Posix and Unix98
> Pthreads.

Thank you very much for this update. I'm not likely to ever have occasion or
need to use this, but I like to keep track of who's implemented threads (and
how, and which threads). If nothing else, it provides me with more information
to answer the many questions I get on threading.

> The implementation claims no compliance because there are some differences
> and we haven't implemented all of the APIs. We do however duplicate the
> specification for the APIs that are provided, and we have quite a full set
> of APIs.

Yeah, I understand that one. Same deal on OpenVMS. Most of the APIs are there,
and do more or less what you'd expect -- but VMS isn't POSIX, and some parts
just don't fit. Congratulations on "doing your best". (At least, since you say
"we", I'm making the liberal assumption that you bear some personal
responsibility for this. ;-) )

> Anyone whose interested, can take a look at
> http://www.as400.ibm.com/developer/threads

=================================TOP=============
 Q146: Can pthreads & UI threads interoperate in one application? 

>Can Solaris pthread/UI thread (pthread_xxxx() versus thr_xxx())
>interoperate in one application ?   Is solaris pthread implemented
>as user level threads ?  I've read the JNI book, which says the thread
>model used in the native code must be interoperable with the JVM thread
>model used.  An example the book gives is that if the JVM is using user
>level threads (Java green threads) and the native code is using Solaris
>native threads, then there will be a problem interoperating.  Does this
>apply to pthread & UI thread interoperability on Solaris, if pthread
>is a kind of user level thread ?
>
>Also when people say Solaris native thread, does it mean the UI thread
>(thr_xxx() calls) only or does it also include Solaris pthread ?

Yes. They are built on the same underlying library. Indeed, many
of the libraries you use everyday are built using UI threads and
they get linked into Pthreads programs all the time.

"Implemented at user level" isn't quite the right way of describing
it. "Does the library use LWPs?" is the real question. Green threads
don't, so you can't make JNI calls to pthreads or UI threads. Native
threads do, and you can.

When folks say "Solaris native threads" they mean either pthreads or
UI threads, NOT green threads.

For a more detailed discussion, see my *excellent* book on Java
Threads: "Multithreaded Programming with Java".

-Bil
=================================TOP===============================

 Q147: Thread create timings  

Matthew Houseman  writes:

Thought I'd throw this into the pyre. :)  I ran the thread/process create
stuff on a 166MHz Pentium (no pro, no mmx) under NT4 and Solaris x86 2.6:


NT spawn                240s    24.0  ms/spawn
Solaris spawn (fork)    123s    12.3  ms/spawn  (incl. exec)
Solaris spawn (vfork)    95s     9.5  ms/spawn  (incl. exec)

Solaris fork             47s     4.7  ms/fork
Solaris vfork                    0.37 ms/vfork  (37s/100000)

NT thread create         12s     1.2  ms/create
Solaris thread create            0.11 ms/create (11s/100000)


As you can see, I tried both fork() and vfork(). When doing an immediate
exec(), you'd normally use vfork(); when just forking, fork() is usually
what you want to use (or have to use).

Note that I had to turn the number of creates up to 100000 for vfork
and thread create to get better precision in the timings.


To remind you, here are greg's figures (on a Pentium MMX 200MHz):

>NT Spawner (spawnl):            120 Seconds (12.0 millisecond/spawn)
>Linux Spawner (fork+exec):       57 Seconds ( 6.0 millisecond/spawn)
>
>Linux Process Create (fork):     10 Seconds ( 1.0 millisecond/proc)
>
>NT Thread Create                  9 Seconds ( 0.9 millisecond/thread)
>Linux Thread Create               3 Seconds ( 0.3 millisecond/thread)


Just for fun, I tried the same thing on a 2 CPU 170MHz Ultrasparc.
I leave it to someone else to figure out how much of this is due to
the two CPUs... :)

Solaris spawn (fork)            84s     8.4  ms/spawn  (incl. exec)
Solaris spawn (vfork)           69s     6.9  ms/spawn  (incl. exec)

Solaris fork                    21s     2.1  ms/fork
Solaris vfork                           0.17 ms/vfork  (17s/100000)

Solaris thread create                   0.06 ms/create (6s/100000)
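The timing harness itself isn't shown in the post; a rough sketch of the kind
of loop that produces numbers like these might be (note that this one joins
each thread immediately, so it measures create+join rather than create alone):

#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

#define NCREATES 100000

static void *empty(void *arg) { return arg; }

int main(void)
{
    struct timeval t0, t1;
    int i;

    gettimeofday(&t0, NULL);
    for (i = 0; i < NCREATES; i++) {
        pthread_t tid;
        if (pthread_create(&tid, NULL, empty, NULL) != 0) {
            fprintf(stderr, "pthread_create failed\n");
            return 1;
        }
        pthread_join(tid, NULL);        /* don't accumulate live threads */
    }
    gettimeofday(&t1, NULL);

    printf("%.3f ms/create\n",
           ((t1.tv_sec - t0.tv_sec) * 1e3 +
            (t1.tv_usec - t0.tv_usec) / 1e3) / NCREATES);
    return 0;
}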


=================================TOP=============
 Q148: Timing Multithreaded Programs (Solaris)  

From: [email protected] (Richard Sullivan)

>I'm trying to time my multithreaded programs on Solaris with multiple 
>processors.  I want the real world running time as opposed to the total 
>execution time of the program, because I want to measure speedup versus 
>sequential algorithms and how much faster the parallel program is for the user.

Bradley,

  Here is what I wrote to solve this problem (for Solaris anyway).  To
use it just call iobench_start() after any setup that you don't want
to measure.  When you are done measuring call iobench_end().  When you
want to see the statistics call iobench_report().  The output to
stderr will look like this:

Process info:
  elapsed time  249.995
  CPU time      164.446
  user time     152.095
  system time   12.3507
  trap time     0.661235
  wait time     68.6506
  pfs    major/minor    3379/     0
  blocks input/output      0/     0
 
65.8% CPU usage

The iobench code is included in the program sources on: index.html.
=================================TOP=============
 Q149: A program which monitors CPU usage?  

> >Ok, I've tried some web searches and haven't found anything I like the
> >look of.  What I'm after is a program which runs in the background and
> >monitors (primarily) CPU usage for our web server (an Ultra-1 running
> >Solaris 2.6).  However, all the programs I've found are about 2 years
> >old and/or don't run on 2.6.
> >
> >I've seen top, but it doesn't really do what I want; I'd like to have
> >the output from the program as a %cpu usage for each hour (or some
> >other arbitrary time period) stored as a log file or, ideally, as a
> >graph (in some graphics format, eg, .gif).
> 
> Sounds like what sar does, and it comes with 2.6 - to enable recording
> data for it, just uncomment the lines in /etc/init.d/perf and the
> crontab for the 'sys' account.

From what I've read on the product, sounds like 'spong' might be what
you need. I've downloaded it, but haven't had time to install and set up
yet. Try:

http://strobe.weeg.uiowa.edu/~edhill/public/spong/
=================================TOP=============
 Q150: standard library functions: whats safe and whats not?  

From: [email protected] (W. Richard Stevens)
Subject: Re: standard library functions: whats safe and whats not?
Date: 17 Feb 1998 14:19:28 GMT

> 1.  Which of the standard C library functions are thread-safe and
> which aren't?  For example, I know that strtok() is un-safe, I can
> infer that from its functionality, but what about the thousands of
> other library calls? I don't want to examine each one individually
> and make guesses about thread safety.
>
> Is there a list somewhere of what's safe and whats not?

Page 32 of the 1996 Posix.1 standard says "All functions defined by
Posix.1 and the C standard shall be thread-safe, except that the following
functions need not be thread-safe:

    asctime()
    ctime()
    getc_unlocked()*
    getchar_unlocked()*
    getgrgid()
    getgrnam()
    getlogin()
    getpwnam()
    getpwuid()
    gmtime()
    localtime()
    putc_unlocked()*
    putchar_unlocked()*
    rand()
    readdir()
    strtok()
    ttyname()"

Note that thread-safe XXX_r() versions of the above are available,
other than for those with an asterisk.  Also note that ctermid() and
tmpnam() are only thread-safe if a nonnull pointer is used as an
argument.
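As an example of the _r convention, here is strtok() replaced by strtok_r(),
which keeps its parsing state in a caller-supplied pointer instead of hidden
static data (the field separator and printing are just for illustration):

#include <stdio.h>
#include <string.h>

/* strtok() keeps hidden static state, so two threads parsing at once will
 * corrupt each other.  strtok_r() lets each thread parse independently. */
void print_fields(char *line)
{
    char *save;
    char *tok;

    for (tok = strtok_r(line, ":", &save);
         tok != NULL;
         tok = strtok_r(NULL, ":", &save)) {
        printf("field: %s\n", tok);
    }
}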

    Rich Stevens


        ================

POSIX and ANSI C specify only a small part of the "traditional UNIX
programming environment", though it's a start. The real danger in reading the
POSIX list quoted by Rich is that most people don't really know what's
included. While an inclusive list would be better than an exclusive list,
that'd be awfully long and awkward.

The Open Group (OSF and X/Open) has extended the Single UNIX Specification
(also known as "SPEC1170" for it's 1,170 UNIX interfaces, or UNIX95) to
include POSIX.1b-1993 realtime, POSIX.1c-1995 threads, and various
extensions. It's called the Single UNIX Specification, Version 2; or UNIX98.
Within this calendar year, it's safe to assume that most UNIX versions
currently branded by The Open Group (as XPG3, UNIX93, UNIX95) will extend
their brand validation to UNIX98.

The interfaces specification part of the Single UNIX Specification, Version 2
(known as XSH5), in section 2.8.2, "Thread-safety", specifies that all
interfaces defined by THIS specification will be thread-safe, except for "the
following". There are two explicit lists, and one implicit. One is the POSIX
list already quoted by Rich Stevens. The second is an additional list of
X/Open interfaces:

basename      dbm_open    fcvt        getutxline    pututxline
catgets       dbm_store   gamma       getw          setgrent
dbm_clearerr  dirname     gcvt        l64a          setkey
dbm_close     drand48     getdate     lgamma        setpwent
dbm_delete    ecvt        getenv      lrand48       setutxent
dbm_error     encrypt     getgrent    mrand48       strerror
dbm_fetch     endgrent    getpwent    nl_langinfo
dbm_firstkey  endpwent    getutxent   ptsname
dbm_nextkey   endutxent   getutxid    putenv

The implicit list is a statement that all interfaces in the "Legacy" feature
group need not be thread-safe. From another section, that list is:

advance       gamma          putw        sbrk          wait3
brk           getdtablesize  re_comp     sigstack
chroot        getpagesize    re_exec     step
compile       getpass        regcmp      ttyslot
cuserid       getw           regex       valloc
        
loc1          __loc1         loc2        locs

Obviously, this is still an exclusive list rather than inclusive. But then,
if UNIX95 had 1,170 interfaces, and UNIX98 is bigger, an inclusive list would
be rather awkward. (And don't expect ME to type it into the newsgroup!)

On the other hand... beware that if you've got a system that doesn't claim
conformance to POSIX 1003.1c-1995 (or POSIX 1003.1-1996, which includes it),
then you're not guaranteed to be able to rely even on the POSIX list, much
less the X/Open list. It's reasonable to assume that any implementation's
libpthread (or equivalent, though that name has become pretty much defacto
standard) is thread-safe. And it's probably reasonable to assume, unless
specified otherwise, that "the most common" bits of libc are thread-safe. But
without a formal statement of POSIX conformance, you're just dealing with
"good will". And, even at that, POSIX conformance isn't validated -- so
without validation by the UNIX98 branding test suite, you've got no real
guarantee of anything.

/---------------------------[ Dave Butenhof ]--------------------------\
=================================TOP=============
 Q151: Where are semaphores in POSIX threads?  

David McCann wrote:

> Jan Pechanec wrote:
> >
> > Hello,
> >
> > >         I have a summary of POSIX papers on threads, but there is no
> > > information about semaphores (just condition vars, mutexes). *NO*
> > pthread_semaphoreinit() etc.
> >
> >         In some materials, there is information on sem_wait(), sem_send() (or
> > sm. like that), but is it for threads (or just processes)?
>
> I think this whole discussion has digressed from Jan's original question
> above. Yes, there are sem_* calls in Solaris 2.5 (and 2.4 IIRC); you
> just need to link with -lposix4 or whatever to get them. But these are the
> *POSIX.1b* semaphores, which are *process-based* semaphores. They have
> nothing to do with threads.
>
> Now what Jan wants here is semaphore calls for *POSIX.1c*, i.e. POSIX
> threads. Now, IIRC, the sem_* calls are NOT specified in POSIX.1c, but
> rather their behaviour in MT programs has been clarified/refined in XPG5
> (Unix98) which allows you to use semaphores to synchronize threads and/or
> processes, depending on how you use them.

Not quite true. Yes, XSH5 (the "system interfaces" part of XPG5) says this; but it
does so because POSIX 1003.1-1996 says so, not because it's added something to
POSIX.

In fact, POSIX 1003.1b semaphores were designed by the same working group that did
1003.1c, and while 1003.1b-1993 was approved and published first (and therefore
couldn't mention threads), several aspects of 1003.1b were designed to work with
threads. For example, there are "realtime" extended versions of the 1003.1c
sigwait functions (sigtimedwait and sigwaitinfo). (The interfaces are slightly
incompatible because they return -1 and set errno on errors, rather than returning
an error code: that's because 1003.1c removed the use of errno AFTER 1003.1b was
finalized.)

Additionally, the sem_init function was designed with a parameter corresponding to
the "pshared" attribute of mutexes and condition variables. For 1003.1b, the only
supported value was 1, meaning "process shared". 1003.1c amended the sem_init
description to specify in addition that the value 0 meant "process private", for
use only between threads within the process. (But also note that it's perfectly
reasonable to create a "process shared" semaphore and then use it only between
threads within the process -- it may be less efficient on some implementations,
but it does the same thing.)
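
A minimal sketch of that point (editorial, not from the original post; error
handling trimmed): a 1003.1b semaphore initialized with pshared set to 0 and
used only between threads of one process.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    static sem_t sem;                  /* pshared == 0: threads of one process */

    static void *worker(void *arg)
    {
        sem_wait(&sem);                /* blocks until main() posts */
        puts("worker: got the semaphore");
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;

        sem_init(&sem, 0 /* pshared */, 0 /* initial count */);
        pthread_create(&tid, NULL, worker, NULL);
        sem_post(&sem);                /* wake the worker */
        pthread_join(tid, NULL);
        sem_destroy(&sem);
        return 0;
    }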

> Solaris 2.5 is not Unix98-conformant; the confusion arises because it
> *does* appear to be compliant with POSIX.1b and POSIX.1c (somebody at Sun can
> surely verify this). From what's been said here, I assume 2.6 is either Unix98-
> compliant, or at least contains the MT extensions to POSIX of Unix98.

Solaris 2.5 supports (most of) 1003.1b and 1003.1c, although there were a few
omissions and a few interpretation errors. (Like any other implementation.) This,
however, is not one of them. Solaris 2.5 does NOT define _POSIX_SEMAPHORES in
<unistd.h>, which is the way an implementation should advertise support for POSIX
semaphores. Therefore, while it may not implement all capabilities described by
1003.1b and 1003.1c, it doesn't (in this case, anyway) violate the standard. If
you're using POSIX semaphores (even if they seem to work) on Solaris 2.5, then
your application is not "strictly conforming", and if you're subject to any
incompatibilities or porting problems, that's your fault, not the fault of
Solaris. IT says they're not there.

(And, yes, POSIX specifically allows an implementation to provide interfaces while
claiming they're not there; and if it does so, it's not obligated to provide
strict conformance to the standard's description of those interfaces. This is what
Solaris 2.5 should have done, also, with the _POSIX_THREAD_PRIORITY_SCHEDULING
option, since it doesn't completely implement those interfaces.)

Presumably, Solaris 2.6 (though I don't have a system handy to check) DOES define
_POSIX_SEMAPHORES.

> At any rate, you can't use the sem_* calls for thread synchronization in
> 2.5; you get ENOSYS in MT programs. I know, I've tried it (on 2.5.1).
> AFAIK, single-threaded programs linked with -lposix4 work fine, but as I
> said above, they're only for process-based semaphores. So if you want to use the
>
> sem_* calls for thread-synchronization on Solaris, you have to go to 2.6.

First off, other replies have indicated that it's actually libthread, not
libposix4, that provides "working" (though not complete) POSIX semaphores. Most
likely, these semaphores would work with the "pshared" parameter set to either 0
(process) or 1 (cross-process). However, in any case, if you've got something that
can synchronize between processes, you should expect that it can synchronize
between threads as well, though there may be alternatives that are more efficient
on some implementations. (E.g., a pshared mutex will usually be more expensive to
lock or unlock than a private mutex.) (Such a difference in efficiency is less
likely for semaphores, since POSIX already requires that sem_post be async-signal
safe, which means it's far better and easier to keep the implementation inside the
kernel regardless of the pshared argument.)

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/

=================================TOP=============
 Q152: Thread & sproc (on IRIX)  

In article <[email protected]>,
Yann Boniface   wrote:
>I'm having trouble while using threads and processes on a massively
>parallel machine (SGI).
>The process creation is OK (sproc (myFunction, PR_SADDR, arg)) as long
>as I don't use the pthread library. If I compile the program with the flag
>-lpthread, process creation doesn't work any more, even if I don't
>explicitly use thread functions (errno is then ENOTSUP)

You shouldn't mix pthreads and sprocs. You should stick with one or
the other (IMHO pthreads are preferable).
-- 
Planet Bog -- pools of toxic chemicals bubble under a choking
atmosphere of poisonous gases... but aside from that, it's not
much like Earth.
=================================TOP=============
 Q153:  C++ Exceptions in Multi-threaded Solaris Process  

Jeff Gomsi  writes:

> We are running a large multi-threaded C++ (C++ 4.2 patch 
> 104631-03) application under Solaris (SunOS 5.5.1 
> Generic_103640-14) on a 14 processor Ultra-Enterprise and 
> observe the following problem.
> 
> The application runs fine single-threaded, but when run
> multi-threaded, throwing a C++ exception can (evidently) 
> cause memory corruption which leads to a SIGSEGV core
> dump. A diagnostic version of the new operator seems to
> reveal that C++ is unwinding things improperly and possibly
> calling destructors which should not be called.
> 
> Does anyone have any ideas on this?

The last time I looked at the patch list for the C++ 4.2, I noticed a
mention of a bug stating that exceptions were not thread safe.  There was
no further description of this bug that I could find.  However, it
supposedly is addressed by one of the later patches. Try upgrading your
patch to -04 or -05....

- Chris

Make sure you have the libC patch 101242-13.
=================================TOP=============
 Q154:  SCHED_FIFO threads without root privileges  ?  

Laurent Deniel wrote:

> Hi,
>
>  Is there a way to create threads that have the SCHED_FIFO scheduling
>  without root privileges (in system contention scope) ? by for instance
>  changing a kernel parameter (Digital UNIX 4.0B & D or AIX 4.2) ?
>
>  Thanks in advance,

In Digital UNIX 4.0, using process contention scope (PCS) threads,
any thread can set FIFO policy; further, it can set any priority.
Why? Because the policies and priorities for PCS threads affect only
the threads in the containing process. PCS FIFO/63 threads are really
important in relation to other PCS threads in the process, but have no
influence on the scheduling of other threads in other processes.
That aspect is controlled by the policies and priorities of the kernel
scheduling entities (VPs -- virtual processors) underlying the PCS
threads, and those characteristics are unaffected by the POSIX
scheduling interfaces.

On V4.0D, newly released, system contention scope (SCS) threads
are supported. Their policies and priorities are by definition seen
by the kernel scheduler and are therefore subject to privilege
restrictions. In short, you can set SCS threads to FIFO or RR policy
without  privilege on V4.0D, but FIFO threads cannot exceed POSIX
prio 18 and RR threads cannot exceed 19. Regardless of this
"limitation," it gives you plenty or rope to hang yourself with!

__________________________________________________
Jeff Denham ([email protected])
=================================TOP=============
 Q155: "lock-free synchronization"  

> I recently came across a reference to "lock-free synchronization" (in
> Taligent's Guide to Designing Programs.)  This document referred to
> research that was looking at using primitive atomic operations to build more
> complex structures in ways that did not require locking.
>
> I'm interested in exploring this topic further and would be grateful if
> anyone could supply references.
>
> Regards,
> Daniel Parker
>
>

Check out the following references --

  M. Herlihy, "Wait free Synchronization," ACM Transactions on Programming
Languages and Systems, Vol 13, No 1, 1991, pp. 124-149.

  M. Herlihy, "A Methodology for Implementing Highly Concurrent Data
Objects," same journal as above, Vol 15, No. 5, 1993, pp. 745 --770.

They should provide a starting point.

=================================TOP=============
 Q156: Changing single bytes without a mutex  

Tim Beckmann wrote:

> David Holmes wrote:
> >
> > I thought about this after posting. An architecture such as Bil describes
> > which requires non-atomic read/mask/write sequences to update variables of
> > a smaller size than the natural word size, would be a multi-threading
> > nightmare. As you note above two adjacent byte values would need a common
> > mutex to protect access to them and this applies even if they were each
> > used by only a single thread! On such a system I'd only want to program
> > with a thread-aware language/compiler/run-time.
> >
> > David
>
> David,
>
> My thoughts exactly!
>
> Does anyone know of a mainstream architecture that does this sort of
> thing?

Oh, absolutely. SPARC, MIPS, and Alpha, for starters. I'll bet most other RISC
systems do it, too, because it substantially simplifies the memory subsystem
logic. And, after all, the whole point of RISC is that simplicity means speed.

If you stick to int or long, you'll probably be safe. If you use anything
smaller, be sure they're not allocated next to each other unless they're under
the same lock.

I wrote a long post on most of the issues brought up in this thread, which
appears somewhere down the list due to the whims of news feeds, but I got
interrupted and forgot to address this issue.

If you've got

     pthread_mutex_t mutexA = PTHREAD_MUTEX_INITIALIZER;
     pthread_mutex_t mutexB = PTHREAD_MUTEX_INITIALIZER;

     char dataA;
     char dataB;

And one thread locks mutexA and writes dataA while another locks mutexB and
writes dataB, you risk word tearing, and incorrect results. That's a "platform
issue", that, as someone else commented, POSIX doesn't (and can't) address.

What do you do? I always advise that you keep a mutex and the data it protects
closely associated. As well as making the code easier to understand, it also
addresses problems like this. If the declarations were:

     typedef struct dataTag {
         pthread_mutex_t mutex;
         char data;
     } data_t;

     data_t dataA = {PTHREAD_MUTEX_INITIALIZER, 0};
     data_t dataB = {PTHREAD_MUTEX_INITIALIZER, 1};

You can now pretty much count on having the two data elements allocated in
separate "memory access chunks". Not an absolute guarantee, since a
pthread_mutex_t might be a char as well, and some C compilers might not align
structures on natural memory boundaries. But most compilers on machines that
care WILL align/pad structures to fit the natural data size, unless you
override it with a switch or pragma (which is generally a bad idea even when
it's possible). And, additionally, a pthread_mutex_t is unlikely to be less
than an int, and is likely at least a couple of longs. (On Digital UNIX, for
example, a pthread_mutex_t is 48 bytes, and on Solaris it's 24 bytes.)

There are, of course, no absolute guarantees. If you want to be safe and
portable, you might do well to have a config header that typedefs
"smallest_safe_data_unit_t" to whatever's appropriate for the platform. Then
it's just a quick trip to the hardware reference manual when you start a port.
On a CISC, you can probably use "char". On most RISC systems, you should use
"int" or "long".

Yes, this is one more complication to the process of threading old code. But
then, it's nothing compared to figuring out which data is shared and which is
private, and then getting the locking protocols right.

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/

> If I'm not mistaken, isn't that spelled:
>
>     #include <signal.h>
>
>     typedef sig_atomic_t smallest_safe_data_unit_t;

You are not mistaken, and thank you very much for pointing that out. While I'd
been aware at some point of the existence of that type, it was far from the top
of my mind.

If you have data that you intend to share without explicit synchronization, you
should be safe in using sig_atomic_t. Additionally, using sig_atomic_t will
protect you against word tearing in adjacent data protected by separate mutexes.
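
For illustration, the earlier mutexA/mutexB example with the data declared
as sig_atomic_t (an editorial sketch of the advice above, with illustrative
names; still subject to the platform caveats already discussed):

    #include <pthread.h>
    #include <signal.h>

    static pthread_mutex_t mutexA = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t mutexB = PTHREAD_MUTEX_INITIALIZER;

    static sig_atomic_t dataA;         /* protected by mutexA */
    static sig_atomic_t dataB;         /* protected by mutexB */

    /* Each datum is an "atomic access" type, so a store to one should not
     * tear the adjacent one even though they use different mutexes. */
    void setA(int v) { pthread_mutex_lock(&mutexA); dataA = v; pthread_mutex_unlock(&mutexA); }
    void setB(int v) { pthread_mutex_lock(&mutexB); dataB = v; pthread_mutex_unlock(&mutexB); }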

There are additional performance considerations, such as "false sharing" effects
in cache systems, that might dictate larger separations between two shared pieces
of data: but those won't affect program CORRECTNESS, and are therefore more a
matter of tuning for optimal performance on some particular platform.

=================================TOP=============
 Q157: Mixing threaded/non-threadsafe shared libraries on Digital Unix  

claude vallee wrote:

> Hi All.  I have a question on building a multi-threaded process (on
> Digital Unix 4.0) which is linked with non thread safe shared libraries.
>
> Let's say:
>
> mymain.c has no calls to thread functions and none of its functions runs
> in a secondary thread.  I will compile this file with the -pthread
> option. (I call secondary thread any but the main thread)
>
> liba.so contains non thread safe functions, but I know for a fact that
> none of its functions will run in a secondary thread.  This library was
> not built using the -pthread option.
>
> libb.so is my multi-thread library.  It creates threads and its
> functions are thread safe or thread reentrant.  All of its code was
> compiled with the -pthread option.  All the code executing in a
> secondary thread is in this library.
>
> The questions are:
>
> 1. Will this work?  If liba.so was not built with threads options, is it
> all right if it runs only in the main thread?  Which c runtime library
> will be used at run time? libc or libc_r?

On Digital UNIX 4.0D, this should be OK. On earlier versions, you need to
be careful around use of errno. For various historical reasons I won't try
to explain (much less justify), setting of errno through libc (using the
"seterrno()" function or equivalent) would previously result in setting the
process global errno ("extern int errno;"), not just the per-thread errno
of the calling thread.

For 4.0D, I was able to change the code so that seterrno() always sets the
calling thread's errno, and also sets the global errno ONLY if called from
the initial ("main") thread of the process. With this change, it's safe (as
least as far as errno use) to run non-thread-aware libraries as long as you
use them only in the initial thread.

To make this clear, prior to 4.0D, your liba code running in the main
thread may see errno change at random. As long as liba doesn't read errno,
this shouldn't be a problem.

You do have to be aware of whether liba SETS the global errno -- because
your thread-safe code won't see the global errno through any normal
mechanisms.

> 2. I noticed that on my DU 4.0, the libc.so and libc_r.so are
> identical!!  I assume this means that I am always using the thread safe
> version of the libc library.  Is that correct?

Yes -- libc_r was another historical oddity. (Due to OSF/1 rules.) It no
longer exists, as of Digital UNIX 4.0. The (hard) link provides binary
compatibility for older programs that still reference libc_r.

> 3. What does -pthread do to my code?  I saw that my objects are
> different (in size anyway), and that my executable point to libmach and
> libpthread.  What is added to the object code?

There are two basic contributions of "-pthread":

   * At compile-time, the definition -D_REENTRANT is provided
   * At link-time, the proper libraries are added, to the end of the actual
     list of libraries, but immediately before the implicit reference to
     libc that's generated by the compile driver. Additionally, -pthread
     causes the linker to search for "reentrant" versions of any library
     you specify. (E.g., if you say "-lfoo" and there's a libfoo_r in your
     -L path, the linker will automatically use it.)

The primary effect of -D_REENTRANT is to change <errno.h> -- references to errno
errno make a call into the thread library to get the thread's private errno
address rather than the global errno. There are some other changes to
various headers, though.

> 4. Does defining _THREAD_SAFE when compiling and linking with
> libpthread, libmach and libc_r equivalent to building with the -pthread
> option?

No, _THREAD_SAFE doesn't do anything. It's considered obsolete. You should
use _REENTRANT. (Though I actually prefer the former term, I've never felt
it was worth arguing, or making anyone change a ton of header files.)

> I did some tests, and everything works well... for the moment, but IMHO,
> it does not mean anything.  Everyone knows that non thread safe code
> will work perfectly fine until your demo ;-)

Depends. If the demo is a critical requirement for a multi-million dollar
sale, then, yeah, it can probably hurt you worst by failing then.
Otherwise, though, it'll have a lot more fun by SUCCEEDING at the demo, and
failing when the customer runs the code in their mission-critical
environment. This is a corollary to a corollary to Murphy's Law, which
stated something about the inherent perversity of inanimate objects...

Oh... and since liba is, presumably, a third-party library over which
you've got no direct control... you should tell them immediately that
you're running their code in a threaded application, and it would be to
their long-term benefit to build a proper thread-safe version before you
find another option. If liba is YOUR code, then please don't play this
game: build it at least with -D_REENTRANT.

/---------------------------[ Dave Butenhof ]--------------------------\

=================================TOP=============
 Q158: VOLATILE instead of mutexes?  


> What about exception handlers ? I've always thought that when you had
> code like:
>
>         int i;
>
>         TRY
>         {
>                 <foodle with i>
>                 . . .
>                 proc();
>         }
>         CATCH_ALL
>         {
>                 if (i > 0)
>                 {
>                         . . .
>                 }
>                 . . .
>         }
>
> that you needed to declare "i" to be volatile least the code in the
> catch block assume that "i" was stored in some register the contents
> of which were overwritten by the call to "proc" (and not restored by
> whatever mechanism was used to throw the exception).

Since neither ANSI C nor POSIX has any concept remotely resembling "exceptions", this
is all rather moot in the context of our general discussion, isn't it? I mean, it's
got nothing to do with sharing data between threads -- and that's what I thought we
were talking about. But sure, OK, let's digress.

Since there's no standard covering the behavior of anything that uses exceptions, (at
least, not if you use them from C, or even if you use the DCE exception syntax you've
chosen from C++), there's no portable behavior. Your fate is in the hands of the
whimsical (and hypothetically malicious ;-) ) implementation. This situation might
lead a cautious programmer to be unusually careful when treading in these waters, and
to wear boots with thick soles. (One might also say that it could depend on exactly
HOW you "foodle with i", but I'll choose to disregard an entire spectrum of mostly
amusing digressions down that fork.)

Should you use volatile in this case? Sure, why not? It might not be necessary on
many platforms. It might destroy your performance on any platform. And, where it is
necessary, it might not do what you want. But yeah, what the heck -- use it anyway.
It's more likely (by some small margin) to save you than kill you.

Or, even better... don't code cases like this!

/---------------------------[ Dave Butenhof ]--------------------------\
=================================TOP=============
 Q159: After pthread_cancel() destructors for local objects do not get called?!  
 
> Hello,
> I've run into trouble when I found out that when I cancel a thread via
> pthread_cancel(), destructors for local objects do not get called.
> Surprising :). But how to deal with this? With simple thread code
> it would not be a big problem, but in my case it's fairly complex code,
> quite a few STL classes etc. Has someone dealt with such a problem and is
> willing to share his/her solution with me? I thought I could 'cancel' the
> thread via pthread_kill() and raise an exception within a signal handler
> but it's probably NOT a very good idea, is it? ;)
> Thank you,
>         Ales Pour
> 
> Linux, egcs-1.0.3, glibc-2.0.7 with LinuxThreads
Ales,

  Unfortunately, not surprising.  C++ has not formally decided what to do with
thread cancellation, so it becomes compiler-specific.  The Sun compiler (for 
example) will run local object destructors upon pthread_exit() (hence 
cancellation also).  Others may not.

  I suppose the best GENERAL C++ solution is:

    a) Don't use stack-allocated objects.
    b) Don't use cancellation.

  Otherwise you can simply insist on a C++ compiler that runs the destructors.

-Bil
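
At the C level of this problem, an editorial sketch (illustrative names;
this releases resources registered with the thread library, it does not make
C++ destructors run): pthread_cleanup_push()/pthread_cleanup_pop() handlers
execute when a thread is cancelled, so heap or lock resources can still be
freed.

    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void release(void *p)
    {
        free(p);                       /* runs if the thread is cancelled */
    }

    static void *worker(void *arg)
    {
        char *buf = malloc(4096);      /* resource that must not leak */

        pthread_cleanup_push(release, buf);
        for (;;) {
            pthread_testcancel();      /* explicit cancellation point */
            sleep(1);                  /* sleep() is a cancellation point too */
        }
        pthread_cleanup_pop(1);        /* not reached; matches the push */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;

        pthread_create(&tid, NULL, worker, NULL);
        pthread_cancel(tid);
        pthread_join(tid, NULL);
        return 0;
    }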
=================================TOP=============
 Q160: No pthread_exit() in Java.  

 >    In POSIX, we have pthread_exit() to exit a thread.  In Java we
 >  *had* Thread.stop(), but now that's gone.  Q: What's the best way
 >  to accomplish this?
 >  
 >    I can (a) arrange for all the functions on the call stack to
 >  return, all the way up to the top, finally returning from the
 >  top-level function.  I can (b) throw some special exception I
 >  build for the purpose, TimeForThreadToExitException, up to the
 >  top-level function.  I can throw ThreadDeath.
 >  
 >    But what I really want is thread.exit().
 >  
 >    Thoughts?
 >  
 >  -Bil
 > -- 
 > ================
 > Bil LambdaCS.com
 > 
 > http://www.LambdaCS.com
 > Lambda Computer Science 
 > 555 Bryant St. #194 
 > Palo Alto, CA,
 > 94301 
 > 
 > Phone/FAX: (650) 328-8952
 > 

Here's a real quick reply (from a slow connection from
Sydney AU (yes, visiting David among other things)). I'll
send something more thorough later....

Throwing ThreadDeath yourself is a pretty good way to force current
thread to exit if you are sure it is in a state where it makes sense
to do this.

But if you mean, how to stop other threads: This is one reason why
they are extremely unlikely to actually remove Thread.stop(). The next
best thing to do is to take some action that is guaranteed to cause
the thread to hit a runtime exception. Possibililies range from the
well-reasoned -- write a special SecurityManager that denies all
resource-checked actions, to the sleazy -- like nulling out a pointer
or closing a stream that you know thread needs. See
  http://gee.cs.oswego.edu/dl/cpj/cancel.html
for a discussion of some other alternatives.


-Doug


ThreadDeath is an Error (not a checked Exception, since apps routinely
catch all checked Exceptions) which has just the semantics you are talking
about: it is a Throwable that means "this thread should die".  If
you catch it (because you have cleanup to do), you are SUPPOSED to
rethrow it.  1.2 only, though, I think.  Thread.stop() uses it, but
although stop() is deprecated, it appears that ThreadDeath is not.

I think.  :^)

Nicholas

There is *nothing* special about a ThreadDeath object. It does not mean
"this thread should die" but rather it indicates that "this thread has
been asked to die". The only reason it "should" be rethrown is that if
you don't then the thread doesn't actually terminate. This has always
been documented as such and is not specific to 1.2.

If a thread decides that for some reason it cannot continue with its work
then it can simply throw new ThreadDeath() rather than calling stop()
on itself. The only difference is that with stop() the Thread is
immediately marked as no longer alive - which is a bug in itself.

Cheers,
David
=================================TOP=============
 Q161: Is there anyway I can make my stacks red zone protected?  

Allocate your stack segments using mmap.  Use mprotect to make the
page after the bottom of your stack read-only (I'm assuming the stack
grows down on whatever system you're using), or leave a hole in your
address space.  If you get a segfault due to an attempted write at the
top of a red zone, map in some more stack and build a new red zone.
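
A hedged sketch of that recipe (editorial; it assumes a downward-growing
stack, the MAP_ANONYMOUS flag, and the later POSIX pthread_attr_setstack()
call rather than the older setstackaddr/setstacksize pair):

    #include <pthread.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define STACK_SIZE (256 * 1024)

    static void *worker(void *arg) { return arg; }

    int main(void)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        pthread_attr_t attr;
        pthread_t tid;

        /* Reserve the stack plus one extra page for the red zone. */
        char *base = mmap(NULL, STACK_SIZE + page, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return 1;

        /* Stack grows down (assumed), so the lowest page is the red zone:
         * any write into it faults instead of silently corrupting memory. */
        mprotect(base, page, PROT_READ);

        pthread_attr_init(&attr);
        pthread_attr_setstack(&attr, base + page, STACK_SIZE);
        pthread_create(&tid, NULL, worker, NULL);
        pthread_join(tid, NULL);
        return 0;
    }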



=================================TOP=============
 Q162: Cache Architectures, Word Tearing, and VOLATILE


Tim Beckmann wrote:

> Dave Butenhof wrote:
> > > David,
> > >
> > > My thoughts exactly!
> > >
> > > Does anyone know of a mainstream architecture that does this sort of
> > > thing?
> >
> > Oh, absolutely. SPARC, MIPS, and Alpha, for starters. I'll bet most other RISC
> > systems do it, too, because it substantially simplifies the memory subsystem
> > logic. And, after all, the whole point of RISC is that simplicity means speed.
>
> MIPS I know :)  The latest MIPS processors R10K and R5K are byte addressable.
> The whole point of RISC is simplicity of hardware, but if it makes the software
> more complex it isn't worth it :)

The whole idea of RISC is *exactly* to make software more complex. That is,
by simplifying the hardware, hardware designers can produce more stable
designs that can be produced more quickly and with more advanced technology
to result in faster hardware. The cost of this is more complicated
software. Most of the complexity is hidden by the compiler -- but you can't
necessarily hide everything. Remember that POSIX took advantage of some
loopholes in the ANSI C specification around external calls to proclaim that
you can do threaded programming in C without requiring expensive and awkward
hacks like "volatile". Still, the interpretation of ANSI C semantics is
stretched to the limit. The situation would be far better if a future
version of ANSI C (and C++) *did* explicitly recognize the requirements of
threaded programming.

> > If you stick to int or long, you'll probably be safe. If you use anything
> > smaller, be sure they're not allocated next to each other unless they're under
> > the same lock.
>
> Actually, you can be pretty sure that a compiler will split two declarations
> like:
>         char dataA;
>         char dataB;
> to be in two separate natural machine words.  It is much faster and easier for
> those RISC processors to digest.  However if you declare something as:

While that's certainly possible, that's just a compiler optimization
strategy. You shouldn't rely on it unless you know FOR SURE that YOUR
compiler does this.

>         char data[2]; /* or more than 2 */
> you have to be VERY concerned with the effects of word tearing since the
> compiler will certainly pack them into a single word.

Yes, this packing is required. You've declared an array of "char" sized
data, so each array element had better be allocated exactly 1 char.

> > I wrote a long post on most of the issues brought up in this thread, which
> > appears somewhere down the list due to the whims of news feeds, but I got
> > interrupted and forgot to address this issue.
>
> Yep, I saw it.  It was helpful.  So was the later post by someone else who
> included a link to a DEC alpha document that explained what a memory barrier
> was in this context.  I've seen three different definitions over the years.
> The definition you described in your previous post agreed with the DEC alpha
> description... That a memory barrier basically doesn't allow out of order
> memory accesses to cross the barrier.  A very important issue if you are
> implementing mutexes or semaphores :)[...]
>
> However, I really believe that dataA and dataB should both be declared as
> "volatile" to prevent the compiler from being too aggressive on it's
> optimization.  The mutex still doesn't guarantee that the compiler hasn't
> cached the data in an internal register across a function call.  My memory
> isn't perfect, but I do think this bit me on IRIX.

The existence of the mutex doesn't require this, but the semantics of POSIX
and ANSI C do require it. Remember that you lock a mutex by calling a
function, passing an address. While an extraordinarily aggressive C compiler
with a global analyzer might be able to determine reliably that there's no
way that call could access the data you're trying to protect, such a
compiler is unlikely -- and, if it existed, it would simply violate POSIX
1003.1-1996, failing to support threads.

You do NOT need volatile for threaded programming. You do need it when you
share data between "main code" and signal handlers, or when sharing hardware
registers with a device. In certain restricted situations, it MIGHT help
when sharing unsynchronized data between threads (but don't count on it --
the semantics of "volatile" are too fuzzy). If you need volatile to share
data, protected by POSIX synchronization objects, between threads, then your
implementation is busted.

> > There are, of course, no absolute guarantees. If you want to be safe and
> > portable, you might do well to have a config header that typedefs
> > "smallest_safe_data_unit_t" to whatever's appropriate for the platform. Then
> > it's just a quick trip to the hardware reference manual when you start a port.
> > On a CISC, you can probably use "char". On most RISC systems, you should use
> > "int" or "long".
>
> There never are guarantees are there :)

To reiterate again one more time, ( ;-) ), the correct (ANSI C) portable
type for atomic access is sig_atomic_t.

> > Yes, this is one more complication to the process of threading old code. But
> > then, it's nothing compared to figuring out which data is shared and which is
> > private, and then getting the locking protocols right.
>
> But what fun would it be if it wasn't a challenge :)

Well, yeah. That's my definition of "fun". But not everyone's. Sometimes
"boring and predictable" can be quite comforting.

> However, I would like to revist the original topic of whether it is "safe" to
> change a single byte without a mutex.  Although, instead of "byte" I'd like to
> say "natural machine word" to eliminate the word tearing and non-atomic memory
> access concerns.  I'm not sure it's safe to go back to the original topic, but
> what the heck ;)

sig_atomic_t.

> If you stick to a "natural machine word" that is declared as "volatile",
> you do not absolutely need a mutex (in fact I've done it).  Of course, there are
> only certain cases where this works and shouldn't be done unless you really know
> your hardware architecture and what you're doing!  If you have a machine with a
> lot of processors, unnecessarily locking mutexes can really kill parallelism.
>
> I'll give one example where this might be used:
>
> volatile int stop_flag = 0;  /* assuming an int is atomic */
>
> thread_1
> {
>         /* bunch of code */
>
>         if some condition exists such that we wish to stop thread_2
>                 stop_flag = 1;
>
>         /* more code - or not :) */
> }
>
> thread_2
> {
>         while(1)
>         {
>                 /* check if thread should stop */
>                 if (stop_flag)
>                         break;
>
>                 /* do whatever is going on in this loop */
>         }
> }
>
> Of course, this assumes the hardware has some sort of cache coherency
> mechanism.  But I don't believe POSIX mutex's or memory barriers (as
> defined for the DEC alpha) have any impact on cache coherency.

If a machine has a cache, and has no mechanism for cache coherency, then it
can't work as a multiprocessor.

> The example is simplistic, but it should work on a vast majority of
> systems.  In fact the stop_flag could just as easily be a counter
> of some sort as long as only one thread is modifying the counter...

In some cases, yes, you can do this. But, especially with your "stop_flag",
remember that, if you fail to use a mutex (or other POSIX-guaranteed memory
coherence operation), a thread seeing stop_flag set CANNOT assume anything
about other program state. Nor can you ensure that any thread will see the
changed value of stop_flag in any particular bounded time -- because you've
done nothing to ensure memory ordering, or coherency.

And remember very carefully that bit about "as long as only one thread is
modifying". You cannot assume that "volatile" will ever help you if two
threads might modify the counter at the same time. On a RISC machine,
"modify" still means load, modify, and store, and that's not atomic. You
need special instructions to protect atomicity across that sequence (e.g.,
load-lock/store-conditional, or compare-and-swap).

Am I trying to scare you? Yeah, sure, why not? If you really feel the need
to do something like this, do yourself (and your project) the courtesy of
being EXTREMELY frightened about it. Document it in extreme and deadly
detail, and write that documentation as if you were competing with Stephen
King for "best horror story of the year". I mean to the point that if
someone takes over the project from you, and doesn't COMPLETELY understand
the implications, they'll be so terrified of the risk that they'll rip out
your optimizations and use real synchronization. Because this is just too
dangerous to use without full understanding.

There are ways to ensure memory ordering and coherency without using any POSIX
synchronization mechanisms, on any machine that's capable of supporting POSIX
semantics. It's just that you need to be really, really careful, and you need to be
aware that you're writing extremely machine-specific (and therefore inherently
non-portable) code. Some of this is "more portable" than others, but even the
"fairly portable" variants (like your stop_flag) are subject to a wide range of
risks. You need to be aware of them, and willing to accept them. Those who aren't
willing to accept those risks, or don't feel inclined to study and fully understand
the implications of each new platform to which they might wish to port, should
stick with mutexes.
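
For comparison, a minimal editorial sketch (illustrative names) of the same
stop_flag idea done with a mutex, which is the portable route recommended
above:

    #include <pthread.h>

    static pthread_mutex_t stop_lock = PTHREAD_MUTEX_INITIALIZER;
    static int stop_flag;              /* protected by stop_lock */

    void request_stop(void)            /* called from thread_1 */
    {
        pthread_mutex_lock(&stop_lock);
        stop_flag = 1;
        pthread_mutex_unlock(&stop_lock);
    }

    int should_stop(void)              /* polled by thread_2's loop */
    {
        int s;

        pthread_mutex_lock(&stop_lock);
        s = stop_flag;
        pthread_mutex_unlock(&stop_lock);
        return s;
    }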

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/


=================================TOP=============
 Q163: Can ps display thread names?


>  Is there a way to display the name of a thread (specified thanks
>  to the function pthread_setname_np) in commands such as ps?
>  (in order to quickly see the behavior of a well-known thread).
>  If it is not possible with Digital UNIX's ps, someone may
>  have hacked some interesting similar utilities that display
>  such thread information?

The ps command is a utility to show system information, and this would be
getting into an entirely different level of process information. It would,
arguably, be "inappropriate" to do this in ps. In any case, the decision
was made long ago to not do as you suggest.

The easiest way to get this information is to attach to the process with
ladebug (or another suitable user-level-thread-enabled debugger) and ask
for the information. (E.g., ladebug's "show thread" command.)

While one certainly could create a standalone utility, you'd need to find
the binary associated with the process, look up symbols, use /proc (or
equivalent) to access process memory, and so forth -- sounds a lot like a
debugger, doesn't it?

The mechanism used to access this information is in libpthreaddebug.so. As
of 4.0D, the associated header file is available on the
standard OS kit (with the other development header files). Although it's
not externally documented, it's heavily commented, and reasonably
self-describing.

=================================TOP=============
 Q164: (Not!) Blocking on select() in user-space pthreads.

Subject: Re: Blocking on select() in user-space pthreads under HP/UX 10.20 

David Woodhouse wrote:

> HP/UX 10.20 pthreads are implemented as user-space as opposed to
> kernel threads. I've heard rumors that a user-space thread that blocks on
> select() actually blocks all other threads within that process (i.e. the
> entire process). True or false?

The answer is an absolute, definite, unambiguous... maybe.

Or, to put it another way... the answer is true AND false.

However, being in a generous (though slightly offbeat) mood today, I'll go a
little further and explain the answer. (Remember how the mice built Deep
Thought to find the answer to "Life, the Universe, and Everything", and it
came back with the answer "42", so they had to go off and build an entirely
different computer to find the question, which was "what is 9 times 6",
resulting in a third attempt, this time to find out what the question and
answer MEANT?)

Anyway, any blocking kernel call, including select, will indeed block the
process. However, if built correctly, a DCE thread (that's the origin of the
thread library on 10.20) application will never directly call select.
Instead, its calls will be redirected to the user-mode thread library's
"thread-synchronous I/O" package. This package will attempt a NON-blocking
select, and, if it would have needed to block (none of the files are
currently ready), the thread will be blocked on a condition variable "until
further notice". At various points, the library polls (with select) and
awakens any thread waiting for a file that's now ready. When all application
threads are blocked somewhere, the thread library blocks the process in
select, with a mask that OR-s all the masks for which any thread is waiting,
and with a timeout representing the next user-mode timer
(pthread_cond_timedwait, select, whatever).

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/


=================================TOP=============
 Q165: Getting functional tests for UNIX98 

> Dave Butenhof wrote somewhere that there were
> a set of functional tests for UNIX98, that could
> also work with POSIX. Any idea where I could find
> it?

The place to look is The Open Group. Start with http://www.xopen.org/.
(Unfortunately I don't have a bookmark handy for the test suite, and I
can't get to xopen.org right now; so you're on your own from here. ;-))

=================================TOP=============
 Q166: To make gdb work with linuxthreads?  

Are there any ongoing work or plans to make gdb work with linuxthreads?
>
>- Erik

Yes, there's a project at the University of Kansas called SmartGDB that is 
working on support for user-level and kernel-level threads.  There is 
already support for several user level thread packages and work is 
currently being done on the linuxthreads support.  The URL is:
    http://hegel.ittc.ukans.edu/projects/smartgdb/

We have done most of the kernel modifications required to support it and 
are working on the rest of the changes to gdb.  At this point, I can't 
even guess on a release date, but you can check the web page for more 
information on what's been done so far.  The email contact is 
[email protected].

Robert
=================================TOP=============
 Q167: Using cancellation is *very* difficult to do right...  

Bil Lewis wrote:

> Dave Butenhof wrote:
> > >   Using cancellation is *very* difficult to do right, and you
> > > probably don't want to use it if there is any other way you can
> > > accomplish your goal.  (Such as looking at a "finish" flag as you
> > > do below.)
> >
> > I don't agree that cancellation is "very" difficult, but it does
> > require understanding of the application, and some programming
> > discipline. You have to watch out for cancellation points, and be
> > sure that you've got a cleanup handler to restore shared data
> > invariants and release resources that would otherwise be stranded if
> > the thread "died" at that point. It's no worse than being sure you
> > free temporary heap storage, or unlock your mutexes, before
> > returning from a routine... but that's not to say that it's trivial
> > or automatic. (And I'm sure we've never gotten any of those things
> > wrong... ;-) )
>
>   Dave has written 10x as much UNIX code as I have, so our definitions
> of "very difficult" are distinct.  (I've probably been writing MP code
> longer tho...  I built my first parallel processor using two PDP/8s
> back in '72.  Now THERE was a RISC machine!  DEC could have owned the
> world if they'd tried.  I remember when...)

Yeah, PDP-8 was a pretty good RISC, for the time. Of course it needed
virtual memory, and 12 bits would now be considered a rather "quirky"
word size. But, yeah, those could have been fixed.

Oh yeah... and we DID own the world. We just let it slip out of our
hands because we just laughed when little upstarts said they owned it.
(Unfortunately, people listened, and believed them, and eventually it
came to be true.) ;-) ;-)

>   It's that bit "to restore shared data invariants". Sybase, Informix,
> Oracle, etc. spend years trying to get this right.  And they don't
> always succeed.

It's hard to do hard things. Yeah, when you've got an extremely
complicated and extremely large application, bookkeeping gets more
difficult. This applies to everything, not just handling cancellation.
Just as running a multinational corporation is harder than running a
one-person home office. The point is: the fact that the big job is hard
doesn't mean the small job is hard. Or, you get out what you put in. Or
"thermodynamics works". Or whatever.

>   And don't forget to wait for the dead threads.  You can't do
> anything with the shared data until those have all been joined,
> because you can't be sure when they actually die.

That's untrue, as long as you use proper synchronization (or maybe "as
long as you use synchronization properly"). That's exactly why the mutex
associated with a condition wait is re-locked even when the wait is
cancelled. Cleanup code needs (in general) to restore invariants before
the mutex can be safely unlocked. (Note that while the data must be
"consistent", at the time of the wait, because waiting has released the
mutex, it's quite common to modify shared data in some way associated
with the wait, for example, adding an item to a queue; and often that
work must be undone if the wait terminates.)
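
A small editorial sketch of that pattern (illustrative names): the cleanup
handler runs with the mutex already re-acquired, undoes the advertised work,
and unlocks.

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int waiters;                /* shared state protected by lock */
    static int data_ready;

    static void cleanup(void *arg)
    {
        waiters--;                     /* restore the invariant we changed */
        pthread_mutex_unlock(&lock);   /* the cancelled wait re-acquired it */
    }

    void *consumer(void *arg)
    {
        pthread_mutex_lock(&lock);
        waiters++;
        pthread_cleanup_push(cleanup, NULL);
        while (!data_ready)
            pthread_cond_wait(&cond, &lock);   /* cancellation point */
        pthread_cleanup_pop(0);        /* 0: don't run it on the normal path */
        waiters--;
        pthread_mutex_unlock(&lock);
        return NULL;
    }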

You only need to wait for the cancelled thread if you care about its
return value (not very interesting in this case, since it's always
PTHREAD_CANCELED, no matter how many times you look), or if you really
want to know that it's DONE cleaning up (not merely that the shared data
is "consistent", but that it conforms to some specific consistency -- an
attempt that I would find suspicious at best, at least if there might be
more than the two threads wandering about), or if you haven't detached
the thread and want to be sure it's "gone".

>   Conclusion: Dave is right (of course).  The definition of "very" is
>   up for grabs.

The definition of the word "very" is always up for grabs. As Samuel
Clemens once wrote, when you're inclined to use the word "very", write
"damn" instead; your editor will remove it, and the result will be
correct.

Sure, correct handling of cancellation doesn't happen automatically.
Neither does correct use of threads, much less correct use of the arcane
C language (and if C is "arcane", what about English!?) Somehow, we
survive all this.

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/

=================================TOP=============
 Q168: Why do pthreads implementations differ in error conditions?  
 

[email protected] wrote:

> I'd like to understand why pthreads implementations from different
> vendors define error conditions differently.  For example, if
> pthread_mutex_unlock is called for a mutex that is not owned by the
> calling thread.
>
>    Under Solaris 2.5:  "If the calling thread is not the owner of the
>    lock, no error status is returned, and the behavior of the program
>    is undefined."
>
>    Under AIX 4.2:  It returns the EPERM error code.
>
> The problem may be that the AIX 4.2 implementation is based on draft 7
> of the pthreads spec, not the standard, but I certainly prefer the AIX
> approach.
>
> Another example:  pthread_mutex_lock is called for a mutex that is
> already owned by the calling thread.
>
>    Under Solaris 2.5: "If the current owner of a mutex tries to relock
>    the mutex, it will result in deadlock." (The process hangs.)
>
>    Under AIX 4.2: It returns the EDEADLK error code.
>
> Once again, the AIX approach certainly seems preferable.
>
> Aren't these issues clearly defined by the pthreads standard?  If not,
> why not?

Yes, these issues are clearly defined by the POSIX standard. And it's
clearly defined in such a manner that implementations are not required
to make the (possibly expensive) checks to report this sort of
programmer error -- but so that implementations that do choose to detect
and report the error must do so using a standard error code.

In this case, Solaris 2.5 chooses not to "waste" the time it would take
to detect and report your error, while AIX 4.2 does. Both are correct
and conform to the standard. (Although, as you pointed out, AIX 4.2
implements an obsolete draft of the standard, in this respect it doesn't
differ substantially from the standard.)

The POSIX philosophy is that errors that are not under programmer
control MUST (or, in POSIX terms, "SHALL") be reported. Examples include
ENOMEM, and other resource shortages. You can't reasonably know that
there's not enough memory to create a thread, because you can't really
know how much you're using, or how much additional is required. On the
other hand, you can be expected to know that you've already locked the
mutex, and shouldn't try to lock it again. POSIX does not require that
an implementation deal gracefully with such programmer errors.

While it is nice to have a "debugging mode" where all programmer errors
are detected, in the long run it's more important to have a "production
mode" where such extremely critical functions as mutex lock and unlock
execute as quickly as possible. In general, the only way to do both is
to have two separate libraries. This complicates maintenance
substantially, of course -- but it also complicates application
development because the two libraries will have different timings, and
will expose different problems in the application design. Which means
you'll inevitably run into a case that only fails on the "production
library", and can't be debugged using the "debug library". That usually
means the development and maintenance costs of building and shipping two
thread libraries usually isn't worthwhile.

You're better off relying on other tools to detect this class of
programming error. For example, the Solaris "lock_lint" program, or
dynamic trace analysis tools that can check for incorrect usage at
runtime in your real program.
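
One more option, where available (an editorial sketch; the mutex "type"
attribute comes from UNIX98/XSH5, some implementations of that era spell the
constant with an _NP suffix, and not all provide it): an ERRORCHECK mutex
turns both of the programmer errors discussed above into reported error
codes.

    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>

    int main(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutex_t m;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
        pthread_mutex_init(&m, &attr);

        pthread_mutex_lock(&m);
        if (pthread_mutex_lock(&m) == EDEADLK)     /* relock is reported */
            puts("relock detected");
        pthread_mutex_unlock(&m);
        if (pthread_mutex_unlock(&m) == EPERM)     /* non-owner unlock reported */
            puts("unlock by non-owner detected");

        pthread_mutex_destroy(&m);
        pthread_mutexattr_destroy(&attr);
        return 0;
    }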

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/

=================================TOP=============
 Q169: Mixing threaded/non-threadsafe shared libraries on DU  
 
Claude Vallee wrote:

> Thanks Dave Butenhof for your excellent answer.  I just have a few
> complementary questions.
>
> + To make this clear, prior to 4.0D, your liba code running in the
> + main thread may see errno change at random. As long as liba doesn't
> + read errno, this shouldn't be a problem.
> +
>
> I found out I was using 4.0B.  Is errno the only the problem area of
> the c run time library?  What about other libraries like librt?

The base kit libraries should be thread-safe. There are so many, though,
that I'm afraid I can't claim personal knowledge that they all ARE
thread-safe. I also know of at least one case where a base kit library
was "almost" thread-safe, but had been compiled without -D_REENTRANT
(making it subject to errno confusion). Bugs, unfortunately, are always
possible. While it's not that hard to code a library for thread-safety,
it's nearly impossible to PROVE thread-safety -- because problems come
in the form of race conditions that are difficult to predict or provoke.

> + You do have to be aware of whether liba SETS the global errno --
> + because your thread-safe code won't see the global errno through any
> + normal mechanisms.
>
> What do you mean by that?  Yes, liba sets errno each time it calls a
> system service (the system service sets it actually).  If you're
> asking if it explicitely sets it, then no.  Are you asking if I am
> counting on setting errno from one thread and reading it from the
> other thread counting on the value to be the same?

Calling system services doesn't count. The libc syscall stubs that
actually make the kernel call DO handle errno correctly with threads.
(On the other hand, if your library runs in a threaded application and
isn't built correctly, you'll end up looking at the global errno while
the system/libc just set your thread errno. That's the point of my 4.0D
change -- at least if that non-thread-safe code runs in the initial
thread, it'll get the right errno.)

> By the way, seterrno(), does not seem to be a public service (it
> doesn't have a man page anyway, (I found _Seterrno() in errno.h, but I
> we're certainly not using it )).

You're right -- int _Geterrno() and _Seterrno(int) are the external
interfaces. I'd recommend compiling for the threaded errno rather than
using those calls, though.

> + No, _THREAD_SAFE doesn't do anything. It's considered obsolete. You
> + should use _REENTRANT. (Though I actually prefer the former term,
> + I've never felt it was worth arguing, or making anyone change a ton
> + of header files.)
>
> Ok, _THREAD_SAFE is out.  Then, if I define _REENTRANT when compiling
> all my sources, and I explicitely link with libpthread, libmach,
> libc_r, and all the reentrant versions of my libraries, will this
> achieve the same thing as using the "-pthread" option?  (Or am I
> playing with fire again?).

We document the equivalents, and it's perfectly legal to use them.
However, the actual list of libraries may change from time to time, and
you'll get the appropriate set for the current version, automatically,
when you link with "cc -pthread". Over time, using the older set of
libraries may leave you carrying around extra baggage. For example, your
reference to libc_r is long-since obsolete; and on 4.0D, threaded
applications no longer need libmach. While -pthread links automatically
stop pulling in these useless references, you'll still be carrying them
around, costing you extra time at each program load, as well as extra
virtual memory. Is it a big deal? That's up to you. But if you're using
a compiler that supports "-pthread", I'd recommend using it.

> + If liba is YOUR code, then please don't play this
> + game: build it at least with -D_REENTRANT.
>
> Yes, liba is our code... Actually, liba is a set of hundreds of
> libraries which take a weekend to build.  And most of our processes
> are not multithread.  What I was trying to achieve is to save on
> processing time (use non thread-safe libraries in single threaded
> processes), and to save on compile time (not building both single
> threaded and multi threaded versions of all the libraries).

If you just compile with -D_REENTRANT, you'll get thread-safe errno, but
that's only a small part of "thread safety". As long as it's only called
in one thread, that's probably OK. For more general thread-safety, with
very little performance impact on non-threaded processes, you might look
into the TIS interfaces (tis_mutex_lock, tis_mutex_unlock, etc.). You
can use these just like the equivalent POSIX functions; in a threaded
process, they do the expected synchronization, but in a non-threaded
process they avoid the cost of interlocked data access and memory
barrier instructions, giving very low overhead. (TIS is better than
having pthread stubs in libc, because it works even when the main
program isn't linked against libpthread.)

> Thanks again for a thorough answer.  By the way, for some reason, I
> could never get the answer to my question from my news server (I got
> it from searching usenet through altavista), so please reply by email
> as well as through the newsgroup.

Yeah, news servers can be annoying creatures, with their own strange
ideas of what information you should be allowed to see. You really can't
trust them! ;-)

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/

=================================TOP=============
 Q170: sem_wait() and EINTR  

W. Richard Stevens wrote:

> Posix.1 says that sem_wait() can return EINTR.  Tests on both Solaris 2.6
> and Digital Unix 4.0b show that both implementations do return EINTR when
> sem_wait() is interrupted by a caught signal.  But if you look at the
> (albeit simplistic) implementation of semaphores in the Posix Rationale
> that uses mutexes and condition variables (pp. 517-520) sem_wait() is
> basically:
>
>     int
>     sem_wait(sem_t *sem)
>     {
>             pthread_mutex_lock(&sem->mutex);
>             while (sem->count == 0)
>                     pthread_cond_wait(&sem->cond, &sem->mutex);
>             sem->count--;
>             pthread_mutex_unlock(&sem->mutex);
>             return(0);
>     }
>
> But pthread_cond_wait() does not return EINTR so this function will never
> return EINTR.  So I was wondering how existing implementations actually
> implement sem_wait() to detect EINTR.  Seems like it would be a mess ...
>
>         Rich Stevens


On Digital UNIX, sem_wait() turns into a system call, with the usual behavior
in regard to signals and EINTR. You can only implement POSIX semaphores via
pthreads interfaces if you have support for the PSHARED_PROCESS synch
attribute. Digital UNIX won't have that support until an upcoming major
release. At that point, it's entirely possible that the POSIX semaphores will
be re-implemented to use the model you cite. I certainly encouraged the
maintainer to do so before I left...
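
Whatever the implementation does, portable callers still have to cope with
the EINTR return from sem_wait(). A minimal retry wrapper (standard POSIX
semaphores only, nothing system-specific; the helper name is made up):

#include <semaphore.h>
#include <errno.h>

/* Wait on a semaphore, restarting the wait if a caught signal
   interrupts it. Returns 0 on success, -1 with errno set otherwise. */
int sem_wait_retry(sem_t *sem)
{
    int rc;
    do {
        rc = sem_wait(sem);
    } while (rc == -1 && errno == EINTR);
    return rc;
}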

__________________________________________________
Jeff Denham ([email protected])

Bright Tiger Technologies
SmartCluster Software for Distributed Web Servers
http://www.brighttiger.com

=================================TOP=============
 Q171: pthreads and sprocs  

Peter Shenkin wrote:
> 
> Can someone out there say something about which is better to use
> for new code -- pthreads or sprocs -- and about what the tradeoffs
> are?

A good question.  I've done little with sprocs, but have used pthreads
on a 32p SGI Origin a good deal.

Sprocs are heavyweight processes; pthreads are MUCH lighter weight.

Sprocs have a considerable amount of O/S and programmer tool support 
for high performance programming; as yet, pthreads has almost none.

Sprocs lend themselves to good fine-grain control of resources
(memory, CPU choice, etc); as yet these strengths are largely lacking 
in SGI pthreads.

The project on which I work has bet the farm on the present and future 
high performance of pthreads and the results so far have been good.
However, we would dearly love for SGI and the rest of the parallel
programming community to support pthreads as well as they have supported
their former proprietary parallel programming models, so that we can
control our threads as precisely as we could our sprocs and the like.

Not really a complaint; more of a strong request.

The upshot is, IMHO, go ahead and program in pthreads on SGIs.  The
performance gain you would have gotten from better control of your
sprocs is made up for in the "portability" of pthreads, the rosier
future of pthreads, and their more modest system resource use.

Not my company's official position, I should add.

--
Randy Crawford
[email protected]


=================================TOP=============
 Q172: Why are Win32 threads so odd?  

Bil Lewis  wrote in article
<[email protected]>...
>   You must know the folks at MS who did Win32 threads (I assume).

Bad assumption. I know of them, but don't know them and they don't know me.

> Some of the design sounds so inefficient and awkward to use, while
> other bits look really nice. 

My own opinion is that Win32 grew from uni-thread to pseudo-multi-thread in
a very haphazard manner; basically, features were added when they were found
to be needed.
I personally dislike the overall asymmetric properties of the API. Consider
the current problem of providing POSIX condition variables: if you could
release a single waiter on a manual-reset event, or release all waiters on
an auto-reset event, then the problem would be much simpler to solve.
Consider also that there is a SignalObjectAndWait() but no corresponding
SignalObjectAndWaitMultiple() -- another change that would make writing
various forms of CVs easier.

>   Are my views widely shared in the MS world?  And why did they
> choose that design?  Have they thought of adopting the simpler
> POSIX design (if not POSIX itself)?

No idea, sorry. Try asking Dave Cutler; he was one of the main thread
architects, AFAIK.

David

Bil Lewis wrote:

>   The real question is: "What the heck was Cutler (I assume?) thinking when he made
> Win32 mutexes kernel objects?"
>

Well, Dave has a history in this department. Consider his
Digital VAXELN realtime executive, a direct predecessor of
WNT. It was designed and written by Cutler (and some other
folks who later contributed to NT, like Darryl Havens)
to run on the MicroVAX I waaaaay back in the early '80s
at DECWest in Seattle. (Development soon moved to a
dedicated realtime group at the Mill in Maynard.)

VAXELN had processes (called jobs) and threads (called
processes) and kernel objects (PROCESS, DEVICE, SEMAPHORE,
AREA, PORT, MESSAGE). It ran memory-resident on embedded
VAXes (or VAX 9000s, for that matter), and let you program
device drivers (or whatever) in Pascal, C, FORTRAN,
or even Bliss. (Pretty nifty little concurrent environment,
a little bit too far ahead of its time for its own good, I guess.)

The only synchronization object provided originally was the
semaphore, which, like the NT mutex, required a trip into
the kernel even for uncontested locking. This of course
proved to be too expensive for real-world concurrent
programming, so a library-based optimized synch. object
was developed.

It had a count and an embedded binary semaphore object
that could be locked quickly in user space through the
use of the VAX ADAWI interlocked-increment instruction.
A system call occurred only for blocking and wakeup on a
contested lock.

Sounds just like an NT critical section, huh?

Ironically, in VAXELN it was called a MUTEX! History
repeats itself, with only the names changed to protect
the guilty...

Jeff

=================================TOP=============
 Q173: What's the point of all the fancy 2-level scheduling??  
[email protected] wrote:

> In article <[email protected]>,
>   Jeff Denham  wrote:
> >
> > Boris Goldberg wrote:
> >
>
> > Seriously, I've been around this Solaris threads package long enough to
> > be wondering how often anyone is using PROCESS scope threads.
> > With everyone just automatically setting SYSTEM scope threads
> > to get the expected behavior, what's the point of all the fancy 2-level
> > scheduling??
>
> I think, it is better to use thr_setconcurrency to create #of processors
> + some additional number (for I/O bound threads) of LWPs rather than
> creating LWP for each thread. Can Dave Butenhof or somebody from
> Sun thread designer team please comment on this?

That's kinda funny, since I've got no connection with Sun.

The real problem is that Sun's "2 level scheduling" really isn't at all like
2-level scheduling should be, or was intended to be. There's a famous paper
from the University of Washington on "Scheduler Activations" (one of Jeff
Denham's replies to this thread mentioned that term, so you may have noticed
it), which provides the theoretical basis for modern attempts at "2 level
scheduling". Both Sun and Digital, for example, claim this paper as the
basis for our respective 2-level scheduling models.

However, while we (that's Digital, not Sun) began with a model of 2-way
communication between kernel and user schedulers that closely approximates
the intended BEHAVIOR (though not the detailed implementation) of scheduler
activations, I have a hard time seeing anything usefully similar in Solaris.
They have a signal to save the process from total starvation when the final
thread blocks in the kernel (by giving the thread library a chance to create
another LWP). We automatically generate a new "replacement VP" so that the
process always retains the maximum level of concurrency to which it's
entitled.

The advantages of 2-level scheduling are in performance and scaling.

  1. Scaling. Kernel threads are kernel resources, and (as with processes),
     there are strict limits to how many a kernel can handle. The limits are
     almost always fixed by some configuration process, and additionally
     limited by user quotas. Why? Because they're expensive -- not just to
     the process that uses them, but to the system as a whole. User-mode
     threads, on the other hand, are "just" virtual memory, and, in
     comparison, "dirt cheap". So you can create a lot more user threads
     than kernel threads. Yeah, the user threads can't all run at the same
     time... but neither can the kernel threads, because the number of
     processors (being a relatively expensive HARDWARE resource) is even
     more limited. The point is to balance the number of kernel threads
     against the "potentially parallelism" of the system (e.g., the number
     of processors), while balancing the number of user threads against the
     "potential concurrency" of the process (the maximum parallelism plus
     the maximum number of outstanding I/O operations the process might be
     able to undertake). [On Solaris, you do this manually by creating LWPs
     -- either by creating BOUND (SCS) threads, or by calling
     thr_setconcurrency (see the sketch after this list). On Digital UNIX,
     this is done for you automatically through the integration between user
     and kernel level schedulers.]
  2. Performance. In many typical workloads, most of the communication is
     between threads within the process. Synchronization involves mutexes
     and condition variables. A 2-level scheduler can optimize these types
     of synchronization, and the resulting context switches, without
     involving the kernel at all. A kernel thread needs to call into the
     kernel to block -- and then another thread needs to call into the
     kernel again to unblock it. A user thread (or a 2-level thread blocking
     in user mode) only needs to call into the thread library. Because a
     call into the kernel is more expensive than a call within the process
     (and usually LOTS more expensive), this can save a lot of time over the
     life of a process.
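
As a concrete illustration of the manual Solaris tuning mentioned in point 1,
here is a minimal sketch. It assumes the Solaris UI-thread call
thr_setconcurrency() from <thread.h> and sysconf(_SC_NPROCESSORS_ONLN); the
helper name and the "extra for I/O" count are just placeholders, and the right
number depends entirely on your workload:

#include <thread.h>    /* Solaris UI threads: thr_setconcurrency() */
#include <unistd.h>    /* sysconf() */

/* Ask for roughly one LWP per online processor, plus a few extra
   for threads expected to block in I/O. */
void tune_concurrency(int extra_for_io)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncpus < 1)
        ncpus = 1;
    thr_setconcurrency((int)ncpus + extra_for_io);
}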

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/
=================================TOP=============
 Q174: Using the 2-level model, efficency considerations, thread-per-X  
 
Bil Lewis wrote:

> My experience (which is fairly broad), and my opinion (which is
> extensive) is that THE way to do this is this:

Yes, Bil, your experience is extensive -- but unfortunately mostly limited
to Solaris, which has a poor 2-level scheduling design. (Sorry, guys, but
it's the truth.)

>   Use only System scoped threads everywhere.  On your selected
> release platforms, tune the number of threads for each configuration
> that's important.  (Take educated guesses at the rest.)  For servers,
> do everything as a producer/consumer model.  Forget Thread-per-request,
> thread-per-client, etc.  They are too hard to tune well, and too
> complex.

No, use SCS ("system scoped") threads only where absolutely necessary. But,
yeah, in recognition of your extensive experience, I would acknowledge that
on Solaris they are (currently) almost always necessary. (Solaris 2.6 is
supposed to be better than 2.5, though I haven't been able to try it, and I
know that Solaris developers have hopes to make substantial improvements in
the future.) This is, however, a Solaris oddity, and not inherent in the
actual differences between the PCS ("process scoped") and SCS scheduling
models.

On the rest, though -- I agree with Bil that you should avoid "thread per
request" in almost all cases. Although it seems like a simple extension
into threads, you usually won't get what you want. This is especially true
if your application relies on any form of "fairness" in managing the
inevitable contention between clients, because "thread per request" will
not behave fairly. You'll be tempted to blame the implementation when you
discover this, but you'll be wrong -- the problem is in the application.
The best solution is to switch to a client/server (or "producer/consumer")
model, where you control the allocation and flow of resources directly.
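
For readers who haven't seen one, here is a bare-bones sketch of the kind of
producer/consumer ("work queue") structure being recommended: a fixed pool of
worker threads draining a mutex/condition-variable protected queue. It is
deliberately minimal (no shutdown, no bounded queue, no error checking) and
the names are placeholders, not anybody's production code:

#include <pthread.h>
#include <stdlib.h>

typedef struct request {
    struct request *next;
    void (*work)(void *arg);
    void *arg;
} request_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    request_t      *head, *tail;
} workq_t;

void workq_init(workq_t *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
    q->head = q->tail = NULL;
}

/* Producer side: clients just enqueue a request and return. */
void workq_put(workq_t *q, request_t *req)
{
    req->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail)
        q->tail->next = req;
    else
        q->head = req;
    q->tail = req;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

/* Consumer side: a fixed number of these are created at startup. */
void *workq_server(void *arg)
{
    workq_t *q = arg;
    request_t *req;

    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->head == NULL)
            pthread_cond_wait(&q->nonempty, &q->lock);
        req = q->head;
        q->head = req->next;
        if (q->head == NULL)
            q->tail = NULL;
        pthread_mutex_unlock(&q->lock);
        req->work(req->arg);        /* do the work outside the lock */
        free(req);
    }
    return NULL;
}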

>   Process scoped threads are good for a very small number of unusual
> examples (and even there I'm not totally convinced.)

On the contrary, PCS threads are best except for the very few applications
where cross-process realtime scheduling is essential to correct operation
of the application. (E.g., for direct hardware access.)

>   Simplicity rules.

Right. (Fully recognizing the irony of agreeing with a simplistic statement
while disagreeing with most of the philosophy behind it.)

>   Logic:  Process scope gives some nice logical advantages in design,
> but most programs don't need that.  Most programs want to run fast.
> Also, by using System scoped threads, you can monitor your LWPs,
> knowing which is running which thread.

SCS gives predictable realtime scheduling response across the system, but
most programs don't need that. Most programs want to run fast, and you'll
usually get the most efficient execution, and the best management of system
resources, by using PCS threads. "Monitoring" your LWPs might be
comforting, but probably provides no useful understanding of the
application performance. You need an analysis tool that understands the
CONTENTION between your execution contexts, not merely the identity of the
execution contexts. Such an analysis tool can understand PCS threads as
well as SCS threads.

>   Anywhere you're creating threads dynamically, you need to know
> how many threads you're creating and ensure you don't create too
> many.  (Easy to mess up!)  By using a P/C model, you create exactly
> the right number of threads (tuned to machine, CPUs, etc.) and don't
> have to think about them.  If you run at lower than max capacity, having
> a few idle threads is of very little concern.

Remember that "concurrency" is much more useful to most applications than
"parallelism", and is harder to tune without detailed knowledge of the
actual workload. When you're doing I/O, your "concurrency width" is often
far greater than your "execution width". It's often useful, for example, to
dynamically increase the number of "servers" in a process to balance a
large workload, because each server might be blocked on one client's
request for a "long time". Dynamic creation isn't necessarily tied to a
simplistic "thread per request" model.

>   Opinions may be worth what you paid for 'em.

No, no. Opinions are hardly ever worth as much as you paid for them, and
usually a good deal less. Information, however, may be worth far more. One
might hope that in the process of airing our worthless opinions, we have
incidentally exposed some information that might help someone! ;-)

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/
=================================TOP=============
 Q175: Multi-platform threading api  

From: "Constantine Knizhnik" 
 

I have also developed a threads class library, which provides three
system-dependent implementations: one based on POSIX threads, one on Win32,
and one on portable cooperative multitasking using setjmp() and longjmp().
This library was used in the OODBMS GOODS for implementing the server and
client parts. If somebody is interested in this library (and in the OODBMS
itself), it can be found at

http://www.ispras.ru/~knizhnik


Jason Rosenberg wrote in message <[email protected]>...
>Hello,
>
>I am tasked with converting a set of large core C libraries
>to be thread-safe, and to use and implement a multi-platform
>api for synchronization.  We will need solutions for
>Solaris 2.5.1, Digital Unix 4.0B(D?), Irix 6.2-6.5,
>HP-UX 10.20, AIX 4.1.2+, Windows NT 4.0(5.0) and Windows 95(98).
>
>I have built a basic wrapper api which implements a subset
>of pthreads, and have it working under Digital Unix 4.0B,
>Irix 6.2 (using sprocs), and Windows NT/95.  I am in the
>process of getting it going on the other platforms.


=================================TOP=============
 Q176: Condition variables on Win32   
 

Hi,

> I don't see that this is justifiable.

Have you ever seen any of the tortuous attempts by bright fellows like
Jeffrey Richter to define relatively simple abstractions like
readers/writer locks using the Win32 synchronization primitives?  It's
not pretty...  In fact, the first several editions of his Advanced
Windows book were full of bugs (in contrast, we got that right in ACE
using CVs in about 20 minutes...).  If someone of his calibre can't
get this stuff right IN PRINT (i.e., after extensive reviews), I don't
have much faith that garden-variety Win32 programmers are going to
have a clue...

> It might be harder if you think in terms of the POSIX facilities. I
> wouldn't say that the combination of a semaphore and mutex or
> critsec is hard though, and the inconvenience of having to acquire
> the mutex after waiting on the semaphore is balanced against
> checking for false wakeups and being unable to signal n resources.

Checking for false wakeups is completely trivial.  I'm not sure what
you mean by "signal n resources".  I assume you're referring to the
fact that condition variables can't be used directly in a
WaitForMultipleObjects()-like API.  Though clearly you can broadcast
to n threads.

> I had in mind that it is allowed to notify more than one thread
> (which always seemed odd to me) but I don't have my POSIX spec
> handy.  Just a nit, but it does stress that false wakeups must be
> handled.

I had it this way originally, but others (hi David ;-)) pointed out
that this was confusing, so I'm grudgingly omitting it from this
discussion.  I suppose anyone who really wants to understand POSIX CVs
ought to read more comprehensive sources (e.g., Bil's book) than my
paper!

> I wish.  In fact they store details about the owning thread and
> support recursive acquisition.  I think this was a screwup by the NT
> designers - critical sections are needlessly expensive.  For very
> basic requirements (I believe) you can get a performance gain using
> the InterlockedIncrement et al for spin locks and an auto reset
> event to release waiters.  (Not sure I can prove it at the moment
> though.  If it makes a measurable difference, you have bigger
> problems than the efficiency of mutexes)

BTW, there's been an interesting discussion of this on the
comp.programming.threads newsgroup recently.  You might want to check
this out.

> This occurs in several examples from 3.1 on.  (Most of them? ...).
> Clear copy-paste bug I'd guess.  Better change to
> ReleaseMutex(external_mutex).

Hum, I'm not sure why you say this.  In all these cases the
"pthread_mutex_t" is typedef'd to be a CRITICAL_SECTION.  Am I missing
something here?

> I think this is rather optimistic.  There is no guarantee that any
> of them will release the mutex in a timely fashion. My original
> objection was to a solution that used a semaphore or other count of
> tokens, and that one thread could loop quickly and steal several
> tokens, leaving threads still blocked.

Right, that was the original discussion that triggered this paper.

> >  EnterCriticalSection (external_mutex);
> Another copy-paste bug.  WaitForSingleObject?

Can you please point out where you think these problems are occurring?
As far as I can tell, everything is typedef'd to be a CRITICAL_SECTION
(except for the SignalObjectAndWait() solution).

Take care,

        Doug
=================================TOP=============
 Q177: When stack gets destroyed relative to TSD destructors?  

Douglas C. Schmidt wrote:

>         Can someone please let me know if POSIX pthreads specifies
> when a thread stack gets destroyed relative to the time at which the
> thread-specific storage destructors get run?  In particular, if a
> thread-specific destructor accesses a pointer to a location on the
> run-time stack, will this memory still exist or will it be gone by the
> time the destructor runs?

Thread-specific data destructors must run in the context of the thread
that created the TSD value being destroyed. (This is clearly and
unambiguously implied by the standard. That is, while the standard
doesn't explicitly require this, an implementation that called
destructors in some other context would present a wide range of severe
restrictions in behavior that are not allowed by the standard.) Thus, the
stack must exist and remain valid at this point.

After a thread has terminated (having completed calling all cleanup
handlers and destructors), "the result of access to local (auto)
variables of the thread is undefined". E.g., at this point, (possibly
before a pthread_join or pthread_detach), the stack may have been
reclaimed.
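
A small example of the mechanism being described: a key whose destructor runs
in the terminating thread's own context, while that thread's stack is still
valid. The key name and the malloc'd buffer are just stand-ins for whatever a
real destructor would free or flush:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_key_t buf_key;

/* Runs in the exiting thread's context, before its stack is reclaimed. */
static void buf_destructor(void *value)
{
    printf("destructor: freeing per-thread buffer %p\n", value);
    free(value);
}

static void *thread_main(void *arg)
{
    (void)arg;
    pthread_setspecific(buf_key, malloc(128));
    return NULL;                /* destructor runs after this returns */
}

int main(void)
{
    pthread_t t;
    pthread_key_create(&buf_key, buf_destructor);
    pthread_create(&t, NULL, thread_main, NULL);
    pthread_join(t, NULL);
    return 0;
}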

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/
=================================TOP=============
 Q178: Thousands of mutexes?  

Peter Chapin wrote:

> I'm considering a program design that would involve, potentially, a large
> number of mutexes. In particular, there could be thousands of mutexes
> "active" at any one time. Will this cause a problem for my hosting
> operating system or are all the resources associated with a mutex in my
> application's address space? For example, in the case of pthreads, are
> there any resources associated with a mutex other than those in the
> pthread_mutex_t object? Is the answer any different for Win32 using
> CRITICAL_SECTION objects? (I know that there are system and process limits
> on the number of mutexes that can be created under OS/2... last I knew it
> was in the 64K range).

POSIX mutexes are usually user space objects, so the limit is purely based on
your virtual memory quotas. Usually, they're fairly small. Some obscure hardware
may require a mutex to live in special memory areas, in which case there'd be a
system quota -- but that's not relevant on any modern "mainstream" hardware.

On an implementation with 1-1 kernel threads (AIX 4.3, HP-UX 11.0, Win32), there
must be some "kernel component" of a mutex -- but this may be no more than a
generic blocking channel, with the actual synchronization occurring in
user-mode, so there may be no persistent kernel resources involved. Win32
mutexes are pure kernel objects -- critical sections, I believe, are user
objects with a kernel blocking channel (but I don't know whether the kernel
resource is persistent or dynamic).

Similarly, even on a 2-level scheduling implementation (Solaris, Digital UNIX,
and IRIX), a "process shared" mutex (a POSIX option that allows placing a mutex
in shared memory and synchronizing between processes) requires a kernel blocking
channel: but again, the persistent state may live completely in user-space. A
process private (default) mutex, on a 2-level scheduling implementation, is
almost certainly always a pure user-mode object.

Any more detailed answers will require knowing exactly what O/S (and version)
you intend to use.

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/

=================================TOP=============
 Q179: Threads and C++  

 
In article <[email protected]>,
  Dave Butenhof  wrote:
>
> Federico Fdez Cruz wrote:
>
> >         When I create a thread using pthread_create(...), can I ask the
> > new thread to execute an object method instead of a "normal" function?
> >         I have tried this, but it seems like the thread doesn't know anything
> > about the object. If I write code in the method that doesn't use any
> > member of the object, all goes well; but If I try to access a member
> > function or a member in the object, I get a segmentation fault.
> >         I have seen that when the new thread is executing, "this" is NULL from
> > inside this thread.
>
> Yes, you can do this. But not portably. There's no portable C++ calling
standard.

[Snipped]

> My advice would be to avoid the practice. POSIX 1003.1c-1995 is a C language
> standard, and there are, in general, few portability guarantees when crossing
> language boundaries. The fact that C and C++ appear "similar" is, in some ways,
> worse than crossing between "obviously dissimilar" languages.

In fact, there are significant advantages to doing MT programming in C++.
All the examples in Butenhof's book follow a model of cleaning up (C++
destructor) at the end of a block or on error (C++ exception), and can be
written more elegantly in C++ using objects (constructors/destructors) and
exceptions. Of course, you should read Butenhof's book for the threads (and
not the C++).



> Use a static non-member function for pthread_create, and have it call your member
> function. There's no need to bend the rules for a minor convenience, even when
> you can depend on an implementation where the direct call happens to work.
>

You can use the following method in C++ :

Thread.h
--------
class Thread
{
  pthread_t thread_;
public:
  typedef void* (*StartFunction)(void* arg);
  enum DetachedState { joinable, detached };

  Thread(StartFunction sf, void* arg, DetachedState detached_state);
  void join();
  ...
};
// use this to create a thread with a global function entry.
inline Thread Thr_createThread(Thread::StartFunction sf,
                               void* arg,
                               Thread::DetachedState ds)
{
   return Thread(sf, arg, ds);
}

// This is for C++ class member function (non-static)
template <class T>
class ThrAction
{
public:
   typedef void* (T::*MemberFunction)();

   ThrAction(T* object, MemberFunction mf)
     : object_(object), mf_(mf) {}
   void* run() {
     return (object_->*mf_)();
   }
private:
   T* object_;
   MemberFunction mf_;
};

// an implementation class : notice the friend
template <class Action>
class ThreadImpl
{
  static void* start_func(void* arg) {
    return ((Action*)arg)->run();
  }
  friend Thread Thr_createThread(Action* action,
                                 Thread::DetachedState ds);
};

// You use this to create a thread using a member function
template <class Action>
inline Thread Thr_createThread(Action* action,
                               Thread::DetachedState ds)
{
  return Thread(ThreadImpl<Action>::start_func, action, ds);
}

Now, you can use a C++ member function:

class MyClass
{
  typedef ThrAction<MyClass> Action;
  Action monitor_;
public:
  void* monitor() {  // this will be called in a new thread
    // ...
  }

  MyClass()
    : monitor_(this, &MyClass::monitor)
  {
    // start a thread that calls monitor() member function.
    Thread thr = Thr_createThread(&monitor_, Thread::detached);
    ...
  }
};




Thread.cc
---------

struct Detached
{
  pthread_attr_t attr_;
  Detached() { pthread_attr_init(&attr_);
               pthread_attr_setdetachstate(&attr_, PTHREAD_CREATE_DETACHED); }
  ~Detached() { pthread_attr_destroy(&attr_); }
};

static Detached detached_attr;

// depends on the static variable detached_attr; so make it
// out-of-line (in Thread.cc file).
Thread::Thread(StartFunction sf, void* arg, DetachedState ds)
{
  pthread_create(&thread_,
                 ds == detached ? &detached_attr.attr_ : 0,
                 sf, arg);
}


I wrote it in a hurry. I hope it helps.

- Saroj Mahapatra





In article <[email protected]>, Kaz Kylheku  wrote:
++ I second that! I also use the wrapper solution. Indeed, I haven't
++ found anything in the C++ language definition which suggests that static member
++ functions are like ordinary C functions.

You are correct.  Most C++ compilers DO treat static member functions
like ordinary C functions, so it's usually possible to pass static C++
member functions as arguments to thread creation functions.  However,
some compilers treat them differently, e.g., the OpenEdition MVS C++
compiler doesn't allow static member functions to be used where
ordinary C functions are used, which is a PAIN.

BTW, if you program with ACE
(http://www.cs.wustl.edu/~schmidt/ACE.html) it hides all of this
madness from you so you can write a single piece of source code
that'll work with most C++ compilers.

Take care,

        Doug



I thought all those concerned with developing multi-threaded software
using the STL and C++ might be interested in the topic of STL and thread
safety.  I just bought the July/August 1998 issue of C++ Report and
within there is an article concerning the testing for thread safety of
various popular implementations of STL.  These are the published
results:

 STL implementation     Thread-safe?
 ------------------     ------------
 Hewlett-Packard            NO
 Rogue Wave                 NO
 Microsoft                  NO
 Silicon Graphics           YES
 ObjectSpace                YES

========================  But:   ================

You've missed rather a lot of discussion on this topic in the intervening
months. A few supplemental facts:

1) The definition of ``thread safe'', while not unreasonable, is also not
universal. It is the working definition promulgated by SGI and -- surprise! --
they meet their own design requirement.

2) Hewlett-Packard provided the original STL years ago, at a time when
Topic A was finding a C++ compiler with adequate support for templates.
Thread safety was hardly a major design consideration.

3) Rogue Wave was quick to point out that the version of their code
actually tested was missing a bug fix that had been released earlier.
The corrected code passes the tests in the article.

4) Microsoft has been the unfortunate victim of some messy litigation.
(See web link in footer of this message.) If you apply the fixes from:

http://www.dinkumware.com/vc_fixes.html

then VC++ also passes the tests in the article. Its performance also
improves dramatically.

The C++ Report ain't Consumer Reports. Before you buy on the basis
of an oversimplified table:

a) Make sure your definition of ``thread safety'' agrees with what the
vendor provides.

b) Make sure you get the latest version of the code.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com/hot_news.html
-- 





> and I would like to implement the functionality of the Java interface
> Runnable. Actually I suppose both of the following questions are
> really c++ questions, but I'm more optimistic to find the competent
> audience for my problems here...
>
> My first question is: Why does it make a difference in passing the
> data Argument to pthread_create, whether the function is a member
> function or not?

Because a member function is not a C function. A member function pointer
is a combination of a pointer to the class, and the index of the function
within the class. To make a member function "look" like a C function,
you must make it a static member function.

> In the following code I had to decrement the data pointer address
> (&arg) by one. This is not necessary, if I define run() outside of any
> class.

This is dangerous!! On another compiler, a different platform, or
the next release of the same compiler, this may or may not work.
It is really a happy accident that it does work in this case. The canonical
form for C++ is to pass the address of a static
member function to pthread_create(), and pass the address
of the object as the argument parameter to pthread_create().
The static member function then calls the non-static member
by casting the void *arg to the class type.

[...]


> The second question is: I've tried to use the constructor of Thread to
> start run() of the derived class as the thread. For this I've
> implemented run() in the base class as pure virtual. But I didn't
> succeed because the thread always tried to run the pure virtual base
> function. Why is this?

Because during the constructor of the base class, the object *is* a base
object. It doesn't know that you have derived something
else from it. It is not possible to call a derived class's virtual
functions during construction or destruction of a base class.

[...]

Here is a minimalist emulation of the Java Runnable and Thread
interface: no error checks, many routines left out, no thread groups
and so on.

-----------------------------------------------------------------------

#include <pthread.h>

// ----------------------------------------------------------------------

class Runnable
{
 public:
  virtual ~Runnable();
  virtual void run() = 0;
};

Runnable::~Runnable()
{
}

// ----------------------------------------------------------------------

class Thread
{
 public:
  Thread();
  Thread(Runnable *r);
  virtual ~Thread();

  void start();
  void stop();
  void join();

  virtual void run();

 private:
  static void *startThread(void *object);
  void runThread();

 private:
  Runnable *target;
  // 0=not started, 1=started, 2=finished
  int state;
  pthread_t thread;
};

Thread::Thread()
  : target(0),
    state(0)
{
}

Thread::Thread(Runnable *r)
  : target(r),
    state(0)
{
}

Thread::~Thread()
{
  if (state == 1)
    pthread_cancel(thread);
  pthread_detach(thread);
}

void Thread::start()
{
  pthread_create(&thread, 0, &Thread::startThread, this);
}

void Thread::stop()
{
}

void Thread::join()
{
  void *value = 0;
  pthread_join(thread, &value);
}

void Thread::run()
{
}

void *Thread::startThread(void *object)
{
  Thread *t = (Thread *) object;
  t->runThread();
  return 0;
}

void Thread::runThread()
{
  state = 1;
  if (target)
    target->run();
  else
    run();
  state = 2;
}

// ----------------------------------------------------------------------

#include <stdio.h>

class Test : public Runnable
{
 public:
  void run();
};

void Test::run()
{
  printf("thread run called\n");
}

int main(int argc, char *argv[])
{
  Thread t(new Test);
  t.start();
  printf("thread started\n");
  t.join();

  return 0;
}

> I've run into a trouble when I found out that when I cancel a thread via
> pthread_cancel() than destructors for local object do not get called.
> Surprising :). But how to deal with this? With a simple  thread code
> it would not be a big problem, but in my case it's fairly complex code,
> quite a few STL classes etc. Has someone dealt with such a problem and is
> willing to share his/her solution with me? I thought I could 'cancel' the
> thread via pthread_kill() and raise an exception within a signal handler,
> but it's probably NOT a very good idea, is it? ;)
> Thank you,

  Unfortunately, not surprising.  C++ has not formally decided what to do with
thread cancellation, so it becomes compiler-specific.  The Sun compiler (for 
example) will run local object destructors upon pthread_exit() (hence 
cancellation also).  Others may not.

  I suppose the best GENERAL C++ solution is:

    a) Don't use stack-allocated objects.
    b) Don't use cancellation.

  Otherwise you can simply insist on a C++ compiler that runs the destructors.



--
Ian Johnston, FX Architecture, UBS, Zurich

=================================TOP=============
 Q180: Cheating on mutexes  


Hi all!  Howz things goin?  Just got back from the COOTS conference
where I learned all sorts of valuable lessons ("Don't try to match
Steve Vinoski drink for drink", "Snoring during dull presentations
is not appreciated").  As to this question...

[NOTE: SINCE THIS WAS WRITTEN, SOME THINGS HAVE CHANGED AND THE 
ADVICE BELOW IS NO LONGER VALID. THE ISSUE IS THAT WITH SOME MODERN CPUS
IT IS POSSIBLE THAT A VARIABLE WHICH WAS SET IN CPU #1 WILL NOT BE 
VISIBLE TO CPU #2. SO DON'T GET TRICKY -- USE THE MUTEX!]

  Pretty much everybody's been largely correct, but a little excessive.

  If we define the objective to be "I have a variable which will
undergo a single, atomic state change; can I test it without a mutex?"
then the answer is "yes, if you do things right."  In particular,
if you want to go from FALSE to TRUE, and you don't care if you see
the change synchronously, then you're OK.

  This is how spin locks work, this is how pthread_testcancel works
(at least on Solaris), and both Dave B & I talk about how to use 
this for pthread_once.

  With spin locks, we test the ownership bit until it becomes "free".
Then we do a trylock on it.  If somebody else gets it first, we go
back to spinning.

  With pthread_testcancel() we test the cancellation flag for our
thread w/o a lock.  If it ever becomes true, we exit.  (The setter
will set it to true under mutex protection, so that upon mutex unlock,
the value will be quickly flushed to main memory.)

  With pthread_once(), we'll insert a test BEFORE calling pthread_once,
testing a variable.  If it's true, then we know pthread_once has executed
to completion and we can skip the test.  If it's false, then we need to
run pthread_once(), which will grab the proper lock, and do the testing
under that lock, just in case someone else was changing it at that instant.

  So...  If you're very, very, careful and you don't mind missing the exact
point of initial change...  you can get away with it safely.
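
And, for reference, the un-tricky version: let pthread_once() do all the
locking and testing itself. The init_once()/ready names below are just
placeholders:

#include <pthread.h>

static pthread_once_t once_control = PTHREAD_ONCE_INIT;
static int ready;                 /* example state set by init_once() */

static void init_once(void)
{
    ready = 1;                    /* one-time initialization goes here */
}

void use_it(void)
{
    /* Safe from any number of threads; pthread_once() does the locking
       and the double-checking for you. */
    pthread_once(&once_control, init_once);
    /* ... use the initialized state ... */
}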

-Bil


> > ...  The real
> > trouble is that if you don't use some kind of synchronisation
> > mechanism, the update may not be seen at other CPUs *at all*.
> ...
> 
> Again donning my newbie hat with the point on top, why not?
> 
> For example, might a pthreads implementation on a distributed-
> memory architecture not propagate global variables to the other
> CPUs at all, in the absence of something like a mutex?

=================================TOP=============
 Q181: Is it possible to share a pthread mutex between two distinct processes?  


>
> ie: some way to attach to one like you can attach to shared memory.
>
> Same question for condition variables.

The answer is (as often happens) both YES and NO. Over time, the balance
will shift strongly towards YES.

The POSIX standard provides an option known commonly as "pshared", which,
if supported on your implementation, allows you to allocate a
pthread_mutex_t (or pthread_cond_t) in shared memory, and initialize it
using an attributes object with a specific attribute value, such that two
processes with access to the shared memory can use the mutex or condition
variable for synchronization.

Because this is an OPTION in the POSIX standard, not all implementations
will provide it, and you cannot safely count on it. However, the Single
UNIX Specification, Version 2 (UNIX 98) requires that this POSIX option be
supported on any validated UNIX 98 implementation.

Implementations that provide the pshared option will define the
preprocessor symbol _POSIX_THREAD_PROCESS_SHARED in the <unistd.h> header
file.

For example,

     pthread_mutexattr_t mutattr;

     pthread_mutexattr_init (&mutattr);
     pthread_mutexattr_setpshared (&mutattr, PTHREAD_PROCESS_SHARED);
     pthread_mutex_init (&mutex, &mutattr);
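
To round out the example, here is one hedged way to get the "shared memory"
part as well: an anonymous shared mapping inherited across fork(). (A
shm_open() or System V segment works the same way for unrelated processes;
MAP_ANON is spelled MAP_ANONYMOUS on some systems, and error checking is
omitted.)

#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* The mutex lives in memory shared by parent and child. */
    pthread_mutex_t *mutex = mmap(NULL, sizeof *mutex,
                                  PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANON, -1, 0);
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(mutex, &attr);

    if (fork() == 0) {                 /* child */
        pthread_mutex_lock(mutex);
        /* ... touch shared data ... */
        pthread_mutex_unlock(mutex);
        _exit(0);
    }
    pthread_mutex_lock(mutex);         /* parent */
    /* ... touch shared data ... */
    pthread_mutex_unlock(mutex);
    return 0;
}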

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/

=================================TOP=============
 Q182: How should one implement reader/writer locks on files?  

> How should one implement reader/writer locks on files?
> The locks should work accross threads and processes.

The only way to lock "a file" is to use the fcntl() file locking
functions. Check your man page. HOWEVER, there's a big IF... these locks
are held by the PROCESS, not by the THREAD. You can't use them to
control access between multiple threads within a process.

If you are interested in a mechanism outside the file system, you could
use UNIX98 read/write locks, with the pshared option to make them useful
between processes (when placed in shared memory accessible to all the
processes). However, UNIX98 read/write locks are not currently available
on most UNIX implementations, so you'd have to wait a while. Of course
you'd have to work out a way to communicate your shared memory section
and the address of the read/write lock(s) to all of the processes
interested in synchronizing. Also, while there are ways to make
fcntl() locking mandatory instead of advisory (at least, on most
systems), there's no way to do this with external locking.
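
A minimal sketch of the fcntl() approach; remember the caveat above that
these locks are per-process and advisory, so they will not arbitrate between
threads of a single process:

#include <fcntl.h>
#include <unistd.h>

/* Lock the whole file for reading (shared) or writing (exclusive),
   blocking until the lock is granted. Returns 0, or -1 with errno set. */
int lock_whole_file(int fd, int exclusive)
{
    struct flock fl;
    fl.l_type = exclusive ? F_WRLCK : F_RDLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                 /* 0 means "to end of file" */
    return fcntl(fd, F_SETLKW, &fl);
}

int unlock_whole_file(int fd)
{
    struct flock fl;
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    return fcntl(fd, F_SETLKW, &fl);
}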

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/
=================================TOP=============
 Q183: Are there standard reentrant versions of standard nonreentrant functions?  
 

| Certain standard functions found in C programming environments,
| such as gethostbyname, are not reentrant and so are not safe
| for use by multithreaded programs.  There appear to be two
| basic approaches to providing thread-safe versions of these
| functions:
| 
| (1) Reimplement the functions to use thread-local storage.
|     This is the approach that Microsoft has taken.  It's
|     nice because the interface is exactly the same so you
|     don't have to change existing code.

Can you cite documentation that Microsoft has done this consistently? 
(I'd love to be able to rely on it, but haven't been able to pin this
down anywhere.)
 
| (2) Provide alternate reentrant interfaces.  This is the
|     approach taken by (some/most/all?) Unix vendors.  The
|     reentrant version of the function has the same name
|     as the non-reentrant version plus the suffix _r.  For
|     example, the reentrant version of gethostbyname is
|     gethostbyname_r.
| 
| The big problem I'm having with approach (2) is that the
| reentrant versions are not the same across different Unixes.
| For example, the AIX 4.2 and Solaris 2.5 gethostbyname_r
| interfaces are much different (the Solaris interface is
| horrendous, I must say).

FYI, having dealt with this on a couple of Unix systems:  There's the
way Solaris does it, and the way everyone else does it.  While
"everybody else" may not be 100% consistent, the differences are pretty
minor, and often can be worked around with an appropriate typedef.

To be fair to Sun, Solaris probably got there first, and others chose to
do things differently; but that's the end result.  BTW, if you read the
man pages for things like gethostbyname_r, you'll find a notation that
says that the function is "provisional" or something like that, and may
go away in future releases.  There's no change through Solaris 2.6, and
some indication in the release notes that later Solaris versions will
play some games to support both the traditional Solaris API and the
newer standards -- wherever they are drawn from.

|                          While this is par for the Unix
| course, I'm somewhat surprised that these interfaces are
| not specified by POSIX.  Or are they?  Is there some
| attempt underway to standardize?  Is there some set of
| _r functions that are specified by POSIX, and if so,
| where can I find this list?

*Some* of them were standardized.  Dave Butenhof's "Programming with
POSIX Threads" lists the following:  getlogin_r readdir_r strtok_r
asctime_r ctime_r gmtime_r localtime_r getgrgid_r getgrnam_r getpwuid_r
getpwnam_r.  Also, a few functions (ctermid is an example) were declared
thread-safe if certain restrictions were followed.
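
To show what the _r flavor buys you, compare strtok() with strtok_r(): the
reentrant version keeps its scanning position in a caller-supplied pointer
instead of hidden static state, so two threads (or two interleaved scans)
cannot trample each other:

#include <string.h>
#include <stdio.h>

void print_words(char *line)
{
    char *saveptr;                /* per-call state, not a hidden static */
    char *word = strtok_r(line, " \t", &saveptr);

    while (word != NULL) {
        printf("%s\n", word);
        word = strtok_r(NULL, " \t", &saveptr);
    }
}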

None of the socket-related calls are on this list.  The problem, I
suspect, is that they were not in any base standard:  They're part of
the original BSD socket definition, and that hasn't made it into any
official standard until very recently.  As I recall, the latest Unix
specifications, like the Single Unix Specification (there really aren't
all that *many* of them, but the names change so fast I, for one, can't
keep up), do standardize both the old BSD socket interface, and the "_r"
variants (pretty much as you see them in AIX).

BTW, standardization isn't always much help:  localtime_r may be in the
Posix standard, but Microsoft doesn't provide it.  (Then again,
Microsoft doesn't claim to provide support for the Posix threads API, so
why would you expect it to provide localtime_r....)  You still have to
come up with system-dependent code.
                            -- Jerry
=================================TOP=============
 Q184: Detecting the number of cpus  
 

[email protected] wrote:
> 
> I have responding to my own posts but I forgot that NT also defines an
> environment variable, NUMBER_OF_PROCESSORS.  Win95/98 may do so as well.
> 
>                 Bradley J. Marker
> 
> In article <[email protected]>, [email protected] writes:
> 
> >Win95 only used a single processor the last I looked and there were no plans
> >for SMP for Win98 that I've heard.  I'd personally love SMP on 98 if it
> >didn't
> >cost performance too much.
> >
> >On NT you can get the processor affinity mask and count the number of bits
> >that are on.  Anybody have a better method?
> >
> >sysconf works on IRIX as well as Solaris.  sysconf(_SC_NPROC_CONF) or
> >sysconf(_SC_NPROC_ONLIN) (on Solaris it is NPROCESSORS instead of NPROC).
> >You
> >probably want on-line.  By the way, on an IRIX 6.4 Origin 2000 I am getting
> >sysconf(_SC_THREAD_THREAD_MAX) equal to 64.  Just 64 threads max?  I have
> >multithreaded test programs running with more threads than that (or they
> >seem
> >to be working, anyway).
> >
> >Anybody know how to control the number of processors the threads run on for
> >IRIX?  I'd like both the non-specific run on N processors case and the
> >specifically binding to the Nth processor case.  With Solaris I've been
> >using
> >thr_setconcurrency and processor_bind.
> >
> >Sorry but I don't know for Digital Unix, IBM AIX, or HP-UX.
> >
> >               Bradley J. Marker

In Win32, GetSystemInfo fills in a struct that contains a count of
the number of processors, among other things.
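
Pulling the answers in this thread together, a small portability sketch. The
sysconf name varies as noted above (_SC_NPROCESSORS_ONLN on Solaris and
several others, _SC_NPROC_ONLN on IRIX), so treat the ifdefs as a starting
point rather than a complete list:

#ifdef _WIN32
#include <windows.h>
#else
#include <unistd.h>
#endif

/* Best-effort count of processors currently online. */
int online_cpus(void)
{
#ifdef _WIN32
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    return (int)si.dwNumberOfProcessors;
#elif defined(_SC_NPROCESSORS_ONLN)
    return (int)sysconf(_SC_NPROCESSORS_ONLN);
#elif defined(_SC_NPROC_ONLN)             /* IRIX spelling */
    return (int)sysconf(_SC_NPROC_ONLN);
#else
    return 1;                             /* give up gracefully */
#endif
}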

=================================TOP=============
 Q185: Drawing to the Screen in more than one Thread (Win32)  

Note: Followup-to: set to comp.os.ms-windows.programmer.win32

[long post removed, see the thread :-) ]

Maybe I'm wrong(*), but AFAIR:

    You can only draw in a window from the thread which owns the
window (the one which created the window). This is the thread which
receives all the messages targeted to the window in its thread
message queue (each thread receives messages for the windows it
creates).
    From what I remember, it works with TextOut() because the second
thread sends a message to the first one (which owns the window), which
then does the job.
    So if you use a locking mechanism between the two threads for
accessing the window, you may end up in a deadlock (thread 2 waiting for
thread 1 to paint, and thread 1 waiting for thread 2 to release the
access).

    Maybe by defining a user message with adequate parameters, and
posting it (the second thread then becomes immediately ready to continue
number crunching) to the thread 1 message queue, you can achieve a
good update with a minimum of thread 2 locking.

A+

Laurent.

=================================TOP=============
 Q186: Digital UNIX 4.0 POSIX contention scope  

> I recently found myself at the following website, which describes the
> use of pthreads under Digital Unix 4.0.  It is dated March 1996, so
> I am wondering how up to date it is.
>
> http://www.unix.digital.com/faqs/pub
> ications/base_doc/DOCUMENTATION/HTML/AA-Q2DPC-TKT1_html/thrd.html
>
> It refers to several unimplemented optional funtions from Posix
> 1003.1c 1995, including pthread_setscope.  So I am wondering, then,
> what sort of "scope" do dec pthreads have, are they all system level,
> or all process level, etc.

Digital UNIX 4.0 (through 4.0C) did not support POSIX contention scope.
It was just one of those things that "missed the cut". All POSIX threads
are process contention scope (PCS). Digital UNIX 4.0D supports the scope
attribute. (Since 4.0D has been shipping for some time, it appears that
the web link you've found is not up to date.)

On 4.0D, threads are PCS by default, (as they should be), but you can
create SCS (system contention scope) threads for the rare situations
where they're necessary. (For example, to share realtime resource
directly with hardware, or with OS threads, etc.)


=================================TOP=============
 Q187: Dec pthreads under Windows 95/NT?  
> Also, appendix C refers to dec pthreads under Windows 95/NT.  Is that
> a reality?

Depends on what you mean by "reality". Yes, we have DECthreads running
on Win32 "in the lab", and have for some time. In theory, given
sufficient demand and certain management decisions regarding pricing and
distribution mechanism, we could ship it. Those decisions haven't yet
been made. (If you have input on any of this, send me mail; I'd be glad
to forward it to the appropriate person. If you can say whether you're
"just curious" or "want to buy" [and particularly if you can say how
much you'd pay], that information would be useful.)


=================================TOP=============
 Q188: DEC current patch requirements  

> It also doesn't describe the current patch requirements, etc., for
> 4.0B.

The Guide to DECthreads is a reference manual, not "release notes", and
is not updated routinely for patch releases. The version you're reading
is clearly for 4.0 through 4.0C, and there's a new version for 4.0D. We
still haven't managed to find time to push through the sticky fibres of
the bureaucracy to get a thread project web page, on which we could post
up-to-date information like current patches and problems.

In general, you should just keep up with the latest patch kit. You can
always keep an eye on the patch FTP directory for your release, under

      ftp://ftp.service.digital.com/public/Digital_UNIX/


=================================TOP=============
 Q189: Is there a full online version of 1003.1c on the web somewhere?  
> Is there a full online version of 1003.1c on the web somewhere?

No. The IEEE derives revenue from sale of its standards, and does not
give them away. I understand this policy is "under review". It doesn't
really matter, though, unless you intend to IMPLEMENT the standard.
1003.1c is not a reference manual, and if you want to learn how to use
threads, check out a book that's actually written to be read; for
example, my "Programming with POSIX Threads" (Addison-Wesley) or Bil
Lewis' "Multithreaded Programming with Pthreads" (Prentice Hall) [which,
I see, is so popular that someone has apparently stolen my copy from my
office: well, after all, the spine IS somewhat more colorful than my
book ;-) ].

On the other hand, what IS freely available on the web is the Single
UNIX Specification, Version 2, including "CAE Specification: System
Interfaces and Headers, Issue 5", which is the new UNIX98 brand
specification that includes POSIX threads (plus some extensions). This
document includes much of the text of POSIX 1003.1c, though in slightly
altered form. Check it out at

      http://www.rdg.opengroup.org/onlinepubs/7908799/toc.htm

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/

=================================TOP=============
 Q190: Why is there no InterlockedGet?  

> ) My question is renewed, however.  Why is there no InterlockedGet and
> ) InterlockedSet.  It seems under the present analysis, these would be
> ) quite useful and necessary.  Their absence was leading me to speculate
> ) that Intel/Alpha/MS insure that any cache incoherence/lag is not
> ) possible.
> 
> InterlockedExchange and InterlockedCompareExchange give you
> combined Get and Set operations, which are more useful.

Unfortunately, InterlockedCompareExchange is not available
under Windows '95, only NT.  But yes, I agree with you....
=================================TOP=============
 Q191: Memory barrier for Solaris  

>I was wondering if anyone knew how to use memory barriers 
>in the Solaris environment. I believe that Dave B.
>posted one for the DEC Alphas.

I assume you use Solaris on a Sparc.

First you should decide whether you are programming for Sparc V8 or Sparc 
V9. Buy the appropriate Architecture manual(s) from Sparc International 
(see: http://www.sparc.com/sparc.new/shop/docs.html )

I have not seen the Sparc Architecture manuals on-line. If someone has, I 
would be grateful for a pointer...

The Sparc chips can be set in different modes regarding the memory model 
(RMO, PSO, TSO). You need to understand the concepts by reading the 
Architecture manual (chapters 6 and J in V8, chapters 8, D and J in V9). 
It is also helpful to know which ordering model Solaris uses for your 
process.

In V8, the "barrier instruction" you are looking for is "stbar". You can 
use it by specifying
        asm(" stbar");
in your C code.

In V9, the architecture manual says:

"The STBAR instruction is deprecated; it is provided only for compatibility 
with previous versions of the architecture. It should not be used in new 
SPARC-V9 software. It is recommended that the MEMBAR instruction be used in 
its place."

The deprecated stbar instruction is equivalent to MEMBAR #StoreStore.

In V9, "memory barriers" are done with the membar instructions. As far as I 
can see, there are 12 different types of the instructions, depending on the 
type of memory barrier you want to have (check the architecture manual).

=================================TOP=============
 Q192: pthread_cond_t vs pthread_mutex_t  

Jason Mancini wrote:

> I wrote a small program that loops many times,
> locking and unlocking 3 mutexes.  The results are
> 4.2 million mutex lock-unlocks per second.  Doing the same
> for two threads that wait and signal each other results
> in 26,000 wait-signals per second using conditional
> variables.

Of course, this information is largely useless without knowing what
hardware and software you're using. But nevermind that -- it probably
doesn't matter right now that the numbers are meaningless.

> Any explanations as to why conds are so much slower
> than mutexes?  There are no collisions in any of the
> mutex acquisitions.  Also it seems like the mutex rate
> should be higher that it is.

So, you're trying to compare the performance of UNCONTENDED (that is,
non-blocking) mutex lock/unlock versus condition variable waits. Note
that waiting on a condition variable requires a mutex lock and an
unlock, PLUS the wait on a condition variable. Waking a thread that's
waiting on a condition variable also requires locking and unlocking the
same mutex (in order to reliably set the predicate that must be tested
for a proper condition wait). (If you're not locking in the signalling
thread, then you're doing it wrong and your measurements have no
relevance to a real program.)
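
For concreteness, here is the full protocol that any such measurement has to
include: the predicate is set and tested under the mutex on both sides, and
the wait sits in a loop. The names are generic placeholders:

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int ready;                        /* the predicate */

void waiter(void)
{
    pthread_mutex_lock(&lock);
    while (!ready)                       /* loop handles spurious wakeups */
        pthread_cond_wait(&cond, &lock);
    /* ... consume whatever "ready" protects ... */
    ready = 0;
    pthread_mutex_unlock(&lock);
}

void signaller(void)
{
    pthread_mutex_lock(&lock);           /* set the predicate under the mutex */
    ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}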

Why, exactly, would you expect the performance of the condition variable
protocol to be equivalent to the mutex protocol that consists of a small
part of the condition variable protocol -- and, most importantly, that
excludes the actual BLOCKING part of the condition variable protocol?

As for the mutex rate -- 4.2 million per second means that each
lock/unlock pair takes less than 1/4 of a microsecond. Given the
inherent memory system costs of instructions intended to allow
synchronization on a multiprocessor, you'd need to be running on a
REALLY fast machine for that number to be "bad".

> Is there anything faster available for putting many
> threads to sleep and waking them up many times a
> second?

If your 26,000 per second rate isn't good enough, then the answer is
"probably not". Still, by my count, that's way up in the range of "many
times a second". What exactly are you attempting to accomplish by all
this blocking and unblocking, anyway? If you're doing it as a
consequence of some real work, then what's important is the performance
of the WORK, not the cost of individual operations involved in the work.
(You should really be trying to AVOID blocking, not worrying about
blocking faster, because blocking will always be slower than not
blocking.)

Synchronization is not the goal of multithreaded programming. It's a
necessary evil, that's required to make concurrent programming work.
Synchronization is pure overhead, to be carefully minimized. Every
program will run faster without synchronization... unfortunately, most
concurrent programs won't run CORRECTLY without it.

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/

=================================TOP=============
 Q193: Using DCE threads and java threads together on hpux(10.20)  
>I was wondering if anyone here has had any experience with
>using dce threads and java threads together on hpux(10.20)
>platform.

I'm presuming that you're using a JavaSoft jvm.


DON'T DO THAT!  DON'T EVEN TRY!!!!!


The DCE threads on hpux 10.20 are a user-space threads package.
The JVM uses a (different!) user-space threads package.

Ne'er the two shall meet.

If you have access to an hpux 11.x box, which has kernel threads,
there is a better chance of it working (not great, but better).

It's been quite a while since I looked inside the JavaSoft JVM,
but I seem to recall that the thread API isn't too ugly; you should
probably use those calls to do your C++ threads, but be warned
that you're in some rocky, unexplored territory.  I've even heard
that there be dragons there...

=================================TOP=============
 Q194: My program returns enomem on about the 2nd create.  
>   We just upgraded our Alpha from a 250MHz something to a 500+MHz dual
> processor running Digital Unix V4. My program which previously had no
> problem creating hundreds of threads returns enomem on about the 2nd to
> 4th thread create. DEC support advised increasing maxusers from 128 to
> 512 but to no avail. We've got 2Gig of memory and some other sys

The real question is, how much memory does the application use before that
final thread is created? The VM subsystem has an "optimization" for
tracking protected pages that simply doesn't work well with threads. The
thread library always creates a protected page for each stack, to trap
overflows. (You can run without this, by setting the guardsize attribute
to 0... but you shouldn't do that unless you're willing to bet money that
your thread won't ever, under any circumstances, overflow the stack;
without the guard page, the results will be catastrophic, unpredictable,
and nearly impossible to debug.)
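
(A minimal sketch of how the guardsize attribute is set, assuming a
UNIX98-style pthread_attr_setguardsize(); the thread function name is a
placeholder. Passing 0 is exactly the risky case described above.)

    #include <pthread.h>

    extern void *start_routine(void *);      /* placeholder thread function */

    int create_unguarded_thread(pthread_t *tid)
    {
        pthread_attr_t attr;

        pthread_attr_init(&attr);
        /* The default is one guard page per stack, so an overflow faults   */
        /* immediately.  A guardsize of 0 removes that page -- only safe    */
        /* if the thread can never, ever overflow its stack.                */
        pthread_attr_setguardsize(&attr, 0);
        return pthread_create(tid, &attr, start_routine, NULL);
    }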

The problem is that the VM subsystem has a table for dealing with adjacent
pages of differing protection, and it's based on the entire memory size of
the process. If the vm-vpagemax parameter is set to 2048, and you have 2048
pages allocated in the process, and you try to protect one of them, the
attempt will fail. If the protection was occurring as part of a stack
creation, pthread_create will return ENOMEM.

While most threaded programs will see this only when they create a lot of
threads (so that the aggregate stack allocation brings the process up over
the vm-vpagemax limit), any program that allocates lots of memory before
thread creation can hit the same limit -- whether the allocation is mmap,
malloc, or just a really big program text or data segment.

So check your vm-vpagemax and make sure it's big enough.

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/

=================================TOP=============
 Q195:  Does pthread_create set the thread ID before the new thread executes?  
 
Wan-Teh Chang wrote:

> The first argument for pthread_create() is the address of a pthread_t
> variable in which pthread_create() will write the new thread's ID before
> it returns.
>
> But it's not clear whether the new thread's ID is written into the
> pthread_t
> variable before the new thread begins to run.

The POSIX standard does not guarantee that the ID will be set before the thread
is scheduled. The actual text is "Upon successful completion, pthread_create
shall store the ID of the created thread [...]." You always need to
remember, in any case, that threads operate asynchronously, and one great
way to hammer that message home is to prevent anyone from counting on
basics like having the create's ID when the thread starts.

(Yeah, that sounds mean, and I guess it is. But way back in the early days
of threading, when nobody knew much about using threads, and people yelled
"I don't want to use synchronization, so you need to give me non-preemptive
thread scheduling!", we faced a really big problem in education. I
DELIBERATELY, and, if you like, maliciously, designed the CMA [and
therefore the DCE thread] create routine to store the new thread's id AFTER
the thread was scheduled for execution, specifically so that the thread
will, at least sometimes, find it unset, dragging the reluctant programmer,
kicking and screaming, into the world of asynchronous programming. This was
a purely user-mode scheduler, with a coarse granularity timeslicer, and it
was far too easy to write lazy code that wouldn't work on future systems
with multiple kernel threads and SMP. I couldn't prevent people from
getting away with bad habits that would kill their code later -- but I
could at least make it inconvenient! When I converted to a native POSIX
thread implementation for Digital UNIX 4.0, having battled the education
problem for over half a decade and feeling some reasonable degree of
success, I opted for convenience over forced education -- I set the ID
before scheduling the new thread, and made sure it was documented that
way.)

> I checked the pthread_create() man pages on all major commercial Unix
> implementations, and only the pthread_create(3) man page on Digital Unix
>
> (V4.0D) addresses this issue (and gives an affirmative answer):
>     DECthreads assigns each new thread a thread identifier, which DECthreads
>     writes into the address specified as the pthread_create(3) routine's thread
>     argument.  DECthreads writes the new thread's thread identifier before the
>     new thread executes.
>
> AIX 4.3, HP-UX 11.00, IRIX 6.3, and SunOS 5.6 do not specify the timing
> of the writing of new thread's ID relative to the new thread's execution.
>
> Is this something not specified in the POSIX thread standard?  I don't
> have a copy of the IEEE POSIX thread standard document, so all I can do
> is to read the man pages.  For my application, I need DECthreads'
> semantics that the new thread ID is written before the new thread
> executes.  I guess on other platforms, I will need to have use a mutex to
> block the new thread until the pthread_create() call has returned.

If you really need to code a thread that uses its own ID immediately, you
have a few choices. One, yeah, it can lock a mutex. Just hold a mutex (or,
better, use a condition variable and some predicate) around the
pthread_create call, and treat the thread ID as a shared resource. (Which
it is, although, since it's write-once, and thread create already
guarantees a consistent view of the address space to the created thread,
there's no need for additional synchronization if it's just written before
the new thread is scheduled.) Two, forget about the shared state and just
have the thread call pthread_self(), which will return the exact same ID
that the creator has stored (or will store).
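
(A sketch of the second option -- the simplest one: the new thread just asks
for its own ID instead of reading the creator's copy.)

    #include <pthread.h>

    void *start_routine(void *arg)
    {
        /* pthread_self() returns the same ID that pthread_create() stores */
        /* (or will store) for the creator, so there is no race to worry   */
        /* about and no shared variable to protect.                        */
        pthread_t me = pthread_self();
        /* ... use "me" ... */
        return arg;
    }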

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/
=================================TOP=============
 Q196: thr_suspend and thr_continue in pthread  
 
Niko D. Barli wrote:

> Is there anyway to implement, or to emulate
> Solaris thr_suspend() and thr_continue() in
> pthread ?

Yes, there is. But it's ugly and the result is an inefficient and
complicated wart that reproduces almost all of the severe problems
inherent in asynchronous suspend and resume. If you check the Deja News
archive for this newsgroup, you can probably dig up (much) earlier posts
where I actually told someone where to find suspend and resume code.

> This is the case why I need to use thr_suspend
> and thr_continue.

Think again! You don't need them, and you'll be better off if you don't
use them.

> I have 3 servers running on 3 hosts.
> Each server have 2 threads, each listening to 1 of the other 2 servers.
> Socket information is held in global area, so that every thread can
> access it.
>
> For example, in Server 1 :
>  - thread 1 -> listening to socket a (connection to Server 2)
>  - thread 2 -> listening to socket b (connection to Server 3)
>
> In each thread, I use select to multiplex between socket
> communication and standard input.
>
>   ..........
>
> For example, I ask server 1, to read data from server 2
> (by inputing command from stdin). If my input from stdin
> handled by thread 1, there will be no problem.
> But if thread 2 handle it, thread 2 will send request for
> data to server 2 and waiting. Server 2 will send back data,
> but the data is now listened by BOTH thread 1 and thread 2.
>
> So what I want to do is to suspend thread 1, and let thread 2
> get the data.

You do NOT want to use suspend and resume for this! What you're talking
about is SYNCHRONIZATION between two threads sharing the same resource.
Suspend and resume are NOT synchronization functions, and they won't do
what you want. For example, if you simply depend on asynchronously
suspending one thread "before" it reads from stdin, what if you're late?
(Threads are asynchronous, and, without explicit synchronization, you
cannot know what one is doing at any given time.) What if the thread
you've suspended has already tried to read, and currently has a stdio
mutex locked? Your other thread will simply block when it tries to read,
until the first thread is eventually resumed to complete its read and
unlock the mutex.

Suspend and resume are extremely dangerous and low-level scheduling
functions. You need to know a lot about everything a thread might possibly
be doing before you can safely suspend it -- otherwise you risk damaging
the overall application. (Very likely causing a hang.) If you don't know
every resource a thread might own when you suspend it, or you don't own
every resource YOU might need to do whatever it is you'll do while the
other thread is suspended, then you cannot use suspend and resume. Even if
you do know, and control, all threads, there is always a better and less
dangerous solution than suspend and resume. (Suspend and resume are often
used because they seem convenient, and expedient; and there are even rare
cases where they can be used successfully. But, far more often than not,
you'll simply let your customer find out how badly you've broken your
application instead of finding it yourself.)

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/
=================================TOP=============
 Q197: Are there any opinions on the Netscape Portable Runtime?  


I am working on the Netscape Portable Runtime (NSPR), so
my opinions are obviously biased.   I'd like to provide some info
to help you evaluate this product.

First, NSPR is more than a thread library.  It also includes
functions that are tied to the thread subsystem, most notably
I/O functions.  I/O functions can block the caller, so they must
know which thread library they are dealing with.

The thread API in NSPR is very similar to pthreads.  The
synchronization objects are locks and condition variables.
The NSPR thread API does not have suspend, resume,
and terminate thread functions.  It also does not have the
equivalent of pthread_exit().  NSPR has thread interrupt,
but not thread cancel.  The absence of these functions
in the API is by design.  Also, condition variables are associated
with locks when they are created, and condition notification
must be done while holding the lock, i.e., no "naked" notifies.

The implementation of  NSPR is often merely a layer
on top of the native thread library.  Where there are no native
threads available, we implement our own user-level threads.
NSPR does do a few "value-added" things:
1.  The condition variable notifies are moved outside of the
     critical section where possible.  You must write code
     like this:
           PR_Lock(lock);
           /* ... your code that adds an item to the queue ... */
           PR_NotifyCondVar(queue_nonempty_cv);
           PR_Unlock(lock);
    The actual pthread calls made by NSPR are:
           pthread_mutex_lock();
           /* ... your code that adds an item to the queue ... */
           pthread_mutex_unlock();
           pthread_cond_signal();
    (We use a reference count on the condition variables to deal with
    their destruction.)
2. In some two-level thread implementations, a blocking I/O call
    incurs the creation of a kernel schedulable entity (i.e., LWP).  To
    minimize the number of LWPs created this way, the NSPR I/O functions
    block all the callers on condition variables, except for one thread.
    A lucky thread is chosen to block in a poll() call on all the
    file descriptors on behalf of the other threads.
3. On NT, NSPR implements a two-level thread scheduler using
    NT fibers and native threads and uses NT's asynchronous I/O,
    while still presenting a blocking I/O API.  This allows you to
    use lots of threads and program in the simpler blocking I/O model.
4. Where it is just too expensive to use the one-thread-per-client
    model but you don't want to give up the simplicity of the blocking
    I/O model, there is work in progress to implement a "multiwait
    receive" API.
    (See http://www.mozilla.org/docs/refList/refNSPR/prmwait.html.)

These are just some random thoughts that came to mind.  Hope it helps.

Wan-Teh
=================================TOP=============
 Q198: Multithreaded Perl  
Hello All,

I have just finished updating my Win32 IProc Perl module, version 0.15;
now I can really say that it's a complete module.

Here are all the methods that I have implemented:

    new() (Constructor)
    Create()
    CloseHandle()
    ExitProcess()
    ExitThread()
    FindWindow()
    GetAffinityMask() (WinNT only)
    GetCommandLine()
    GetCurrentHandle()
    GetCurrentId()
    GetExitCode()
    GetExitCodeThread()
    GetThreadStatus()
    GetWorkingSet() (WinNT only)
    GetStatus() (WinNT only)
    GetPriorityClass()
    GetPriorityBoost() (WinNT only)
    GetThreadPriority()
    GetThreadPriorityBoost() (WinNT only)
    Kill()
    LastError()
    Open()
    Resume()
    SetAffinityMask() (WinNT only)
    SetIdealProcessor() (WinNT only)
    SetPriorityBoost() (WinNT only)
    SetPriorityClass()
    SetWorkingSet() (WinNT only)
    SetThreadPriority()
    SetThreadPriorityBoost() (WinNT only)
    SetThrAffinityMask() (WinNT only)
    ShowWindow()
    Sleep() 
    Suspend()
    SwitchToThread() (WinNT only)
    Wait()


With all those 35 methods you will be in complete control of your
Threads and Processes.

Add to this my Win32 MemMap that comes with:

 o SysV like functions (shmget,shmread,shmwrite ...)
 o Memory mapped file functions
 ... 

Plus my Win32 ISync module that comes with complete
synchronisation mechanisms like:

 o Mutex
 o Semaphores
 o Events
 o Timers 

 
and the sky will be the limit.

I have included a lot of examples in my modules. I have also updated my
IProc Perl documentation; you will find all the docs at:
http://www.generation.net/~cybersky/Perl/iprocess.htm

and all my modules at:
http://www.generation.net/~cybersky/Perl/perlmod.htm


Thank you for your time, and have a nice weekend.

Regards
Amine Moulay Ramdane.

"Long life to Perl and Larry Wall!"
=================================TOP=============
 Q199: What if a process terminates before mutex_destroy()?  
 
> File locks are released if a process terminates (as the files are closed),

Correct.

> while SYSV-IPC semaphores are persistent across processes,

Unless you specify the SEM_UNDO flag.

> What about (POSIX) mutex's?

There is no cleanup performed on them when a process terminates.
This could affect a mutex (or condition variable) with the process-
shared attribute that is shared between processes.

    Rich Stevens
 

One more point: the "persistence" of an IPC object is different from
what you are asking about, which is whether an IPC object is "cleaned
up" when a process terminates.  For example, using System V semaphores,
they always have kernel persistence (they remain in existence until
explicitly deleted, or until the kernel is rebooted) but they may or
may not be cleaned up automatically upon process termination, depending
on whether the process sets the SEM_UNDO flag.

Realize that automatic cleanup is normally performed by the kernel (as
in the System V semaphore case and for fcntl() record locks) but the
Posix mutual exclusion primitives (mutexes, condition variables, and
semaphores) can be (are normally?) implemented as user libraries, which
makes automatic cleanup much harder.
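
(For illustration, a sketch of what the process-shared attribute looks like in
use; the shared-memory mapping and error handling are elided, and remember --
as noted above -- nothing cleans this mutex up if a holder dies.)

    #include <pthread.h>

    /* "shm" is assumed to point into a region mapped MAP_SHARED (or a     */
    /* SysV shared memory segment) visible to every participating process. */
    pthread_mutex_t *init_shared_mutex(void *shm)
    {
        pthread_mutex_t    *m = (pthread_mutex_t *)shm;
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(m, &attr);       /* done once, by one process   */
        pthread_mutexattr_destroy(&attr);
        return m;
    }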

And, as others have pointed out here, automatic cleanup of a locked
synchronization primitive may not be desirable: if the primitive is
locked while a linked list is being updated, and the updating process
crashes, releasing the locked primitive does not help because the
linked list could be in some intermediate state.  But there are other
scenarios (such as an fcntl() record lock being used by a daemon to
make certain only one copy of the daemon is started) where the automatic
cleanup is desired.

> What about (POSIX) mutex's?  I don't see this documented anywhere.

It's hidden in the Posix specs--sometimes what is important is not
what the Posix spec says, but what it doesn't say.  "UNIX Network
Programming, 2nd Edition, Volume 2: Interprocess Communications"
(available in ~2 weeks) talks about all this.

    Rich Stevens
=================================TOP=============
 Q200: If a thread performs an illegal instruction and gets killed by the system...  
> % threads should remain open for the life of the application.  However
> % they could perform an illegal instruction and get killed by the system.
> % I would like for the thread creator to post an error that a thread has
> % died, AND then restart the killed thread.
>
> You don't have to worry about this particular case, since the system will
> kill the entire process for you if this happens. Threads aren't processes.

I've answered many questions, here and in mail, from people who expect that
illegal instructions or segmentation faults will terminate the threads. And
even from people who realize that it will terminate the process, but think
they WANT it to terminate only the thread.

That would be really, really, bad. A quick message to anyone who thinks they
want the process to "recover" from a segv/etc. in some thread: DON'T TRY IT.
At best, you'll just blow up your program later on. At worst, you'll corrupt
permanent external data (such as a database file), and won't detect the error
until much, much later.

Remember that a thread is just an "execution engine". Its only private data
is in the hardware registers of the processor currently executing the thread.
Everything else is a property of the ADDRESS SPACE, not of the thread. A
SIGSEGV means the thread has read incorrect data from the address space. A
SIGILL means the thread has read an illegal instruction from the address
space. Either a pointer (the PC in the case of SIGILL) or DATA in the address
space has been corrupted somehow. This corruption may have occurred within
the execution context of ANY thread that has access to the address space, at
any time during the execution of the program. It does NOT mean that there's
anything wrong with the currently executing thread -- most often, it's an
"innocent victim". The fault lies in the program's address space, and
potentially affects all threads capable of executing in that address space.

There's only one solution: shut down the address space, and all threads
within it, as soon as possible. That's why the default action is to save the
address space and context to a core file and shut down. This is what you want
to happen, and you shouldn't be satisfied with anything less. You can then
analyze the core file to determine what went wrong, and try to fix it.
Meanwhile, you've minimized the damage to any external invariants (files,
etc.)... and, at the very least, you know something went wrong.

In theory, an embedded system might handle a SIGSEGV, determine exactly what
happened, kill any "wayward" thread responsible for the corruption, repair
all data, and continue. Don't even IMAGINE that you can do this on anything
but a truly embedded system. You may be able to detect corruption in your
threads, and in the data under control of your code -- but you link against
libpthread, libc, and probably other libraries. They may have their own
threads, and certainly have LOTS of their own data. You cannot analyze or
reconstruct their data. The process is gone. Forget it and move on with life.

If you need to write a "failsafe" application, fork it from a monitor
process. Do NOT share any memory between them! The parent simply forks a
child, which exec*s the real application. The parent then wait*s for the
child, and if it terminates abnormally, forks a replacement. Either the
parent (before creating the replacement) or the replacement (on startup)
should analyze and repair any files that might have been damaged. And then
you're off and running. Safely.
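
(A bare-bones sketch of that monitor arrangement; the program path and the
repair step are placeholders, and error handling is omitted.)

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid;
        int   status;

        for (;;) {
            pid = fork();
            if (pid == 0) {                          /* child: the real app */
                execl("/usr/local/bin/realapp", "realapp", (char *)0);
                _exit(127);                          /* exec failed         */
            }
            waitpid(pid, &status, 0);                /* parent: wait for it */
            if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                break;                               /* clean exit: done    */
            /* check_and_repair_files();  -- placeholder for recovery       */
        }
        return 0;
    }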

> % I was going to use the posix call "pthread_join" to wait for thread
> % exits. However using  "pthread_join" does not give the thread id of the
> % thread that has died. Is there a way to do this
> % using another thread command.
>
> Well, you say
>   pthread_join(tid, &status);
>
> and if it returns with a 0 rc, the thread that died was the one with
> id _tid_. Your real problem here is that pthread_join won't return
> until the thread formerly known as tid has gone away, so you can't really
> use it to wait for whatever thread goes away first.

My guess is that John is writing code for Solaris (he wrote the article on
Solaris 2.6), and was planning to use the unfortunate Solaris join-any
"extension".

John, don't do that! It's a really, really, bad idea. Unlike UNIX processes,
there's no parental "line of descent" in threads. It's fine to have a "wait
any" that waits for any CHILD of the calling process, and therefore it seems
obvious to extend this concept to threads. But a thread has no children.
There are just an amorphous set of threads within a process, all equals. You
create threads, say, and a database program you're using creates threads, and
a fast sort library it uses creates more threads. Maybe you're also using a
thread-aware math library that creates more... and perhaps the thread library
has its own internal threads that occasionally come and go. Guess what? Your
"join any" will intercept the termination of the NEXT THREAD IN THE PROCESS
to terminate. It may be yours, or the thread library's, or anyone else's. If
it's someone else's thread, and the creator CARED about the termination of
that thread, you've broken the application. (Yeah, YOU broke it, because
there's nothing the library developer could reasonably be expected to do
about it.)

Generally true statement: Anyone who uses "join any" has a broken process.
The only exception is when you're sure there cannot possibly, ever, be any
threads in the process you didn't create. And I don't believe anyone can ever
reasonably be sure of that in a modular programming environment. That is, if
you link against a library you didn't write, you don't know it can't ever use
threads. And if it ever does, you lose.

The POSIX pthread_join() function was nearly eliminated from the standard at
several points during the development of the standard. It's a minimal "helper
function" that does nothing of any particular value. It's utterly trivial to
implement pthread_join() yourself. Combine one part mutex, one part condition
variable, a dash of data; stir, and serve. You want to join with ANY of your
worker threads? No problem. Just add another sprinkle of data to record (if
you care) which thread terminated. You don't even need to add more mutexes or
condition variables, because they can all share one set. (To code a "normal"
pthread_join, you'd usually want each thread to have its own set.) Before
terminating, each thread locks the mutex, stores its termination state
(whatever information you want from it) and (if you want) its ID, then
signals or broadcasts (depending on your desired semantics) the condition
variable. To "join", you just wait (in, of course, a correctly tested
predicated condition wait loop!) for someone to have terminated.
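
(A sketch of that recipe -- one mutex, one condition variable, and a dash of
data shared by all of your worker threads; the names are illustrative. A real
version would probably keep a list of terminated IDs rather than just the
last one.)

    #include <pthread.h>

    static pthread_mutex_t join_lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  join_cond  = PTHREAD_COND_INITIALIZER;
    static int             done_count = 0;      /* the predicate            */
    static pthread_t       last_done;           /* which worker finished    */

    /* Each worker calls this just before it returns.                       */
    void announce_exit(void)
    {
        pthread_mutex_lock(&join_lock);
        last_done = pthread_self();
        done_count++;
        pthread_cond_signal(&join_cond);
        pthread_mutex_unlock(&join_lock);
    }

    /* "Join any": block until some worker has announced its exit.          */
    pthread_t join_any_worker(void)
    {
        pthread_t who;

        pthread_mutex_lock(&join_lock);
        while (done_count == 0)                  /* predicated condition wait */
            pthread_cond_wait(&join_cond, &join_lock);
        done_count--;
        who = last_done;
        pthread_mutex_unlock(&join_lock);
        return who;
    }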

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/
=================================TOP=============
 Q201: How to propagate an exception to the parent thread?  

> Does anyone have or know of  an approach to mixing threads with C++
> exception handling?  Specifically, how to propagate an exception to the
> parent thread.  I can catch any exceptions thrown within a thread by way
> of a try block in the entry function.  The entry being a static class
> member function.  (I know, a "state of sin" wrt to C++ and C function
> pointers, but it works.)  Copying the exception to global (static or
> free store)  memory comes to mind.


It was the original intention of the designers of the C++ language to 
allow one to throw/catch across thread boundaries.  As far as I know 
the recently ratified ISO C++ standard makes no mention of threads 
whatsoever.  (The ISO C++ committee considered threads an OS and/or 
implementation, not a language issue.  BTW please don't send me a 
bunch of replies agreeing or disagreeing, I am not on the ISO 
committee, I am just reporting the facts. ;-)  However, I am unaware 
of ANY compiler that implements the ability to throw/catch across 
thread boundaries.  I have also discussed this issue with some 
experienced C++ programmers, and they also are unaware of any compiler
that implements this.  I am told that CORBA allows this if you want to
take that approach.  You may want to repost this in 
comp.lang.c++.moderated.  In general there are some problems with this
approach anyway, simply killing a thread does not cause the C++ 
destructors to be called.  (Again I say generally, because the ISO 
standard makes no mention of threads, there is no portable behaviour 
upon which you may count.)  It is usually better to catch an exception
within the thread that threw it anyway.

Peace

Peter



NB: There is no such thing as a "parent" thread.  All threads are created
equal.  But we know what you mean.  

RogueWave's threads.h++ does a rethrow of exceptions in the user-selected
thread (the assigned "parent").  You may wish to look at that.

-Bil
=================================TOP=============
 Q202: Discussion: "Synchronously stopping things" / Cheating on Mutexes  




William LeFebvre wrote:

> In article <[email protected]>, Bil Lewis   wrote:
> >  Practically speaking, the operation of EVERYBODY (?) is that a store
> >buffer flush (or barrier) is imposed upon unlocking, and nothing at all
> >done on locking.
>
> Well, except the guarantee that the lock won't be obtained until
> the flush is finished.

Actually, that's incorrect. There may be no "flush" involved. That's the whole
problem with this line of reasoning. One side changes data and unlocks a mutex;
the other side locks a mutex and reads the data. That's not a discrete event,
it's a protocol; and only the full protocol guarantees visibility and ordering.

I dislike attempts to explain mutexes by talking about "flushes" because while a
flush will satisfy the requirements, it's not a minimal condition. A flush is
expensive and heavy-handed. All that's required for proper implementation of a
POSIX mutex is an Alpha-like (RISC) memory barrier that prevents migration of
reads and writes across a (conceptual) "barrier token". This affects only the
ordering of memory operations FROM THE INVOKING PROCESSOR. With a similar memory
barrier in the correct place in mutex lock, the protocol is complete. But with
only half the protocol you get ordering/visibility on one side, but not on the
other; which means you haven't gotten much.

As implied by the quoted statement above, once you've GOTTEN the mutex, you can
be sure that any data written by the previous holder, while the mutex was
locked, has also made its way to the memory system. The barrier in unlock
ensures that, since the unlocked mutex value can't go out before the previous
data goes out; and the barrier in lock ensures that your reads can't be issued
before your mutex lock completes. But this assurance is not necessarily because
of a "flush", and the fact that someone else unlocked a mutex after writing data
is not enough to ensure that you can see it; much less that you can see it in
the correct order.

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/


Subject: Re: synchronously stopping things

David Holmes wrote:

> [email protected] wrote in article <[email protected]>...
> > One place where I try to avoid a mutex (at the risk of being called a
> fool)
> > is in singletons:
> >
> > MyClass* singleton(Mutex* mutex, MyClass** instance)
> > {
> >   if (*instance == 0) {
> >     mutex->lock();
> >     if (*instance == 0)  // must check again
> >       *instance = new MyClass();
> >     mutex->unlock();
> >   }
> >   return *instance;
> > }
>
> This coding idiom is known as the "Double Checked Locking pattern" as
> documented by Doug Schmidt (see his website for a pointer to a paper
> describing the pattern in detail). It is an optimisation which can work but
> which requires an atomicity guarantee about the value being read/written.
>
> The pattern works as follows. The variable being tested must be a latched
> value - it starts out with one value and at some point will take on a
> second value. Once that occurs the value never changes again.
>
> When we test the value the first time we are assuming that we can read the
> value atomically and that it was written atomically. This is the
> fundamental assumption about the pattern. We are not concerned about
> ordering as nothing significant happens if the value is found to be in the
> latched condition, and if its not in the latched condition then acquiring
> the mutex enforces ordering. Also we do not care about visibility or
> staleness.

The last part is critical, and maybe rather subtle. You have to not care about
visibility or latency.

So... the code in question is broken. It's unreliable, and not MP-safe at all.
That is, it's perfectly "MT" [multithread] safe, as long as you're on a
uniprocessor... but move to an aggressive MULTIPROCESSOR, and it's "game
over".

Why? Yeah, I thought maybe you'd ask. ;-)

The problem is that you're generating a POINTER to an object of class MyClass.
You're creating the object, and setting the pointer, under a mutex. But when
you read a non-NULL value of the pointer, you're assuming that you also have
access to the OBJECT to which that pointer refers.

That is not necessarily the case, unless you are using some form of explicit
synchronization protocol between the two threads, because having set the value
under a mutex does not guarantee VISIBILITY or ORDERING for another thread
that doesn't adhere to the synchronization protocol.

Yes, "visibility" might seem not to be an issue here -- either the other
thread sees the non-NULL value of "instance", and uses it, or it sees the
original NULL value, and rechecks under the mutex. But ORDERING is critical,
and it's really a subset of VISIBILITY.

The problem is that the processor that sees a non-NULL "instance" may not yet
see the MyClass data at that address. The result is that, on many modern
SMP systems, you'll read garbage data. If you're lucky, you'll SEGV, but you
might just accept bad data and run with it... into a brick wall.

The more aggressive your memory system is, the more likely this is to occur.
You wouldn't, for example, have any problem running on an Alpha EV4 chip...
but on an EV5 or EV6 SMP system, you'll probably end up with intermittent
failures that will be nearly impossible to debug, because they'll often depend
on nanosecond timing factors that you can't reproduce reliably even in
production code, much less under a debugger. (And if you slip by that
"probably" and miss the races, you can be sure that one of your customers will
run into one eventually... and that's even less fun.)

You can fix this problem very simply without a mutex, but that solution is
machine dependent. For example, using DEC C on a Digital UNIX Alpha system, it
could be as simple as changing your test to:

     #include 
         [...]
     if (*instance == 0) {
         [...]
     } else
         asm("mb");

The "mb" (memory barrier) between the test for the non-NULL pointer, and any
later dereferences of the pointer, ensure that your memory reads occur in the
correct and safe order. But now your code isn't portable. And get it wrong in
one place, and your program is toast. That's what I meant in a previous post
about the risk and cost of such optimizations. Is the cost of locking the
mutex really so high that it's worth sacrificing portability, and opening
yourself up to the whims of those ever-more-creative hardware designers?
Sometimes, yes. Most of the time... no way.
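
(Not from the post itself, but worth noting: if all you need is portable
one-time initialization, pthread_once() already pays the synchronization cost
for you. A minimal sketch, with a hypothetical constructor name:)

    #include <pthread.h>

    struct my_class;
    extern struct my_class *my_class_create(void);   /* hypothetical ctor   */

    static pthread_once_t   once = PTHREAD_ONCE_INIT;
    static struct my_class *instance;

    static void make_instance(void)
    {
        instance = my_class_create();
    }

    struct my_class *singleton(void)
    {
        /* pthread_once guarantees make_instance runs exactly once, and     */
        /* that every caller afterwards sees the fully built object.        */
        pthread_once(&once, make_instance);
        return instance;
    }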

/---------------------------[ Dave Butenhof ]--------------------------\
| Digital Equipment Corporation                   [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/



From - Sun May 10 01:31:31 1998
From: Achim Gratz 
Newsgroups: comp.programming.threads
Subject: Re: Mutexes and memory systems (was: synchronously stopping things)


OK, I'm trying to wrap things up a bit, although I guess it will still
become rather long.

Forget caching, memory barriers, store buffers and whatever.  These
are hardware implementation details that are out of your control.  You
don't want to know and most of the time you don't know everything
you'd need to anyway, which is more dangerous than knowing nothing.
Trying to infer the specification from the implementation is what gets
you into trouble.

When you lock mutex A, POSIX gives you a guarantee that all shared
data written by whatever thread under mutex A is visible to your
thread, whatever CPU it might run on, and has been completely written
before any reads to it can occur (this is the ordering part).  When
you unlock the mutex, it is best to assume that the shared data
vanishes in neverland.  It is not guaranteed to be up-to-date or
visible at all by POSIX, nor can you infer any order in which the
writes may be performed or become visible.  It is up to the
implementation to employ the hardware in the appropriate manner.
Efficient employment of the hardware is a quality of implementation
issue that's to be considered after the proof of correctness.

[ I don't have the standard, only Dave's book, but that is the
definition that would allow for the most aggressive memory
implementations.  It seems sensible, to me at least, to assume this
definition for utmost portability.  It would be interesting to know if
the exact wording does indeed support my interpretation and whether it
was intended to be that strong.  It would seem that for multiple
concurrent readers you'd need to lock multiple mutexes if you want to
stay within the bounds of the above definition. ]


The problem with this definition and the origin (I think) of the
brouhaha in various (usenet) threads here in this group is that
implementation of multiple reader situations seems overly expensive
since a single mutex allows only a single reader at a time, else you
need a mutex for every reader.  All of the schemes presented so far
that propose to avoid the mutex locking by the readers rely on further
properties of the hardware or pthread library implementation.
Fortunately or unfortunately these properties exist on most if not all
hardware implementations in use today or the library implementors have
taken care to slightly expand the guarantees made by POSIX because you
usually don't tell the customer that he is wrong.


Why do these hacks work?

1) no hardware designer can design a system where it takes an
unbounded time for a write to memory to propagate through the system
as that requires infinite resources

2) no one consciously introduces longer delays than absolutely necessary
for reasons of efficiency and stability

3) there is no implementation, AFAIK, of shared memory that is visible
only under an associated mutex, although capability based
architectures might have them

4) in the absence of 3, keeping a directory of the data changed under
each mutex is likely to be more expensive than making all data visible
and introducing order with respect to any mutex except for those that
are still under lock for writing

5) on a shared memory system that lacks 3 and 4, once the data has
been forced to memory by one processor, it is visible to all
processors and no stray stale copies are in caches if the write
occurred under mutex lock; even if not ordered on other processors,
data in memory becomes visible after a bounded time because of 1 and 2

6) the above holds for ccNUMA systems as well, although the time for
propagation can be considerably longer


What does not work?

a) any writes of shared data, without locking, atomic or not, with the
exception of "one-shot-flags" (i.e. any value different from the
initial one that has to be set before the threads are started signals
that some event occured and you never change the value back and you
don't care about the exact value itself)

b) multiple changes to a variable without unlocking/locking between
each change may not become visible at all or may be seen to have
different values in different threads

c) any read of shared data, without locking, where accessing data
after the writer has released the lock would be an error or getting
old data can't be tolerated

d) any read of shared data, without locking, that is not properly
aligned or is larger than the size of memory transactions (there may
be several), tearing may occur without notice

e) porting to a system where strict POSIX semantics are implemented
(e.g. NUMA systems with software coherency)


Summary:

You can't do much safely without those pesky mutexes.  The things you
can do aren't, IMHO, in the critical path (performance-wise) most of
the time.  The potential payback is low and the original problem can
be solved by barriers just fine within POSIX (I think - it's been a
long day).  POSIX 1003.1j will even give you barriers that may be more
efficient because the library writers took the time to evaluate the
hardware properties in depth.  That suggests that you could indeed
wring some cycles out of these hacks at the expense of portability and
correctness.  If you do, just make sure you don't give this piece of
code to anybody else.

[ If you think I'm paranoid, perhaps I am.  But I'm sick of commercial
software that isn't even linked properly so that it breaks on new
machines and OS releases and sometimes even OS patches when it would
be a simple matter of controlling the build environment to do it
right.  If you have to support more than one of these beauties you get
a very big headache when you find out that the intersection of system
configurations that work is the null set.  Specifications and
standards exist for a reason.  If you don't like them, get them
changed.  Don't break things gratuitously. ]


Achim Gratz.

=================================TOP=============
 Q203: Discussion: Thread creation/switch times on Linux and NT   



I'm so excited about this that I had to restate what I now think
to be the key design differences between NT and Linux 2.0 wrt. task
switching:

1.  The design of the Linux scheduler appears to make the assumption that,
    at any time during "normal" operation, there will only be a small
    number of actually runnable processes.

2.  The Linux scheduler computes which of these runnable processes to run
    via a linear scan of the run queue - looking for the highest priority
    process.

3.  The Linux yield_cpu() function is EXTREMELY prejudicial towards the
    calling program.  If you call yield_cpu() you are not only yielding
    the CPU, but you are also setting your priority to zero (the lowest)
    meaning that you will not run again (because of #2 above) until ALL
    other runnable processes have had a bite at the CPU.

4.  A process, under Linux, steadily has its priority lowered as a function of
    how long it has been scheduled to the CPU.

5.  The Linux scheduler re-sets everyone's priority to a "base"
    priority once all of them have had their priority lowered to zero
    (either through #3 or #4).  This re-set entails another linear
    traversal of the run queue in schedule().


Comments:

a:  #1 is probably a very reasonable assumption.

b:  #2 causes task-switching time, on Linux, to degrade as more runnable
    processes are added.  It was obviously a design decision driven by
    the assumption in #1.

c:  #3 is, to me, a contentious issue.  Should you get penalized for
    voluntarily yielding the CPU - should it put you on the "back of the bus"
    or should it simply lower your priority by one?  After all, most other
    voluntary yields (such as for I/O or to sleep for a time) usually
    raise your priority under other UNIXs (don't know if that's the case
    with Linux - haven't checked).

    In either case, Mingo's code changes this policy.

d:  #4 is standard, textbook, OS stuff.

e:  #5 is another reasonable behavior.  The linear scan is, again, a function
    of the belief that #1 is true (or so I believe).

f:  Because of the combined effects of #1, #2, #3, and #5 my yield_cpu()
    benchmark was indeed extremely prejudicial to Linux since the assumptions
    that I was making were not the same as those of Linux's designers.  That
    doesn't mean my benchmark is, or was, a "bad" benchmark.  Quite the
    contrary - it illustrates in painful detail what happens when the
    designers of a system are using different criteria than those of the
    users of the system.

    It is up to the community to decide which criteria is more valid.

The net result is that Linux may well beat out NT for context switches
where the number of runnable processes is very small.  On the other hand,
NT appears to degrade more gracefully as the runnable process count
increases.  Which one is a "better" approach is open to debate.

For example, we could probably make Linux degrade gracefully (through hashing,
pre-sorting, etc.), as does NT, at the expense of more up-front work with
the resultant degradation in context-switch time where the # of processes
is very small.

On the other hand, the crossover point between Linux vs. NT appears to be
right around 20 runnable processes.  On a heavily loaded web server (say)
with 20-40 httpd daemons plus other code, does the "real world" prefer the
NT way or the Linux way?  How about as more and more programs become
multithreaded?

The great thing about Linux is that we have the source - thus these
observations can be made with some assurance as to their accuracy.  As for
NT, I feel like the proverbial blind man trying to describe something
I've never seen.

The other great thing is that we can change it in the manner that best
suits our needs.  I love choice and I hate Microsoft.

greg



Let us pray:
What a Great System.
Please Do Not Crash.
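
(For context, a minimal sketch of the kind of yield benchmark being argued
about in this thread -- this is an illustration only, not greg's actual code;
the process count and iteration count are arbitrary, and error handling and
child reaping are omitted.)

    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int    nprocs = (argc > 1) ? atoi(argv[1]) : 2;
        long   iters  = 1000000, n;
        int    i;
        struct timeval t0, t1;
        double secs;

        /* Fork until there are nprocs copies, all running the same loop.   */
        for (i = 1; i < nprocs; i++)
            if (fork() == 0)
                break;

        gettimeofday(&t0, NULL);
        for (n = 0; n < iters; n++)
            sched_yield();
        gettimeofday(&t1, NULL);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("pid %d: %.0f yields/sec\n", (int)getpid(), iters / secs);
        return 0;
    }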


From: [email protected] (Linus Torvalds)
Subject: Re: Thread creation/switch times on Linux and NT (was Re: Linux users working at Microsoft!)
Date: 8 Mar 1998 01:31:03 GMT
Organization: Transmeta Corporation, Santa Clara, CA


In article ,
Greg Alexander  wrote:
>In article <[email protected]>, Gregory Travis wrote:
>>All process priorities were recomputed 99,834 times - or just
>>0.5% of the time.  Furthermore, only 31,059,779 processes (total)
>>were examined during those recalcs as opposed to the 61,678,377 that
>>were examined by the much more expensive "goodness" function.
>>
>>From my perspective, this would tend to strongly favor the current
>>scheduling implementation (simple linear search as opposed to more
>>complex but robust hashed run queue) - at least for web serving (strong
>>emphasis on the latter).  If I were to look for improvements, under this
>>scenario, I would focus on the "goodness" function since 4% of the time
>>we had to throw ten or more processes through it.  Perhaps bringing it
>>inline with the sched() function.
>>
>>But even that may be overkill since we only called sched() 24,221,164
>>times over a 17 hours period - or about 400 times per second.
>>
>>Comments?  I would be happy to make my modifications available (they are
>>trivial) to anyone who wants to instrument their own application.
>
>My biggest suggestion is to try kernel profiling.  Check if any notable
>amount of time is actually spent in goodness before worrying about changing
>it.

Also, check out 2.1.x - there are some changes to various details of the
scheduler that were brought on by the finer locking granularity, but
that were sometimes also related to performance. 

I do obviously agree with the basic points above - I wrote most of the
scheduler.  Usually there aren't all that many runnable processes even
under heavy load, and having a very simple linear queue is a win for
almost all situations in my opinion.  For example, if there are lots of
processes doing IO, the process list tends to be fairly short and you
really want a very simple scheduler for latency reasons.  In contrast,
if there are lots of CPU-bound processes, there may be lots of runnable
processes, but it very seldom results in a re-schedule (because they
keep running until the timeslot ends), so again there is no real reason
to try to be complex. 

So yes, under certain circumstances the current scheduler uses more CPU
than strictly necessary - and the "40 processes doing a sched_yield()
all the time" example is one of the worst (because it implies a lot of
runnable processes but still implies continuous thread switching). 

Personally I don't think it's a very realistic benchmark (it tells you
_something_, but I don't think it tells you anything you need to know),
which is one reason why Linux isn't maybe the best system out there for
that particular benchmark.  But it would be easy enough to make Linux
perform better on it, so I'll think about it. 

[ Even when I don't find benchmarks very realistic I really hate arguing
  against hard numbers: hard numbers are still usually better than just
  plain "intuition".  And I may well be wrong, and maybe there _are_
  circumstances where the benchmark has some real-world implications,
  which is why I wouldn't just dismiss the thing out-of-hand.  It's just
  too easy to ignore numbers you don't like by saying that they aren't
  relevant, and I really try to avoid falling into that trap. ]

The particular problem with "sched_yield()" is that the Linux scheduler
_really_ isn't able to handle it at all, which is why the Linux
sched_yield() implementation sets the counter to zero - I well know that
it's not the best thing to do for performance reasons, and I think it
unduly penalizes people who want to yield some CPU time, but as it
stands the scheduler can't handle it any other way (the "decrement
counter by one" approach that Ingo suggested is similarly broken - it
just happens to not show it quite as easily as the more drastic "zero
the counter", and it has some other problems - mainly that it doesn't
guarantee that we select another process even if another one were to be
runnable). 

I should probably add a "yield-queue" to the thing - it should be rather
easy to do, and it would get rid of the current scheduler wart with
regard to sched_yield().  My reluctance is purely due to the fact that I
haven't heard of any real applications that it would matter for, but I
suspect we need it for stuff like "wine" etc that need to get reasonable
scheduling in threaded environments that look different from pthreads(). 

        Linus
From - Sun Mar  8 15:03:12 1998
From: [email protected] (Gregory Travis)
Subject: Re: Thread creation/switch times on Linux and NT (was Re: Linux users working at Microsoft!)

Here's some more data, using the latest version of my context switching
benchmark.  Test machine is a 64MB 200Mhz Pentium "classic".

Switch time         Number of processes/threads
                2       4       8       10      20      40
                ----    ----    ----    ----    ----    ----
Std. Procs      19us    13us    13us    14us    16us    27us
Std. Threads    16us    11us    10us    10us    15us    23us

Mingo Procs      4us     6us    11us    12us    15us    28us
Mingo Threads    3us     3us     5us     7us    12us    22us

NT Procs        10us    15us    15us    17us    16us    17us
NT Threads       5us     8us     8us     9us    10us    11us


Explanation:

The "Std." entries show the results of my yield_cpu() benchmark against
the standard Linux scheduler using either threads or processes.

The "Mingo" entries show the results of the same benchmark but after the
Linux yield_cpu() entry has been modified per Mingo's suggestion so that
it doesn't take the counter to zero.

The "NT" entries show the results of the benchmark under NT.

Each benchmark was run twice for each number (to promote accuracy).  Thus
the above is the result of 72 individual runs.

Analysis:

The dramatic drop in context switch time, between the "Std." and "Mingo"
runs shows how expensive the priority recalc can be - for short run
queues at least.  Note that it makes little or no difference as the
run queue length exceeds about 10 processes.  This is almost certainly
because the cost of the "goodness" function begins to dominate the picture.
For a given number of iterations, the goodness function is much more
expensive than the priority recalc function.  The goodness function must
be performed on each runnable process while the priority recalc must be
performed on all processes.  Thus with a small # of runnable processes,
the expensive goodness function is not called much while the "cheap"
priority recalc is called for each process, runnable or not.  As the run
queue grows, however, the goodness function is called more (while the
priority recalc function is essentially constant).  Around ~15 processes,
on my system, the cost of "goodness" washes out the noise from the priority
recalc.

Nevertheless, the context switch times shown in the "Mingo" series is
probably closest to the actual Linux context switch times.  Note how the
series dramatically illustrates how context switch overhead, on Linux,
grows as a function of the run queue length.

It appears that the context-switch overhead for Linux is better than NT for
shortish run queues and, especially, where process/process switch time is
compared.  With run queues longer than about 20 processes, though, NT's
scheduler starts to beat out the Linux scheduler.  Also note that NT's
scheduler appears more robust than the Linux scheduler - its degradation
as the run queue grows is nowhere as dramatic as Linux's.  NT's thread
switch times doubled between 2 and 40 threads while Linux's showed a
>sevenfold< slowdown.

Does it matter?  Quite probably not.  From my earlier posting, with data
from a heavily loaded webserver, I saw an average run queue length of
2.5 processes.  The run queue exceeded 10 processes only about 4% of the
time.

I've put my benchmarks, as well as the kernel changes to record
run queue length, on anonymous ftp at weasel.ciswired.com

greg
From - Sun Mar  8 15:04:37 1998
From: [email protected] (Gregory Travis)
Subject: Re: Thread creation/switch times on Linux and NT (was Re: Linux users working at Microsoft!)


In article ,
Greg Alexander  wrote:
>In article <[email protected]>, Gregory Travis wrote:
>>Here's some more data, using the latest version of my context switching
>>benchmark.  Test machine is a 64MB 200Mhz Pentium "classic".
>>
>>Switch              Number of processes/Threads
>>Time            2   4   8   10  20  40
>>            ----    ----    ----    ----    ----    ----
>>Std. Procs      19us    13us    13us    14us    16us    27us
>>Std. Threads        16us    11us    10us    10us    15us    23us
>>
>>Mingo Procs      4us     6us    11us    12us    15us    28us
>>Mingo Threads        3us     3us     5us     7us    12us    22us
>>
>>NT Procs        10us    15us    15us    17us    16us    17us
>>NT Threads       5us     8us     8us     9us    10us    11us
>
>Does this look to you like NT maybe never traverses the tree and never
>updates priorities (assuming it even switches every time)?  This indicates
>non-complexity, which is beautiful, but I bet that they didn't do it well.
>(NT being VMS's deranged nephew or something)

I don't know what NT's scheduling algorithm is.  I'm very surprised, given
your comments below, that you are venturing an opinion on how NT works.  It
may not even use a list (what you referred to as a "tree" which it is not
in Linux) at all.

>Please, please, /PLEASE/ use profiling when talking about "this is almost
>certainly because the cost of the goodness function begins to dominate the
>picture."  It will tell you exactly which function dominates which picture
>quite clearly and simply.  It's much easier to say "goodness takes so much
>time, the recalc takes this much time," than bothering to make appeals of
>logic "goodness should take more time because."  Not that the latter is a
>bad idea in any case, just to explain why, but you should never explain why
>something is happening that you aren't certain is happening if you have an
>alternative.

Greg, so far you've contributed nothing positive to this venture other
than making most of us painfully aware that you don't even understand
ulimit and that your favorite way of showing how smart you are is by
throwing out red herrings at every opportunity.

I'll tell you what - why don't you try and reverse that impression?  I
spent about five hours of my life last night running the above sequence (not
to mention all the rest of the time I've devoted to this).  For the past
twenty years I've been paid to design and write software [including
a UNIX kernel release that used a scheduler I wrote] during the day
so perhaps you'll forgive me if I want to take this evening off and
instead watch Bill Gates lie on CSPAN.

So, here's something positive you can do: profile the kernel.  All my
sources and kernel changes are at weasel.ciswired.com (anonymous ftp).  Why
don't you take them and report back to us with your findings?  That would be
very nice, thanks.  Don't forget to do it with and without Mingo's very
helpful changes.

>Note that there are variables here you are controlling unintentionally. 
>Your statement would be better made as "With my benchmark and runqueues
>longer than about 20 processes, though, NT's..." or, to be specific, "When
>all runnable processes are calling sched_yield() in a loop and there are a
>minimal number of non-runnable processes and runqueues are longer than about
>20 processes..." and I'm sure there are plenty of other variables I've left
>out.  Having only about 80 processes, with 40 of them in a loop calling
>sched_yield(), you will not get general purpose numbers.  I'd almost expect
>more dormant processes to slow down linux more than NT in this case, but I
>don't know what would happen if the dormant processes were more like your
>"real life" example, i.e. many IO-bound programs that are awakened
>frequently, with an average of some number of them in the runqueue at once. 

You have an awful lot of "I'd almost expect," "I don't know," and
"I'm sure" statements for a guy who earlier so soundly admonished me
for stating what was clearly my opinion.

>NO!  Robust is the WRONG word!  Robust implies it can handle many different
>situations.  It is better at /THIS/ situation with large numbers of idling
>runnable processes.  Your test does not show how NT runs in real life
>situations.

I can accept that.  Where can I download your test?

>If NT's scheduler really were more "robust," it would matter a good deal. 
>All you've shown is that its times don't appear to grow linearly as the
>number of runnable idling processes grows.

Thank you.  That's all I claimed to show (along with the switch times).

greg
=================================TOP===============================
 Q204: Are there any problems with multiple threads writing to stdout?  

> >  > However, even if there are no problems, you may be seeing interleaved
> >  >output:
> >  >
> >  > example:
> >  >
> >  >  printf("x=%d, y=%d\n", x, y);
> >  >
> >  >there is no guarantee that x and y will appear on the same line
> >
> > Surely, printf() will lock the stream object (if you use the MT safe glibc2),
> > no?
>
> Not on Linux, or any other UNIX variant I've dealt with.  UNIX is used
> to it, even before threads.  stdout on NT doesn't make sense unless it's
> a console application.

For POSIX conformance, printf() must lock the process' stdio file stream. That is,
the output is "atomic". Thus, if two threads both call a single printf()
simultaneously, each output must be correct. E.g., for


         printf ("%d, %d\n", 1, 2);      printf ("%s, %s"\n", "abc",
                                         "def");

you might get

     1, 2
     abc, def

or you might get

     abc, def
     1, 2

but no more "bizarre" variations. If you do, then the implementation you're using
is broken.

There is another level of complication, though, if you're talking about the
sequence of multiple printf()s, for example. E.g., if you have

         printf ("%d", 1);               printf ("%s", "abc");
         printf (", %d\n", 2);           printf (", %s\n", "def");

Then you might indeed get something like

     abc1, def
     , 2

POSIX adds an explicit stdio stream lock to avoid this problem, which you can
acquire using flockfile() and release using funlockfile(). For example, you could
correct that second example by coding it as

         flockfile (stdout);             flockfile (stdout);
         printf ("%d", 1);               printf ("%s", "abc");
         printf (", %d\n", 2);           printf (", %s\n", "def");
         funlockfile (stdout);           funlockfile (stdout);

Of course, if you write to the same file using stdio from separate processes,
there's no synchronization between them unless there are some guarantees about how
stdio generates the actual file descriptor write() calls from its internal
buffering. (And I don't believe there is.)

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/
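
A minimal, self-contained sketch of the flockfile() idiom described above
(the writer() routine and the thread names are illustrative, not from the
post):

    #include <pthread.h>
    #include <stdio.h>

    static void *writer(void *arg)
    {
        const char *tag = arg;          /* thread name passed by main() */
        int i;

        for (i = 0; i < 5; i++) {
            flockfile(stdout);          /* hold the stdout stream lock  */
            printf("%s: part one", tag);
            printf(", part two\n");     /* same line, never interleaved */
            funlockfile(stdout);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, writer, "thread A");
        pthread_create(&t2, NULL, writer, "thread B");
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }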



=================================TOP===============================
 Q205: How can I handle out-of-band communication to a remote client?  

  Stefan Rupp  wrote:
> Good afternoon,
>
> we encountered a problem in the design of our client-server architecture.
> The situation is as follows:
>
>  [1] the server runs as a demon process on an arbitrary host
>  [2] the client may connect to any number of servers
>  [3] when connected, the client requests data from a server
>      through a TCP socket and waits for the server to deliver
>      the requested data
>  [4] the server itself can send messages to the client at any
>      time without being asked by the client
>
> In a first step, I designed the client multithreaded, consisting of the
> 'main' thread and an 'I/O' thread, which handles the communication between
> the client and the server through a SOCK_STREAM socket with a select(2)
> call. The connection between the main thread and the I/O thread is made
> through a pair of pipes, so that the select call, which waits for
> incoming messages from the server as well as from the main thread,
> returns and handles the request. To open a new I/O thread for each server
> the client wants to connect to, is probably not a good idea, because I
> need two pipes for each thread to communicate with. So, only one I/O
> thread must handle the connection to any server the client connects to.
>
> Does anybody have a better idea how to design the client, so that it
> can handle unexpected callbacks from the server at any time? In the
> book "UNIX Network Programming" it is stated that signal driven I/O
> is not advisable for a communication link through stream sockets, so
> that is not an option.
>
> Thanks!
>
> Doei,
>      struppi
>
> --
> Dipl.-Inform. Stefan H. Rupp
> Geodaetisches Institut der RWTH Aachen         Email: [email protected]
> Templergraben 55, D-52062 Aachen, Germany      Tel.:  +49 241 80-5295
>


Change the client a little. Have one thread that waits on the responses from
the socket - this is a blocking call so it is VERY efficient - (you will want a
timeout in there to do housekeeping and to check for shutdown every few seconds
though). Have a second thread that sends messages to the server on the
socket. This is safe, because sockets are bidirectional async. devices. If
the receive thread knows how to deal with messages from the server the
architecture is quite simple. You may need a queue of messages waiting to be
processed if processing time is long, or a queue of messages to send to the
server to prevent contention on SENDING to the server.

We have implemented a client server using such an architecture - it works very
well with full async. bidirectional messaging between client and server. The
server can deal with 1500 messages (total, not each) a second from 200 clients.

Nick
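
A sketch of the receive side Nick describes; shutting_down, handle_message(),
and the two-second housekeeping interval are placeholders of mine, not part
of the original design:

    #include <sys/types.h>
    #include <sys/time.h>
    #include <sys/select.h>
    #include <unistd.h>

    extern volatile int shutting_down;          /* set elsewhere to stop */
    extern void handle_message(const char *buf, ssize_t len);

    void *receiver(void *arg)
    {
        int sock = *(int *)arg;
        char buf[4096];

        while (!shutting_down) {
            fd_set readfds;
            struct timeval tv = { 2, 0 };       /* 2 second housekeeping tick */

            FD_ZERO(&readfds);
            FD_SET(sock, &readfds);
            if (select(sock + 1, &readfds, NULL, NULL, &tv) > 0) {
                ssize_t n = read(sock, buf, sizeof buf);
                if (n <= 0)
                    break;                      /* peer closed, or error */
                handle_message(buf, n);         /* or queue it if slow   */
            }
            /* timed out: fall through, do housekeeping, re-check flag */
        }
        return NULL;
    }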



=================================TOP===============================
 Q206: I need a timed mutex for POSIX  

[email protected] wrote:

> I am doing multi-platform development, and have got several very successful
> servers running on NT and on AIX. The problem is that NT is MUCH more
> efficient in its MUTEX calls than AIX, because the POSIX mutex call
> pthread_mutex_lock (mutex) does not have a timeout; for that reason I need
> to do a loop doing a pthread_mutex_trylock (mutex) and a 20 millisecond sleep
> until timeout ( usually 5 seconds )

Why?

Or, more specifically, exactly what do you intend to do when the loop times
out?

   * Which thread owns the mutex? (No way to tell, without additional
     information that cannot be used reliably except under control of a mutex;
     and you've already declared that, in your application, the mutex usage
     protocol is unreliable.)
   * What is that thread doing? Is it hung? Broken? Did it get preempted and
     miss a deadline, but "still ticking"? Unless you know that (not
     impossible, but EXTREMELY difficult to implement, much less to get
     right), you CANNOT "steal" the mutex, or know what to do once you've got
     it.
   * You cannot force the owner of the mutex to unlock. You cannot unlock from
     your current thread. You can't assume you now own it. If you knew the
     owner, you could cancel it and join with it (as long as you know nobody
     else is already joining with it), hoping that "it's broken but not TOO
     broken". But then what happens if it doesn't terminate, or if it's
     sufficiently broken that it doesn't release the mutex on the way out?

This is the kind of thing that may sound "way cool" for reliable, fail-safe
servers. In practice, I doubt the value. That kind of fail-safety is almost
always complete illusion except in rigorously isolated embedded system
environments. And in such an environment, it's trivial to write your own
pthread_mutex_timedwait() or work out some alternate (and probably better)
method to recover your runaway state.

In a fully shared memory multithreaded server, when something's "gone wrong"
and you lose control (and that's what we're talking about), the ONLY safe
thing to do is to panic and crash the process, NOW. You can run the server
under a monitor parent that recognizes server exit and forks a new copy to
continue operation. You can keep operation logs to recover or roll back. But
you cannot make the process "fail safe".

> The problem is this is inefficient. NT has a Wait_for_MUTEX with timeout.
> this is good.
> (bummer, Bill got it right :-(   )

No. Just another misleading and overly complicated function that looks
neato-keen on paper. Any code that really, truly DEPENDS on such a capability
is already busted, and just doesn't know it yet.

(Oh, and, yes, I say this with the explicit knowledge that all generalizations
are false, including this one. There is certainly code that doesn't need to be
100% fail safe, and that may be able to productively use such primitives as a
timed mutex wait to slightly improve some failure modes. Maybe, in a very few
cases, maybe even yours, all of the time and effort that went into it provides
some real benefit. "The one absolute statement I might make is that none of my
statements are absolute." ;-) )

You can put together a "timed mutex" yourself, if you want, using a mutex and
a condition variable. Use the mutex to serialize access to control
information, such as your own ownership and waiter data, and use a condition
variable to wait for access. A waiter that times out can then determine which
thread (in your APPLICATION scheme) owns the "mutex". Of course, if the
application is really ill-behaved, then even the "control mutex" might not be
unlocked -- I doubt you could do much in that case, anyway.

One final note. As I said, such "unusual" things as timed mutex waits CAN make
sense for carefully coded embedded application environments, and the folks in
the POSIX realtime working group worry about that sort of thing a lot. While
the concept of timed mutex waits was passed over for POSIX 1003.1c-1995 as too
specialized, the "additional realtime features" standard, 1003.1d, (still in
draft form), adds pthread_mutex_timedwait.

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/


[If you *really* need a timed mutex, you can look at the sample code for
timed mutexes on this web page -- Bil]
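
A condensed sketch of the mutex + condition variable construction Dave
describes; the timed_lock_t type and function names are invented here for
illustration (POSIX 1003.1d later standardized pthread_mutex_timedlock()
for this purpose):

    #include <pthread.h>
    #include <time.h>

    typedef struct {
        pthread_mutex_t guard;      /* protects 'locked'                   */
        pthread_cond_t  released;   /* signalled when 'locked' drops to 0  */
        int             locked;
    } timed_lock_t;

    void timed_lock_init(timed_lock_t *tl)
    {
        pthread_mutex_init(&tl->guard, NULL);
        pthread_cond_init(&tl->released, NULL);
        tl->locked = 0;
    }

    int timed_lock_acquire(timed_lock_t *tl, const struct timespec *abstime)
    {
        int rc = 0;

        pthread_mutex_lock(&tl->guard);
        while (tl->locked && rc == 0)
            rc = pthread_cond_timedwait(&tl->released, &tl->guard, abstime);
        if (rc == 0)
            tl->locked = 1;         /* we now "own" the application lock */
        pthread_mutex_unlock(&tl->guard);
        return rc;                  /* 0 on success, ETIMEDOUT on timeout */
    }

    void timed_lock_release(timed_lock_t *tl)
    {
        pthread_mutex_lock(&tl->guard);
        tl->locked = 0;
        pthread_cond_signal(&tl->released);
        pthread_mutex_unlock(&tl->guard);
    }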



=================================TOP===============================
 Q207: Does pthreads have an API for configuring the number of LWPs?  

"Hany Morcos (CS)" wrote:

>    Hi, does pthreads have an API for configuring the number of
> LWPs for a specific set of threads?  Or do most OSes assign
> an LWP per group of threads?

The short answer: PThreads, no.  But UNIX98 includes a
pthread_setconcurrency() extension to the POSIX thread API.

The long answer:

First, "LWP" is a Solaris-specific (actually, "UI thread" specific, but
who cares?) term for a kernel thread used to allow a threaded process to
exploit O/S concurrency and hardware parallelism. So "most OS"s don't
have LWPs, though they do have some form of kernel threads.

(Note, this is all probably more than you want or need, but your
question is rather "fuzzy", I tend to prefer to give "too much"
information rather than "not enough", and for some reason I appear to be
in a "talkative" mood... ;-) )

POSIX 1003.1c-1995 ("pthreads") deliberately says very little about
implementation details, and provides few interfaces specifically to
control details of an implementation. It does allow for a two-level
scheduler, where multiple kernel threads and POSIX threads interact
within a process, but provides only broad definitions of the behavior.
There is no way to directly control the scheduling of PCS threads
("Process Contention Scope", or "user mode") onto "kernel execution
entities" (kernel threads). Although there is a mechanism to avoid
user-mode scheduling entirely, by creating SCS ("System
Contention Scope") threads, which must be directly scheduled by the
kernel. (Or at least must behave as if so scheduled, with respect to
threads in other processes.)

There's no form of "thread grouping" supported. Some systems have class
scheduling systems that allow you to specify relations between threads
and/or processes, but there's nothing of the sort in POSIX. (Nor, if
there were, would it necessarily group threads to an LWP as you
suggest.)

POSIX threads requires that a thread blocking for I/O cannot
indefinitely prevent other user threads from making progress. In some
cases, this may require that the implementation provide a new kernel
execution entity. It can do so either as a "last ditch" effort to
prevent completely stalling the process (as Solaris generally does, by
creating one additional LWP as the last current LWP in the process
blocks), or as a normal scheduling operation (as Digital UNIX does) to
always maintain a consistent level of PCS thread concurrency in the
process. (While I prefer the latter, and experience has shown that this
is what most people expect and desire, POSIX doesn't say either is right
or wrong; and in addition, there are costs to our approach that aren't
always repaid by the increased concurrency.)

UI threads was designed to allow/require the programmer to control the
level of process concurrency, and Sun's POSIX thread implementation uses
the same thread scheduler as their UI thread implementation. While the
"last ditch" LWP creation prevents indefinite stalls of I/O-bound
applications, it doesn't help applications with multiple compute-bound
threads, (the implementation doesn't time-slice PCS threads). And, at
best, the model allows the process concurrency to be reduced to 1 before
offering any help. (Digital UNIX does time-slice PCS threads, so
compute-bound threads can coexist even on a uniprocessor [though this
isn't the most efficient application model, it's common and worth
supporting].) UI threads provides a thr_setconcurrency() call to allow a
careful programmer to dynamically "suggest" that additional LWPs would
be useful.

Due to Sun influence (and various other vendors who had intended
similarly inflexible 2-level schedulers), the Single UNIX Specification,
Version 2 (UNIX98) includes a pthread_setconcurrency() extension to the
POSIX thread API. Due to increasing cooperation between The Open Group
and PASC (the IEEE group that does POSIX), you can expect to see the
UNIX98 extensions appear in a future version of the POSIX standard. Note
that while this function is essential on Solaris, it has no purpose (and
does nothing) on Digital UNIX, (or on Linux, which uses only kernel
threads). I expect other vendors to move away from hints like
pthread_setconcurrency() as they (and their users) get more experience
with threading. The need for such hackery is largely responsible for the
unsettlingly common advice of UI thread wizards to avoid the Solaris
default of PCS threads ("unbound", in UI terminology) and to use SCS
threads ("bound") instead.

In some ways this is much like the old Win32 vs. Mac OS debate on
preemptive vs. cooperative multitasking. While cooperative multitasking
(or the simplistic/efficient Solaris 2-level scheduling) can be much
better for some class of applications, it's a lot harder to write
programs that scale well and that work the way users expect with
(unpredictable) concurrent system load. While preemptive multitasking
(or tightly integrated 2-level scheduling) adds (system) implementation
complexity and some unavoidable application overhead, it's easier to
program for, and, ultimately, provides more predictable system scaling
and user environment.

>    Wouldn't it make more sense if one LWP blocks for a disk I/O
> instead of the entire program, when using green threads?

"Green threads" is the user-level threading package for Java. It doesn't
use multiple kernel threads, and therefore cannot use hardware
parallelism or true I/O concurrency (although it has hooks to use
non-blocking UNIX I/O to, in many cases, schedule a new user thread
while waiting for I/O).

Modern implementations of Java should use native threads rather than
Green threads. In the case of a Solaris Java using UI threads or POSIX
threads rather than Green threads, disk I/O WOULD block only the LWP
assigned to the calling thread. There's no reason to be using Green
threads on any O/S that has real thread support!

>    I guess now it is very safe for multiple threads to directly
> write to a stream queue since write and read are thread safe.

Java I/O must be thread-safe. ANSI C I/O under a POSIX thread
implementation must be thread-safe. However, there's no standard
anywhere requiring that C++ I/O must be thread-safe -- nor is there for
most other languages. So you need to watch out HOW you write. If you're
writing Java or C, you're probably pretty safe. In any other language,
watch out unless you're using ANSI C/POSIX I/O functions directly.

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/
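
A sketch of the two knobs discussed above: the UNIX98
pthread_setconcurrency() hint, and creating a system-contention-scope
(SCS, "bound") thread that bypasses the user-level scheduler. The
thread_main() routine and the value 4 are placeholders of mine:

    #include <pthread.h>

    extern void *thread_main(void *arg);

    int start_bound_thread(pthread_t *tid)
    {
        pthread_attr_t attr;

        /* Hint (useful on 2-level schedulers such as Solaris) that 4
           kernel execution entities would be welcome; a no-op on
           systems such as Digital UNIX or Linux. */
        pthread_setconcurrency(4);

        pthread_attr_init(&attr);
        pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);  /* SCS thread */
        return pthread_create(tid, &attr, thread_main, NULL);
    }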



=================================TOP===============================
 Q208: Why does Pthreads use void** rather than void*?  

Ben Elliston wrote:

> Wang Bin  writes:
>
> >     Today, when I was looking at thr_join(thread_t, thread_t*, void**),
> > I was suddenly confused by void* and void**. Why the third parameter
> > here is void** rather than void*?
>
> The third parameter is a void * so that the result can be anything you
> like--it's a matter of interpretation.  However, you need to pass the
> address of a void * so that the function can modify the pointer.

The POSIX thread working group wanted to specify a way to pass "any value"
to and from a thread, without making the interface really bulky and
complicated. The chosen way (it's often been debated whether the approach
was right, or good, but that's all irrelevant now) was to use "void*" as a
universal (untyped) value. It's NOT necessarily a pointer (though of course
it may be)... it's just an untyped value. The UI thread interface (defined
by many of the same people) has the same logic.

So when you create a thread, you pass in a "void*" argument, which is
anything you want. When a thread terminates, either by returning from its
start routine or by calling pthread_exit (or thr_exit), it can specify a
"void*" return value. When you join with the thread, you can pass the
function a POINTER to some storage that will receive this thread return
value. The storage to which you point must, of course, be a "void*".

Beware ("be very, very ware", as Pooh was once warned), because this
mechanism, while often convenient, is not at all type-safe. It's really easy
to get yourself into trouble. Do not, EVER pass the address of something
that's not "void*" into thr_join/pthread_join and simply cast the pointer to
(void**). For example, let's look at

     size_t TaskStatus;
         ......
     thr_join(..., ..., (void**)&TaskStatus);

(This is slightly different from Ben's example. He cast the pointer to
"void*"... that'll work, since ANSI C is willing to implicitly convert
between any pointer type and "void*", but the parameter type is actually
"void**".)

What is the SIZE of size_t? Well, on a conventional 32-bit system, size_t and
"void*" are probably both 32 bits. On a conventional 64-bit LLP system,
they're probably both 64 bits. But ANSI C doesn't require that conformity.
So what if size_t is a 32-bit "int", while "void*" is 64-bit? Well, now
you've got 32 bits of storage, and you're telling the thread library to
write 64 bits of data at that address. You've also told the compiler that
you really, really, for sure know what you're doing. But you really don't,
do you?

The construct is extremely common, but it's also extremely dangerous, wrong,
and completely non-portable! Do it like the following example, instead. It's
a little more complicated, but it's portable, and that may save you a lot of
trouble somewhere. Your compiler might warn you that size_t is smaller than
void*, in cases where you might have otherwise experienced data corruption
by overwriting adjacent storage. If the original value passed by the thread
really WAS a size_t, the extra bits of the void* would have to be 0, [or
redundant sign bits if "size_t" is signed and the value is negative], and
losing them won't hurt you.

     void *result;
     size_t TaskStatus;
         ......
     thr_join(..., ..., &result);
     TaskStatus = (size_t)result;

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/
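
For completeness, the same portable pattern with pthread_join(); the
worker() routine and the status value 42 are illustrative, not from the
post above:

    #include <pthread.h>

    void *worker(void *arg)
    {
        size_t status = 42;
        (void)arg;
        return (void *)status;          /* a value smuggled in a void *  */
    }

    size_t join_and_get_status(pthread_t tid)
    {
        void *result;

        pthread_join(tid, &result);     /* pass a void **, get a void *  */
        return (size_t)result;          /* convert back; no overwrite of
                                           adjacent storage is possible  */
    }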

 =================================TOP===============================
 Q209: Should I use poll() or select()?  

[email protected] (W. Richard Stevens) writes:

>Second, I used to advocate select() instead of poll(), mainly because
>of portability, but these days select() is becoming a problem for
>applications that need *lots* of descriptors.  Some systems let you
>#define FD_SETSIZE to use more than their compiled-in limit (often
>256 or 1024), some require a recompile of the kernel for more than
>the default, and some require a recompile of the library function
>named select().  For these applications poll() is better, as there
>is no inherent limit (other than the per-process descriptor limit).

Indeed.

>Another complaint I had against poll() was how hard it was to remove
>a descriptor from the array (typical for network servers when a client
>terminates), but now you just set the descriptor to -1 and it's ignored.

But that was fixed a long, long time ago.

Which brings me to another advantage of poll(): you just specify the
events you are interested in once; poll uses a different field for the
result events.  (So no resetting of bits in the select masks).

Also, on Solaris, select() is a library routine implemented on top
of poll(); that costs too.  (Though on other systems it might be the
reverse)

Casper
--
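
A minimal poll() loop illustrating the points above: events is set once,
results come back in the separate revents field, and a closed descriptor
is retired by setting fd to -1. The fds[] array and nfds are assumed to
be filled in elsewhere:

    #include <poll.h>
    #include <unistd.h>

    void serve(struct pollfd *fds, nfds_t nfds)
    {
        nfds_t i;

        for (;;) {
            if (poll(fds, nfds, -1) < 0)
                continue;                       /* e.g. EINTR; just retry */
            for (i = 0; i < nfds; i++) {
                if (fds[i].revents & (POLLHUP | POLLERR)) {
                    close(fds[i].fd);
                    fds[i].fd = -1;             /* poll() ignores this slot */
                } else if (fds[i].revents & POLLIN) {
                    /* read from fds[i].fd here ... */
                }
            }
        }
    }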
 

=================================TOP===============================
 Q210: Where is the threads standard of POSIX ????  

try http://www.unix-systems.org/single_unix_specification_v2/xsh/threads.html
 

=================================TOP===============================
 Q211: Is Solaris' unbound thread model braindamaged?  

"Doug Royer [N6AAW]" wrote:

> Did you have a specifc braindamaged bug to report?
>
> In article <[email protected]>, Boris Goldberg  writes:
> >
> > I briefly browsed Solaris 7 docs at docs.sun.com and, regrettably,
> > it doesn't appear that they changed their braindamaged threading model.

Actually, I think Doug phrased that very well. In particular, he didn't use the
word "bug". He merely said "braindamaged". One might easily infer, (as I have),
that he's making the assumption that the "braindamaged" behavior is intentional,
and simply expressing regret that the intent hasn't changed.

Here's a few of the common problems with Solaris 2-level threading. I believe
one of them could accurately be described as a "bug" in Solaris (and that's not
confirmed). The others are merely poor design decisions. Or, in common terms,
"brain damage".

  1. thr_setconcurrency() is a gross hack to avoid implementing most of the 2-level
     scheduler. It means the scheduler puts responsibility for maintaining
     concurrency on the programmer. Nice for the Solaris thread subsystem
     maintainers -- not so nice for users. (Yes, UNIX has a long and
     distinguished history of avoiding kernel/system problems by complicating
     the life of all programmers. Not all of those decisions are even wrong.
     Still, I think this one is unnecessary and unjustifiable.)
  2. Rumor has suggested that Solaris creates one LWP by default even on SMP
     systems -- if that rumor is true, this condition might shade over the line
     into "true bug". But then, having an SMP isn't necessarily the same as
     being able to use it, so maybe that's deliberate, too.
  3. Blocking an LWP reduces the process concurrency. Yeah, sure the library
     will create a new one when the last LWP blocks, but that's not good. First,
     it means the process has been operating on fewer cylinders than it might
     think for some period of time. And, in many cases even worse, after the
     LWPs unblock, it will be operating on more cylinders than it can sustain
     until the LWPs go idle and time out. Running with more LWPs than processors
     is rarely a good idea unless most of them will always be blocked in the
     kernel. (I've heard unsubstantiated rumors that 2.6 did some work to
     improve on this, and 7 may do more; but I'm not inclined to let anyone "off
     the hook" without details.)
  4. While timeslicing is not required by POSIX, it is the scheduling behavior
     all UNIX programmers (and most who are used to other systems, as well)
     EXPECT. The lack of timeslicing in Solaris 2-level scheduling is a constant
     source of complication and surprise to programmers. Again, this isn't a
     bug, because it's clearly intentional; it's still a bad idea, and goes
     against the best interests of application programmers.

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/

 

=================================TOP===============================
 Q212:  Releasing a mutex locked (owned) by another thread.  
Zoom wrote:

> Hello, I have inherited the maintenance of a multi-threaded application.
> The application uses pthreads and runs on multiple platforms including
> solaris. On solaris it seems to be somewhat squirrely (the technical
> term of course :-) and I get random core dumps or thread panics.
> Absolutely not consistently reproducible. Sometimes it will go for
> hours or days cranking away and sometimes it will thread panic shortly
> after it starts up. In researching the the book "Multi-threaded
> Programming with Pthreads" by Bil Lewis et. al. I found on page 50 the
> statement to the effect that under posix it is illegal for one thread to
> release a mutex locked (owned) by another thread. Well, this application
> does that. In fact it does it quite extensively.
>
> Is there anyone willing to commit to the idea that this may be the
> source of the applications problems.

The answer is an absolutely, definite, unqualified "maybe". It depends
entirely on what the application is doing with those mutexes.

First, I want to be completely clear about this. Make no mistake, locking a
mutex from one thread and unlocking it from another thread is absolutely
illegal and incorrect. The application is seriously broken, and must be
fixed.

However, reality is a little more complicated than that. POSIX explicitly
requires that application programmers write correct applications. More
specifically, should someone write an incorrect application, it explicitly
and deliberately does NOT require that a correct implementation of the
POSIX standard either DETECT that error, or FAIL due to that error. The
results of programmer errors are "undefined". (This is the basis of the
POSIX standard wording on error returns -- there are "if occurs" errors,
which represent conditions that the programmer cannot reasonably
anticipate, such as insufficient resources; and there are "if detected"
errors, which are programmer errors that are not the responsibility of the
implementation. A friendly/robust implementation may choose to detect and
report some or all of the "if detected" errors -- but even when it fails to
detect the error, it's still the application's fault.)

The principal difference between a binary semaphore and a mutex is that a
mutex carries with it the concept of "ownership". It is that characteristic
that makes it illegal to unlock the mutex from another thread. The locking
thread OWNS the mutex, exclusively, until it unlocks the mutex. IF an
implementation can (and chooses to) detect and report violations of the
ownership protocol, the erroneous attempt at unlock will result in an EPERM
return. However, this is a programmer error. It is often unreasonably
expensive to keep track of which thread owns a mutex: an instruction (or
kernel call) to determine the identity of the locking thread may take far
longer than the basic lock operation. And of course it would be equally
expensive to check for ownership during unlock.

Many implementations of POSIX threads, therefore, do not record, or check,
mutex ownership. However, because it's a mutex, it IS owned, even if the
ownership isn't recorded. The next patch to your operating system might add
checking, or it might be possible to run threaded applications in a
heavyweight debug environment where mutex ownership is recorded and
checked... and the erroneous code will break the application. It'll be the
application's (well, the application developer's) fault.

Anyway, IF the implementation you're using really doesn't record or check
ownership of mutexes. And IF that illegal unlock is done as part of a
carefully managed "handoff" protocol so that there's no chance that the
owner actually needs the mutex for anything. (And, of course, if this
bizarre and illegal protocol is actually "correct" and consistent.) THEN,
your application should work despite the inherent illegality.

You could switch to a binary semaphore, and do the same thing without the
illegality. The application still won't WORK if you're releasing a lock
that's actually in use.

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/
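
A sketch of the "binary semaphore" alternative mentioned above: unlike a
mutex, a POSIX semaphore has no owner, so the thread that posts it need
not be the thread that last waited on it. (Whether the application's
handoff protocol itself is correct is a separate question.) The names
here are mine:

    #include <semaphore.h>

    static sem_t handoff_lock;

    void handoff_init(void)    { sem_init(&handoff_lock, 0, 1); } /* start "unlocked" */
    void handoff_acquire(void) { sem_wait(&handoff_lock); }       /* any thread       */
    void handoff_release(void) { sem_post(&handoff_lock); }       /* any thread, legally */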

=================================TOP===============================
 Q213:  Any advice on using gethostbyname_r() in a portable manner?  

>>>>> "Tony" == Tony Gale  writes:

    Tony> Anyone got any advice on using gethostbyname_r in a portable
    Tony> manner?  Its definition is completely different on the
    Tony> three systems I have looked at. Autoconf rules would be nice
    Tony> :-)

Sorry, no autoconf rules.  Here's what I did in a similar situation:

{
    struct hostent *hentp = NULL;
    int herrno;
    uint32 ipnum = (uint32)-1;
 
#if defined(__GLIBC__)
    /* Linux, others if they are using GNU libc.  We could also use Posix.1g
       getaddrinfo(), which should eventually be more portable and is easier to
       use in a mixed IPv4/IPv6 environment. */
    struct hostent hent;
    char hbuf[8192];
    if (gethostbyname_r(hostname, &hent,
                        hbuf, sizeof hbuf, &hentp, &herrno) < 0) {
        hentp = NULL;
    }
#elif defined(sun)
    /* Solaris 2.[456]. */
    struct hostent hent;
    char hbuf[8192];
    hentp = gethostbyname_r(hostname, &hent, hbuf, sizeof hbuf, &herrno);
#elif defined(__osf__)
    /* On Digital Unix 4.0 plain gethostbyname is thread-safe because it uses
       thread specific data (and a h_errno macro).  HPUX is rumoured to use
       this method as well.  This will go wrong on Digital Unix 3.2, but this
       whole file is not going to compile there anyway because version 3.2 has
       DCE threads instead of Posix threads. */
    hentp = gethostbyname(hostname);
    herrno = h_errno;
#else
#error I do not know how to do reentrant hostname lookups on this system
#endif
 
    if (hentp == NULL) {
        /* Digital Unix doesn't seem to have hstrerror :-(. */
        hmddns_logerror("gethostbyname(%s): %d", hostname, herrno);
    } else {
        memcpy(&ipnum, hentp->h_addr, sizeof ipnum);
    }
    return ipnum;
}

Regards,
Bas.
 

From: David Arnold 

we're using this ...


OLDLIBS=$LIBS
LIBS="$LIBS $LIB_GHBN_R"
AC_CHECK_FUNC(gethostbyname_r, [
  AC_DEFINE(HAVE_GETHOSTBYNAME_R)
  AC_MSG_CHECKING([gethostbyname_r with 6 args])
  OLD_CFLAGS=$CFLAGS
  CFLAGS="$CFLAGS $MY_CPPFLAGS $MY_THREAD_CPPFLAGS $MY_CFLAGS"
  AC_TRY_COMPILE([
#   include <netdb.h>
  ], [
    char *name;
    struct hostent *he, *res;
    char buffer[2048];
    int buflen = 2048;
    int h_errnop;

    (void) gethostbyname_r(name, he, buffer, buflen, &res, &h_errnop)
  ], [
    AC_DEFINE(HAVE_GETHOSTBYNAME_R_6_ARG)
    AC_MSG_RESULT(yes)
  ], [
    AC_MSG_RESULT(no)
    AC_MSG_CHECKING([gethostbyname_r with 5 args])
    AC_TRY_COMPILE([
#     include <netdb.h>
    ], [
      char *name;
      struct hostent *he;
      char buffer[2048];
      int buflen = 2048;
      int h_errnop;

      (void) gethostbyname_r(name, he, buffer, buflen, &h_errnop)
    ], [
      AC_DEFINE(HAVE_GETHOSTBYNAME_R_5_ARG)
      AC_MSG_RESULT(yes)
    ], [
      AC_MSG_RESULT(no)
      AC_MSG_CHECKING([gethostbyname_r with 3 args])
      AC_TRY_COMPILE([
#       include <netdb.h>
      ], [
        char *name;
        struct hostent *he;
        struct hostent_data data;

        (void) gethostbyname_r(name, he, &data);
      ], [
        AC_DEFINE(HAVE_GETHOSTBYNAME_R_3_ARG)
        AC_MSG_RESULT(yes)
      ], [
        AC_MSG_RESULT(no)
      ])
    ])
  ])
  CFLAGS=$OLD_CFLAGS
], [
  AC_CHECK_FUNC(gethostbyname, AC_DEFINE(HAVE_GETHOSTBYNAME))
])
LIBS=$OLDLIBS


> Whom do I shoot?

take your pick :-(


-- David Arnold  
CRC for Distributed Systems Technology         +617 33654311   (fax)
University of Queensland                    [email protected] (email)
Australia                        (web) 
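
The comment in the first example above mentions getaddrinfo(); since it is
specified to be thread-safe, it avoids the gethostbyname_r() portability
mess entirely. A minimal IPv4 lookup sketch (names are mine):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netdb.h>
    #include <string.h>

    int lookup_ipv4(const char *hostname, struct in_addr *out)
    {
        struct addrinfo hints, *res;

        memset(&hints, 0, sizeof hints);
        hints.ai_family = AF_INET;              /* IPv4 only, for simplicity */

        if (getaddrinfo(hostname, NULL, &hints, &res) != 0)
            return -1;
        *out = ((struct sockaddr_in *)res->ai_addr)->sin_addr;
        freeaddrinfo(res);
        return 0;
    }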

=================================TOP===============================
 Q214: Passing file descriptors when exec'ing a program.  

Jeff Garzik wrote:
> 

> My MT program must send data to the stdin of multiple processes.
> It also needs to read from the stdout of those _same_ processes.
> 
> How can this be done?

use the dup() function to save your parent stdin and stdout (if needed).
For each child process do:
   create two pipe()'s
   close stdin
   dup() one end of the first pipe
   close stdout
   dup the other end of the second pipe
   fork()
   exec()
   close unused ends of pipes
   save the pipe fd's for later use
restore parent's stdin and stdout (if needed)
add the pipe fds to an fd_set
use select() call to detect when child input from pipe is available


From quick Web search for examples:

  http://www.esrf.fr/computing/bliss/css/spec/help/piper.html
  http://www1.gly.bris.ac.uk/~george/unix-procs/papif-nops.c
  http://www.mit.edu/afs/athena/user/g/h/ghudson/info/pty.c

A book? Hard to be a top notch Unix programmer without this one on your
shelf:

  Advanced Programming in the Unix Environment
  W. Richard Stevens, Addison-Wesley Publishing
  ISBN 0-201-56317-7

Good luck!

% use the dup() function to save your parent stdin and stdout (if needed).

Good suggestion, although I'd suggest using dup2() to replace stdin and
stdout with the pipe ends.  If you do this, you have to be careful about
any code that uses stdin and stdout in the rest of your program -- you've
got to be sure you never try to use these handles while they're being set
up for the child process.
--

Patrick TJ McPhee
East York  Canada
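
A sketch of the per-child setup described above, using dup2() as Patrick
suggests; error handling is omitted and the command path is whatever you
want to run. The parent gets its ends of the two pipes back through
to_child and from_child (names are mine):

    #include <sys/types.h>
    #include <unistd.h>

    pid_t spawn_filter(const char *path, int *to_child, int *from_child)
    {
        int in_pipe[2], out_pipe[2];
        pid_t pid;

        pipe(in_pipe);                   /* parent writes -> child stdin  */
        pipe(out_pipe);                  /* child stdout  -> parent reads */

        pid = fork();
        if (pid == 0) {                  /* child */
            dup2(in_pipe[0], 0);         /* stdin  comes from in_pipe     */
            dup2(out_pipe[1], 1);        /* stdout goes into out_pipe     */
            close(in_pipe[0]);  close(in_pipe[1]);
            close(out_pipe[0]); close(out_pipe[1]);
            execl(path, path, (char *)NULL);
            _exit(127);                  /* exec failed */
        }
        /* parent: close the ends the child uses, keep the others */
        close(in_pipe[0]);  close(out_pipe[1]);
        *to_child   = in_pipe[1];
        *from_child = out_pipe[0];
        return pid;
    }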

=================================TOP===============================
 Q215:  Thread ID of thread getting stack overflow?   
Kurt Berg wrote:

> We are seeking a PORTABLE way of getting the thread ID
> of a thread experiencing a stack overflow.  We have to do
> some post processing to try to determine, given the thread
> ID, what sort of thing to do.
>
> It is our understanding that pthread_self is NOT "async
> signal safe".
>
> Thanks in advance.

Umm, as I mentioned in my reply to your email, once you buy into the
concept of doing "portable" things in a signal handler (which represents
a serious ERROR within the process), you're climbing a steep slope with
no equipment. Your fortune cookie says that a disastrous fall is in your
future.

I also commented that, although pthread_self isn't required by the
standard to be async-signal safe, it probably IS, (or, "close enough"),
on most platforms. And in a program that's done something "naughty" and
unpredictable, asynchronously, to its address space, that's as good as
you're going to get regardless of any standard.

However, you neglected to mention in your email that the SIGSEGV you
wanted to handle was a stack overflow. Now this leads to all sorts of
interesting little dilemmas that bring to mind, (among other things),
Steven Wright's famous line "You can't have everything: where would you
put it?" (Actually, the answer is that if you had everything, you could
leave it right where it was, but that's beside the point.) Your system
is telling you that you've got no stack left. While some systems might
support a per-thread alternate signal stack, that's not required by the
standards (and, in any case, it's kinda expensive since you need to
allocate an alternate stack for each thread you create). So... you've
used all your stack, and you want to handle the error. On what? The
stack you've used up? Or the other stack that you can't even designate
portably?

Sure, on SOME systems, you may be able to determine (at least sometimes)
that you're "near the end" of the stack, before you're actually there.
The Alpha calling standard, for example, requires the compiler to
"probe" the stack before changing the stack pointer to allocate a new
frame. Thus, if the probe generates a SIGSEGV, you've still got whatever
it was you were trying to allocate. MAYBE that's enough to run a signal
handler.

Unfortunately, "maybe", "sometimes", and "some systems" are not words
that contribute to a portable solution.

The answer is that you're as out of luck as your thread (even if you
still have stack). What you want to do is DEBUGGING... so leave it to
the debugger. Make sure that SIGSEGV is set to SIG_DFL. Let the ailing
process pass away peacefully, and analyze the core file afterward. (And
if you're faced with a system that doesn't support analysis of a
threaded process core file, then find a patch... or turn around and face
another system.)

And if you're just trying to leave a log entry to later trace the
failure of some reliable networking system, remember that a thread ID is
transient and local. It means absolutely nothing within another process,
or even at some other time within the same process. Why would you want
to log it? Without the core file, the information is useless; and with
the core file, it's redundant.

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/


=================================TOP===============================
 Q216:   Why aren't my (p)threads preempted?  
 
Lee Jung Wooc wrote:

> I have my opinion and question.
>
> IMO, the cases that showed "thread 2" are not due to cpu preemption, but to
> normal scheduling. The printf() call in the thread function induces a system
> call, write(), but printf is a library function and will not lose the cpu
> until some amount of bytes are stored in the buffer and fflush() is called.
> The fflush() calls write(), then switching occurs. Library buffer size may
> influence when the switching occurs, and also the second pthread_create call
> may switch the cpu to the first thread.

A thread MAY be timesliced at any time, on any system that supports
timeslicing. As I said, while SCHED_RR requires timeslicing, SCHED_OTHER does
not prohibit timeslicing. (Only SCHED_FIFO prohibits timeslicing.)

In addition to system calls, a thread might block on synchronization within the
process. For example, the buffer written to by printf() is shared among all
threads using stdout, and has to be synchronized in some fashion. Usually, that
means a mutex (or possibly a semaphore). If multiple threads use printf()
simultaneously (and, especially with one of the predefined file streams, like
stdout, it needn't be used by any of YOUR threads to have simultaneous access),
one of them may BLOCK attempting to acquire the synchronization object. That
would result in a context switch.

> I'm assuming the write() call , known to be none-blocking in normal cases,
> can lead to switching.  IMHO, the word none-blocking means that the calling
> context is not scheduled after a context(thread or process, whatever) which
> is assumed to be waitng for an event infinitively.

That's "non-blocking", not "none-blocking". (I mean no disrespect for your
English, which is far better than I could manage in your language, but while I
can easily ignore many "foreign speaker" errors, this one, maybe especially
because you chose to define it, stood out and made me uncomfortable.)

> Is my assumption correct?

I'm afraid I can't make much sense of your second sentence.

DIGRESSION ALERT (including slings and arrows not specifically targeted to, nor
especially deserved by, the person who wrote the quoted phrase): I find it
difficult to read anything that starts with "IMHO", an abbreviation that I
despise, and which is almost always hypocritical because an opinion one takes
such care to point out is almost never intended to be "humble". It's quite
sufficient to simply declare your opinion. I, and all other responsible
readers, will assume that EVERYTHING we read is in fact the author's opinion,
except (perhaps) when the author specifically claims a statement to be
something else. And even then we'll question whether the author has in fact the
authority and knowledge to make such a claim. In the rare cases where it might
actually be useful to explain that your opinion is your opinion, you might
simply say so without the cloying abbreviation.

With that out of the way, where were we? Oh yes, write().

It's true that most traditional UNIX systems have a bug by which I/O to a file
oriented device is not considered capable of "blocking". That's unfortunate.
It's particularly unfortunate in a threaded world, because some other thread
might be capable of doing a lot of work even while an I/O works its way through
to the unified buffer cache; much less if the data must actually be written to
a remote NFS file system. In any case, this does not usually apply to other
types of file system. If stdout is directed to a terminal, or to a network
socket, the printf()'s occasional write() SHOULD result in blocking the thread.
The write() syscall IS, technically, a "blocking function", despite the fact
that some calls to it might not block. Being a "blocking function" does not
necessarily require that every call to that function block the calling thread.

> As far as I know, there's no implementation of SCHED_RR in major unix
> distributions. Nor do I think the feature is in definite demand.

I believe that Linux implements SCHED_RR fully. I know that Digital UNIX does,
and always has. I have some reason to believe that Solaris 7 implements
SCHED_RR, and I suspect that AIX 4.3 does as well. I'd be surprised if IRIX
(known for its realtime support) didn't support SCHED_RR (not that I haven't
already been surprised by such things). I don't have any idea whether HP-UX
11.0 has SCHED_RR... perhaps that's the "major unix distribution" you're
talking about?

As for demand. Oh yes, there's a very high (and growing) demand, especially for
the sort of "soft realtime" scheduling (and not always all that "soft") that
can be used to build reliable and "highly available" network server systems.
Anybody who doesn't support SCHED_RR either has no customers interested in
networking, or you can safely bet cash that they've already received numerous
requests, bribes, and threats from customers with names that just about anyone
would recognize.

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/
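
A sketch of requesting SCHED_RR (round-robin timeslicing) for a new thread,
as discussed above; this typically requires privilege and the realtime
scheduling option. The thread_main() routine and priority choice are
placeholders of mine:

    #include <pthread.h>
    #include <sched.h>

    extern void *thread_main(void *arg);

    int start_rr_thread(pthread_t *tid, int priority)
    {
        pthread_attr_t attr;
        struct sched_param param;

        pthread_attr_init(&attr);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_RR);
        param.sched_priority = priority;   /* e.g. sched_get_priority_min(SCHED_RR) */
        pthread_attr_setschedparam(&attr, &param);
        return pthread_create(tid, &attr, thread_main, NULL);
    }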

 
=================================TOP===============================
 Q217: Can I compile some modules with and others without _POSIX_C_SOURCE?  

Keith Michaels wrote:

> The _malloc_unlocked looks REAL suspicious now.  Am I getting the wrong
> malloc linked into my program?  The program contains two modules compiled
> separately: all the posix thread stuff is compiled with
> -D_POSIX_C_SOURCE=199506L, and the non-posix module is compiled without it.
> This is necessary because many system interfaces that I need are not
> available in posix (resource.h, bitmap.h, sys/fs/ufs*.h do not compile
> under posix).
>
> Is the traceback above evidence I have built the program incorrectly?

I don't know whether the traceback is evidence, but, regardless, you
HAVE built the program incorrectly. I don't know whether that incorrectness is
relevant. It's hard to believe that source files compiled without "thread
support" on Solaris would be linked to a non-thread-safe malloc() -- but, if
so, that could be your problem.

You don't need to define _POSIX_C_SOURCE=199506L to get thread support, though
that is one way to do it. Unfortunately, as you've noted, defining that symbol
has many other implications. You're telling the system that you intend to
build a "strictly conforming POSIX 1003.1-1996 application", and therefore
that you do not intend to use any functions or types that aren't defined by
that standard -- and in addition that you reserve the right to define for your
own use any symbols that are not specifically reserved by that standard for
implementation use.

Solaris, like Digital UNIX, (and probably others, though I don't know), has a
development environment that, by default, supports a wide range of standard
and non-standard functions and types. That's all fine, as long as they don't
conflict and as long as the application hasn't required that the environment
NOT do this, as by defining _POSIX_C_SOURCE. To compile threaded code on
Solaris (or Digital UNIX) that is not intended to be "strictly conforming
POSIX 1003.1-1996" you should define only the symbol _REENTRANT. You'll get
the thread-safe versions of any functions or symbols (e.g., errno) where
that's relevant, without restricting your use of non-POSIX capabilities of the
system. DEC C on Digital UNIX provides the proper defines when you compile
with "cc -pthread". I believe that Solaris supports "cc -mt", (though I didn't
know about that the last time I tried to build threaded code on Solaris, so I
haven't checked it).

Don't use -D_POSIX_C_SOURCE=199506L unless you really MEAN it, or if the
system you're using doesn't give you any alternative for access to thread
functions. (As I said, you never need it for Digital UNIX or Solaris.) And
always build ALL of the code that you expect to live in a threaded process
with the correct compiler options for your system. Otherwise, at best, they
may disagree on the definition of common things like errno; and, at worst, the
application may not be thread-safe.

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/



=================================TOP===============================
 Q218: timed wait on Solaris 2.6?  
 
[email protected] wrote:

> I read from somewhere that pthread_cond_timedwait should only be
> used "in realtime situations".  Since Solaris doesn't support the
> realtime option of pthread, does it mean pthread_cond_timedwait
> should not be used on Solaris at all?

Condition variables are a "communication channel" between threads that are
sharing data. You can wait for some predicate condition to change, and you
can inform a thread (or all threads) waiting for a condition that it has
changed.

There's nothing intrinsically "realtime" about them, at that level.

You can also have a wait time out after some period of time, if the
condition variable hasn't been signalled. That's not really "realtime",
either, although the nanosecond precision of the datatype does originate
in the needs of the realtime folk who developed 1003.1b (the realtime
POSIX extensions).

On a system that supports the POSIX thread realtime scheduling option
(which, as you commented, Solaris 2.6 doesn't support -- though it
erroneously claims to), multiple threads that have a realtime scheduling
policy, and are waiting on a condition variable, must be awakened in
strict priority order. That, of course, is obviously a realtime constraint
-- but it doesn't apply unless you have (and are using) the realtime
scheduling extensions.

> I tried to use pthread_cond_timedwait in my application and got
> various weird results.
>
> 1. Setting tv_nsec doesn't seem to block the thread at all.  I
>    guess Solaris might just ignore this field (the value I gave
>    was 25,000,000).

Define "at all". How did you attempt to measure it? By looking at the
sweep second hand on your watch? Using time(1)? Calling gettimeofday()
before and after? Querying a hardware cycle counter before and after? Your
25000000 nanoseconds is just 25 milliseconds.

However, what may have happened is that you specified tv_sec=0, or
time(NULL), and then set tv_nsec to 25000000. With tv_sec=0, that's a
long, long way in the past, and the wait would indeed timeout immediately.
Even with tv_sec=time(NULL), remember that you may well have a nanosecond
"system time" of .026 seconds, and you're setting an absolute timeout
of .025. You really shouldn't use time(NULL) to create a struct
timespec timeout. You should use clock_gettime(). If you want to use small
waits, you may also need to check clock_getres(), which returns the clock
resolution. If your system supports a resolution of 0.1 second, for
example, there's not much point to setting a wait of 0.025 seconds.
(You'll get up to a 0.1 second wait anyway.)

> 2. The thread blocks and yields fine if I use "time(NULL) + 1" in
>    the tv_sec field.  However the thread eventually hangs in some
>    totally irrelevant code (in the system code `close' when I try
>    to close a socket descriptor).

There's no connection between condition waits and sockets, so most of this
item seems completely irrelevant.

> We are thinking of using another thread that sleeps (with nanosleep)
> for a period of time and then wakes up and signals other threads
> as a timer now.  Has anyone tried this approach before?

Depends on what you are really trying to accomplish. I don't see any
application of this technique that has anything to do with the rest of
your message.

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/
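
A sketch of the absolute-timeout construction Dave describes: build the
timespec from clock_gettime(CLOCK_REALTIME) rather than time(NULL), so the
sub-second part is meaningful. The wrapper name and the predicate/mutex/
condition variable arguments are assumptions of mine; the caller is assumed
to hold the mutex:

    #include <pthread.h>
    #include <time.h>

    int wait_up_to_ms(pthread_cond_t *cond, pthread_mutex_t *mutex,
                      int *predicate, long ms)
    {
        struct timespec abstime;
        int rc = 0;

        clock_gettime(CLOCK_REALTIME, &abstime);
        abstime.tv_sec  += ms / 1000;
        abstime.tv_nsec += (ms % 1000) * 1000000L;
        if (abstime.tv_nsec >= 1000000000L) {       /* carry into seconds */
            abstime.tv_sec  += 1;
            abstime.tv_nsec -= 1000000000L;
        }
        while (!*predicate && rc == 0)
            rc = pthread_cond_timedwait(cond, mutex, &abstime);
        return rc;                                  /* 0, or ETIMEDOUT */
    }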

=================================TOP===============================
 Q219:  Signal delivery to Java via native interface  

"Stuart D. Gathman" wrote:

> Dave Butenhof wrote:
> >
> I am trying to figure out a way to handle signals synchronously in a Java VM.
> I have a thread calling sigwait() which reports supported signals
> synchronously to Java.  But I have no control over other threads in the VM -
> so I can't get them to block the signals.  The sneaky solution of blocking the
> signal in a handler probably won't work in AIX - the man page says "Concurrent
> use of sigaction and sigwait for the same signal is forbidden".

It cannot legally say that, and it may not be saying what it seems to. There's no
restriction in POSIX, or in UNIX98, against using both. However, POSIX does say that
calling sigwait() for some signal MAY change the signal action for that signal. If
you have a silly implementation that actually does this (there's no point except
with a simple purely user-mode hack like the old DCE threads library), then trying
to combine them may be pointless -- but it's not illegal. (And, by the way, if
you're using any version of AIX prior to 4.3, then you ARE using that very
"user-mode hack" version of DCE threads, and you're not really allowed to set signal
actions for any "asynchronous" signal.)

Of course, in practice, such distinctions between "forbidden" and "legal but
meaningless" aren't useful, so one could argue that the incorrect statement "is
forbidden" may not be entirely unjustified. ;-)

> One idea is to have the handler notify the signal thread somehow - not
> necessarily with a signal.  Is there some kind of event queue that could be
> called from a signal handler?

You can call sem_post() from a signal handler. Therefore, you could have a thread
waiting on a semaphore (sem_wait()), and have the signal call sem_post() to awaken
the waiter.

> Another idea is to have the signal thread call sigsuspend.  Then, if the
> handler could determine whether the thread it interrupted is the signal
> thread, it could block the signal all threads except the signal thread.

I don't think I understand what you mean here. One thread cannot block a signal in
other threads. And that "if" hiding in the phrase "if the handler could determine"
is a much bigger word than it might seem. You cannot do that using any portable and
reliable mechanism.

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/
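
A sketch of the sem_post()-from-a-handler idea above: sem_post() is
async-signal safe, so the handler only posts, and a dedicated thread does
the real work outside signal context. The signal choice (SIGUSR1) and the
names are mine:

    #include <pthread.h>
    #include <semaphore.h>
    #include <signal.h>
    #include <string.h>
    #include <errno.h>

    static sem_t sig_sem;

    static void on_signal(int signo)
    {
        (void)signo;
        sem_post(&sig_sem);             /* the only thing the handler does */
    }

    static void *signal_worker(void *arg)
    {
        (void)arg;
        for (;;) {
            while (sem_wait(&sig_sem) == -1 && errno == EINTR)
                continue;               /* restart if interrupted */
            /* handle the event here, in normal thread context */
        }
        return NULL;
    }

    void install_signal_worker(void)
    {
        pthread_t t;
        struct sigaction sa;

        sem_init(&sig_sem, 0, 0);
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_signal;
        sigaction(SIGUSR1, &sa, NULL);
        pthread_create(&t, NULL, signal_worker, NULL);
    }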

=================================TOP===============================
 Q220: Concerning timedwait() and realtime behavior.  

Bil Lewis wrote:

>   First, the definition I'm using of "realtime" is real time i.e.,
> wall clock time.  In computer literature the term is not well-defined,
> though colloquially "soft realtime" means "within a few seconds" while
> "hard realtime" means "less than 100ms."  (I've been quite upset
> with conversations with realtime programmers who talk about 100%
> probabilities, time limits etc.  Ain't no such thing! This muddies
> the RT waters further, but we'll leave that for another time.)
>
>   As such, anything that refers to the wall clock (e.g., pthread_cond_timedwait())
> is realtime.

"Realtime" means "wall clock time"? Wow.

>   I think important that people using timed waits, etc. recognize this
> and write programs appropriately.  (I admit to some overkill here, but
> think some overkill good.)

Yeah, it's important to remember that you're dealing with an "absolute" time
(relative to the UNIX Epoch) rather than a "relative" time (relative to an
arbitrary point in time, especially the time at which the wait was initiated). The
sleep(), usleep(), and nanosleep() functions are relative. The
pthread_cond_timedwait() function is absolute. So if "realtime" means "absolute
time" (which has some arbitrary correlation, one might assume, to "wall clock
time"), then, yeah, it's realtime.

> > Condition variables are a "communication channel" between threads that are
> > sharing data. You can wait for some predicate condition to change, and you
> > can inform a thread (or all threads) waiting for a condition that it has
> > changed.
> >
> > There's nothing intrinsically "realtime" about them, at that level.
> >
> > You can also have a wait time out after some period of time, if the
> > condition variable hasn't been signalled. That's not really "realtime",
> > either, although the nanosecond precision of the datatype does originate
> > in the needs of the realtime folk who developed 1003.1b (the realtime
> > POSIX extensions).
>
>   And here's the sticky point: 'That's not really "realtime"'.  It sure
> isn't hard realtime.  (Many people don't refine their terms.)  But it is
> real time.

Reality is overrated. It certainly has little to do with programming. No, it's not
"realtime" by any common computer science/engineering usage. "Realtime" isn't a
matter of the datatype an interface uses, but rather of the real world constraints
placed on the interface!

An interface that waits for "10 seconds and 25 milliseconds plus or minus 5
nanoseconds, guaranteed, every time" is realtime. An interface (like
pthread_cond_timedwait()) that waits "until some time after 1998 Dec 07
13:08:59.025" is not realtime, because no (useful) real world (real TIME)
constraints are placed on the behavior.

>   So what's my point?  Maybe just that we need some well-defined terminology
> here?

We're talking about POSIX functions, so let's try out the POSIX definition:

     "Realtime in operating systems: the ability of the operating system to
     provide a required level of service in a bounded response time."

Does pthread_cond_timedwait() "provide a required level of service in a bounded
response time"? No, absolutely not, except in conjunction with the scheduling
guarantees provided by the realtime scheduling option.

Of course, in a sense, it is bounded -- pthread_cond_timedwait() isn't allowed to
return BEFORE the specified time. But that's not a useful bound. What realtime
people want is the other direction... "if I wait for 25 milliseconds, what, worst
case, is the LONGEST interval that might pass before control is returned to my
application".

You're correct that "hard" and "soft" realtime aren't quite so firmly defined. In
normal use, soft realtime usually means that it shouldn't be too long, most of the
time, or someone'll get annoyed and write a firmly worded letter of protest. Hard
realtime means the plane may crash if it's too long by more than a few nanoseconds.
Hard realtime does not necessarily require fine granularity, or even 100%
precision (though some applications do require this). The principal requirement is
predictability.

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/


=================================TOP===============================
 Q221: pthread_attr_getstacksize on Solaris 2.6  

[email protected] wrote:

> I am trying to find out what default stack size each thread has by the
> following code (taken from Pthreads Programming) in non-main threads:
>
>   size_t default_stack_size = -1;
>
>   pthread_attr_t stack_size_custom_attr;
>
>   pthread_attr_init( &stack_size_custom_attr );
>
>   pthread_attr_getstacksize( &stack_size_custom_attr, &default_stack_size );
>   printf( "Default stack size = %d\n", default_stack_size );
>
> The output is 0. Can anyone explain this? Thanks.

Yes, I can explain that. "0" is the default value of the stacksize attribute on
Solaris. Any more questions? ;-)

POSIX says nothing about the stacksize attribute, except that you can set it to
the size you require. It doesn't specify a default value, and it doesn't
specify what that default means. It does say that any attempt to specify a
value less than PTHREAD_STACK_MIN is an error. Therefore, it's perfectly
reasonable (though odd and obscure) to have a default of 0, which is distinct
from any possible value the user might set.

When you actually create a thread, the Solaris thread library looks at the
stacksize attribute, and, if it's 0, substitutes the actual runtime default.
That's pretty simple.

I happen to prefer the way I implemented it. (I suppose that goes without
saying.) When you create a thread attributes object, the stacksize attribute
contains the actual default value, and the code you're using will work.

But the real point, and the lesson, is that what you're trying isn't portable.
While it's not quite "illegal", it's close. Another way of putting it is that
you've successfully acquired the information you requested. The fact that it
happens to be absolutely useless to you is completely irrelevant.
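
If what you actually need is a known stack size, the portable route is to set one
yourself rather than to ask for the default. A sketch (the 1 MB figure and the
function names are just examples):

#include <limits.h>
#include <pthread.h>

extern void *start_routine(void *);     /* assumed to exist elsewhere */

int create_with_known_stack(pthread_t *tid)
{
    pthread_attr_t attr;
    size_t size = 1024 * 1024;          /* example: 1 MB */
    int rc;

    if (size < PTHREAD_STACK_MIN)       /* never go below the required minimum */
        size = PTHREAD_STACK_MIN;

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, size);
    rc = pthread_create(tid, &attr, start_routine, NULL);
    pthread_attr_destroy(&attr);
    return rc;
}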

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/

=================================TOP===============================
 Q222:  LinuxThreads: Problem running out of TIDs on pthread_create  
 
Kaz Kylheku wrote:

> ( The comp.programming.threads FAQ wrongfully suggests a technique of using a
> suitably locked counter that is incremented when a detached thread is created
> and decremented just before a detached thread terminates. The problem is that
> this does not eliminate the race condition, because a thread continues to
> exist after it has decremented the counter, so it's possible for the counter
> to grossly underestimate the actual number of detached threads in existence.
> I'm surprised at this *glaring* oversight. )

[This is true, but not very likely.  Like never.  Still, Kaz is right.  -Bil]

In most programs this is simple and reliable, because threads tend to execute the
short amount of code at the end without blocking. That, of course, is not always
true. And you're correct that in some cases, especially with a heavily loaded
system, there can be delays, and they can be significant.

> The only way to be sure that a detached thread has terminated is to poll it
> using pthread_kill until that function returns an error, which is ridiculous.
> That's what joinable threads and pthread_join are for.

That won't work unless you know that no new threads are being created during the
interval. (Anywhere in the process... and you can only know that if you're
writing a monolithic application that calls no external code.) That's because a
POSIX thread ID (pthread_t) may be reused as soon as a thread is both terminated
and detached. (Which, for a detached thread, means as soon as it terminates.)
This won't always happen, and, in some implementations, (almost certainly
including Linux, which probably uses the process pid), may "almost never" happen.
Still, code that uses this "trick" isn't portable, or even particularly reliable
on an implementation where it happens to work most of the time.

Your summary is absolutely correct: that's why join exists.

> Because of this race, you should never create detached threads in an unbounded
> way. Programs that use detached threads should be restricted to launching a
> *fixed* number of such threads.
>
> I don't believe that detached threads have any practical use at all in the
> vast majority of applications.  An application developed in a disciplined
> manner should be capable of an orderly shutdown during which it joins all of
> its threads.  I can't think of any circumstance in which one would absolutely
> need to create detached threads, or in which detached threads would provide
> some sort of persuasive advantage; it's likely that the POSIX interface for
> creating them exists only for historic reasons.

I believe that detached threads are far easier to use for the vast majority of
programs. Joining is convenient (but not necessary) for any thread that must
return a single scalar value to its creator. Joining is essential when you need
to be "reasonably sure" that the thread has given up its system resources before
going on to something else. In any other case, why bother? Let your threads run
independently, exchange information as needed in any reasonable way, and then
quietly "evaporate" in a puff of greasy black smoke.

> (The FAQ, however, for some unexplained reason, suggests that detached threads
> are preferred over joinable.)

[Personal preference only -Bil]

And I agree, though they're clearly not appropriate in situations where you're
flooding the system with threads (which isn't a design I'd recommend anyway), and
you really need to know when one is gone to avoid resource problems.
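
For reference, a minimal sketch of creating a detached thread through the
attributes object (the worker function is illustrative):

#include <pthread.h>

extern void *worker(void *);        /* illustrative thread function */

int spawn_detached(void *arg)
{
    pthread_t tid;
    pthread_attr_t attr;
    int rc;

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    rc = pthread_create(&tid, &attr, worker, arg);
    pthread_attr_destroy(&attr);
    return rc;      /* the thread quietly "evaporates" when worker() returns */
}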

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/

 

=================================TOP===============================
 Q223:  Mutexes and the memory model  
 

Kaz Kylheku wrote:

> In article ,
> Keith Michaels  wrote:
> >I know that mutexes serialize access to data structures and this
> >can be used to enforce a strongly ordered memory model.  But what
> >if the data structure being locked contains pointers to other
> >structures that were build outside of mutex control?
>
> The mutex object is not aware of the data that it is protecting; it is
> only careful programming discipline that establishes what is protected.
> If some pointers are protected by a mutex, it may be the case that the
> pointed-at objects are also protected. Or it might be the case that such
> objects are not protected by the mutex.
>
> Any object that is accessed only whenever a certain mutex is held is
> implicitly protected by that mutex.

This is a really good statement, but sometimes I like to go the opposite
direction to explain this.

The actual truth is that mutexes are selfish and greedy. They do NOT protect your
data, or your code, or anything of the sort. They don't know or care a bit about
your data. What they do, and very well, is protect themselves. Aside from mutual
exclusion (the "bottleneck" function), the law says that when you lock a mutex,
you have a "coherent view of (all) memory" with the thread that last unlocked the
mutex. If you carefully follow the rules, that is enough.

As Kaz says, you need to apply careful programming discipline in order to be
protected by a mutex. First, never touch shared data when you don't have a mutex
locked... and all the threads potentially touching any shared data must agree on
a single mutex (or the same set of mutexes) for this purpose. (If one thread
locks mutex A to touch a queue, while another locks mutex B to touch a queue,
you've got no protection.) And, if you are moving a piece of data between
"private" and "shared" scopes, you must agree on a single mutex for the
transition. (You can modify private data as you wish, but you must always lock a
mutex before making that private data shared, and before making shared data
private again -- as in queueing or dequeueing a structure.) If your structure
contains pointers to other structures, then they're in the same scope. If there
may also be other pointers to the data, you need to make sure all threads agree,
at every point in time, whether the "secondary" data is private or shared.
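
A minimal sketch of that private/shared transition, using an illustrative singly
linked queue (the names are made up):

#include <pthread.h>
#include <stdlib.h>

struct item {
    struct item *next;
    int payload;
};

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static struct item *q_head;          /* shared: touch only with q_lock held */

void enqueue(int payload)
{
    /* Private: nobody else can see it yet, so no lock is needed here. */
    struct item *it = malloc(sizeof *it);
    if (it == NULL)
        return;
    it->payload = payload;

    /* Transition private -> shared: must hold the agreed-upon mutex. */
    pthread_mutex_lock(&q_lock);
    it->next = q_head;
    q_head = it;
    pthread_mutex_unlock(&q_lock);
    /* From here on, 'it' is shared; don't touch it without the lock. */
}

struct item *dequeue(void)
{
    struct item *it;

    /* Transition shared -> private: again under the same mutex. */
    pthread_mutex_lock(&q_lock);
    it = q_head;
    if (it != NULL)
        q_head = it->next;
    pthread_mutex_unlock(&q_lock);

    /* 'it' (and anything it points to) is now private to this thread. */
    return it;
}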

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/

 
=================================TOP===============================
 Q224: Poor performance of AIO in Solaris 2.5?  

Bil Lewis wrote:
> 
> Douglas C. Schmidt wrote:
> >
> > Hi Mike,
> >
> > ++ I have an application that needs to write files synchronously (i.e: a
> > ++ database-like application). I figured I should try and use the "aio"
> > ++ family of system calls so that several such writes can be in progress
> > ++ simultaneously if the files are on different disks. (A synchronous write
> > ++ takes around 12-16 msecs typically on my machine.)
> > ++
> > ++ I would have expected that the lio_listio() would be no slower than 2
> > ++ write()'s in series, but it seems to be 4-5 times worse.
> >
> > Our ad hoc tests using quantify/purify seem to indicate that the
> > aio*() calls on Solaris are implemented by spawning a thread PER-CALL.
> > This is most likely to be responsible for the high overhead.  I'm not
> > sure how other operating systems implement the aio*() calls, but
> > clearly spawning a thread for each call will be expensive.
> 
>   I never worked with the AIO stuff, but this does sound correct...  AIO
> was done with threads & creating a new thread per AIO call sounds likely.
> But it's not terribly expensive.  Depending upon the machine it should
> add no more than 20-100us.  You wouldn't even be able to MEASURE that.
> 
>   Something is rotten in the state of Denmark.

and I suspect it's the scheduler.  I've written a threaded external
call-out/call-back system for our VisualWorks Smalltalk environment
(http://www.parcplace.com/products/thapi/).  It runs on Windows
NT/95/98, OS/2, Intel Linux, Digital Unix, HPUX, AIX and Solaris.  The
scheme maintains a thread-farm, and threads in the farm are used to make
external call-outs, and a rendezvous mechanism is used to respond to
threaded call-ins.

On all but Solaris the performance of a simple threaded call-out to a
null routine is approximately 50 times slower than a non-threaded
call-out (e.g. a simple threaded callout on Intel Linux using a 180 MHz
Pentium Pro is about 85 usec).  But on Solaris it is an order of
magnitude worse (e.g. a simple threaded callout on an UltraSPARC 1 takes
at least 800usecs).  

Since the system uses a thread farm, thread creation times aren't
relevant in determining performance.  Instead, the performance is
determined by pthread_mutex_lock, pthread_cond_signal,
pthread_mutex_unlock, pthread_cond_wait, pthread_cond_timedwait and the
underlying scheduler.

Dormant threads in the farm are waiting in pthread_cond_wait.  When a
call-out happens the main/Smalltalk thread marshalls the call into some
memory, chooses a thread and does a
{pthread_mutex_lock;pthread_cond_signal;pthread_mutex_unlock} to wake
the thread and let it perform the call.  On return the thread signals
the main thread and enters a pthread_cond_timedwait (if it times out
the main thread is resignalled and the wait reentered).  The
main/Smalltalk thread responds to the signal by collecting the result of
the call.
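
A condensed sketch of that wake-up handshake, one slot per farm thread (names are
made up; mutex/condvar initialization and the reply path back to the main thread
are omitted):

#include <pthread.h>

struct slot {
    pthread_mutex_t lock;        /* initialized elsewhere with pthread_mutex_init */
    pthread_cond_t  wake;        /* initialized elsewhere with pthread_cond_init  */
    int             has_work;    /* predicate: a call has been marshalled         */
};

/* Main/Smalltalk thread: hand work to a dormant farm thread. */
void post_call(struct slot *s)
{
    pthread_mutex_lock(&s->lock);
    s->has_work = 1;
    pthread_cond_signal(&s->wake);
    pthread_mutex_unlock(&s->lock);
}

/* Farm thread: sleep until work arrives. */
void wait_for_call(struct slot *s)
{
    pthread_mutex_lock(&s->lock);
    while (!s->has_work)
        pthread_cond_wait(&s->wake, &s->lock);
    s->has_work = 0;
    pthread_mutex_unlock(&s->lock);
}

The real system described above also uses a timed wait on the return path; the
point of the sketch is only to show which operations the per-call cost is made of.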

To ensure calls make progress against the main thread all threads in the
farm have higher priority.  On many pthreads platforms, Solaris included,
the system has to use a non-realtime scheduling policy because of a lack
of permissions, so on Solaris 2.5/2.6 the scheme is using SCHED_RR.  My
guess is that the scheduler is not prompt in deciding to wake-up a
thread, hence when the calling thread is signalled it isn't woken up
immediately, even though the thread has a higher priority.  One
experiment I've yet to try is putting a thr_yield (as of 2.5
pthread_yield is unimplemented) after the
{pthread_mutex_lock;pthread_cond_signal;pthread_mutex_unlock}.

Although this is all conjecture it does fit with a scheduler that only
makes scheduling decisions occasionally, e.g. at the end of a process's
timeslice.  Anyone have any supporting or contradictory information?


=================================TOP===============================
 Q225:  Strategies for testing multithreaded code?  

Date: Tue, 12 Jan 1999 12:41:51 +0100
Organization: NETLAB plus - the best ISP in Slovakia
 
>Subject says it all: are there any well known or widely used
>methods for ensuring your multithreaded algorithms are threadsafe?
>Any pointers to useful research on the topic?


Let us suppose a program property is an attribute that is true of every
possible history of that program (a history of a program being a concrete
sequence of program states, transformations from one state to another are
carried out by atomic actions performed by one or multiple threads).

Now what about being able to provide a proof that your program has safety
(absence of deadlock, mutual exclusion, ...) and liveness
(partial/complete correctness, ...) properties?

To prove your program has the absence-of-deadlock property, you may define an
invariant DEADLOCK that is true when all (cooperating) threads in your
program block. Then proving your program will not deadlock is very simple -
you need to show that for every critical assertion C in the program proof:
C => not DEADLOCK (C implies that the DEADLOCK invariant is false; in other
words, when the preconditions of program statements are true, they exclude
the possibility of a state where deadlock occurs).

There is an excellent book covering this topic (the above is an awkward
excerpt from it):
Andrews, Gregory R.
"Concurrent Programming, Principles and Practice"
Addison Wesley 1991
ISBN 0-8053-0086-4

Applying propositions and predicates to your program (or rather to its sensitive
multithreaded parts) to assert the preconditions and postconditions required for
atomic-action execution presents a complication, of course. You have to
spend more time annotating your algorithm, developing invariants that
have to be maintained by every statement in the algorithm (and if not, you have to
guard against executing the statement until the invariant is true - and here
you have conditional synchronization :), and proving program properties.

But I think it is worth it. Once you prove your program does not deadlock
using programming logic, you may be sure it will not. So I would suggest you
read the above book (if only to be aware of the techniques described there).
It is more a theoretical discussion, but many very helpful parallel
algorithms are described (and proved) there, starting with the very
classical dining philosophers problem, up to the distributed heartbeat
algorithm, the probe-echo algorithm and a multiple-processor operating
system kernel implementation.

Hope this helps,
Best regards,
        Milan Gardian
 

=================================TOP===============================
 Q226: Threads in multiplatform NT   
Yes,

I have done this.

Jason

"Nilesh M." wrote:
> 
> Can I write threaded programs for Win NT and just recompile for both Alpha
> and i386 without any changes or minor changes?


=================================TOP===============================
 Q227:  Guarantee on condition variable predicate/pthreads?  

Pete Sheridan wrote:

> Thread 2:
>         pthread_mutex_lock(&m;);
>         if (n != 0)
>         {
>                 pthread_cond_wait(&m;, &c;);
>                 assert(n == 0);
>         }
>         pthread_mutex_unlock(&m;);
>
> The idea here is that thread 2 wants to wake up when n is 0.  Is the
> assert() correct?  i.e., will n always be 0 at that point?  When the
> condition is signalled, thread 2 has to reacquire the mutex. Thread 1
> may get the mutex first, however, and increment n before this happens.
> Is this implementation dependent?  Or does thread 2 have to use "while
> (n != 0)" instead of "if (n != 0)"?

The assert() is incorrect. The POSIX standard carefully allows for a
condition wait to return "spuriously". I won't go into all the details,
but allowing spurious wakeups is good for both the implementation and the
application. (You can do a search on Deja News if you really want to know,
because I've explained this several times before; or you can read my book,
about which you may learn through the www.awl.com link below.)

To correct Thread 2, change the "if" into a "while" and move the assertion
out of the loop. But then, it becomes rather trivial. You hold the mutex
and loop until n == 0, so, of course, it will be 0 when the loop
terminates (with the mutex still locked).
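
In other words, Thread 2 becomes something like this, reusing the names from the
question (m, c, and n are assumed to be defined, and n modified only with m held,
elsewhere):

#include <assert.h>
#include <pthread.h>

extern pthread_mutex_t m;
extern pthread_cond_t  c;
extern int n;

void *thread2(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&m);
    while (n != 0)
        pthread_cond_wait(&c, &m);
    assert(n == 0);              /* trivially true: we still hold the mutex */
    pthread_mutex_unlock(&m);
    return NULL;
}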

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/

=================================TOP===============================
 Q228:  Pthread API on NT?   
> I need to port a lot of code to NT that depends on pthreads. Has anyone
> built a pthread library on NT using Win32 threads?
> Scott

sourceware.cygnus.com/pthreads-win32

 
"Darius S. Naqvi" wrote:

> Dave Butenhof  writes:
>
> >
> > Lee Jung Wooc wrote:
> >
> > > Any one please help to redirect signal to the mainthread or
> > > any idea on how to make the signal to process is handled in
> > > main thread context ?
> >
> > As Patrick McPhee has already suggested, I recommend that you stop relying on
> > a signal handler for this. To expand a little on his advice, begin by masking
> > SIGUSR2 in main(), before creating any threads. Then create a special thread
> > that loops on the sigwait() function, waiting for occurrences of SIGUSR2. (If
> > the signal is not masked in ALL threads, then it may "slip through" somewhere
> > while the signal thread is not waiting in sigwait() -- for example, while it's
> > starting, or while it's responding to the previous SIGUSR2.)
> >
>
> Does the signal become pending in the sigwaiter thread in that case?
> To be clear: suppose that a given signal is blocked in all threads,
> and one thread sigwait()'s on it.  Suppose that while the
> sigwait()ing thread is not actually in sigwait(), that signal is sent
> to the process.  Is the signal then pending in the sigwait() thread,
> so that the next call to sigwait() notices the signal?

If *all* threads have the signal blocked, then the signal remains
pending against the process. The next thread that makes
itself able to receive the signal, either by unblocking the
pending signal in its signal mask or by calling sigwait(),
will receive the pending signal.

>
>
> I've been assuming that since a signal is received by only one of the
> threads in which it is not blocked, it is not made pending in the
> blocking threads *if* there exists a thread that is not blocking it.
> In order to not lose any signals, it must then be the case that if
> every thread is blocking a signal, then when a signal is sent to the
> process, it is made pending in *every* thread.  I.e., either one
> thread receives the signal and it is not made pending in any thread,
> or the signal is pending in every thread.  Is this true?  (I don't
> have a copy of the standard, but the book "Pthreads Programming" from
> O'Reilly and Associates is silent on this matter.)

Signals sent to a process never "pend" against a thread. They can
only be pending against the process, meaning, as I explained above,
that any qualified thread can eventually take the signal.

Only per-thread signals, sent via pthread_kill() can be pending
against a thread that has the signal blocked.

Externally, it's not that complicated. Internally, it can get interesting....
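
A condensed sketch of the sigwait() pattern described above, using SIGUSR2 as in
the original question (error handling omitted):

#include <pthread.h>
#include <signal.h>
#include <stdio.h>

static void *signal_thread(void *arg)
{
    sigset_t *set = arg;
    int signo;

    for (;;) {
        sigwait(set, &signo);            /* blocks until SIGUSR2 is pending */
        printf("got signal %d; handle it in normal thread context\n", signo);
    }
    return NULL;
}

int main(void)
{
    sigset_t set;
    pthread_t tid;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR2);
    /* Block it in main BEFORE creating any threads, so every thread inherits
       the mask and the signal can only be consumed via sigwait(). */
    pthread_sigmask(SIG_BLOCK, &set, NULL);

    pthread_create(&tid, NULL, signal_thread, &set);
    /* ... the rest of the application runs here ... */
    pthread_join(tid, NULL);
    return 0;
}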

__________________________________________________
Jeff Denham ([email protected])

Bright Tiger Technologies:  Resource-management software
for building and managing fast, reliable web sites
See us at http://www.brighttiger.com

125 Nagog Park
Acton, MA 01720
Phone: (978) 263-5455 x177
Fax:   (978) 263-5547

  

=================================TOP===============================
 Q229:  Sockets & Java2 Threads  
 
 

Nader Afshar wrote:

> Part of a GateWay I am designing is composed of two threads. One thread
> delivers messages to a server through a socket connection, the other
> thread is always listening on the output-stream of the other server for
> incoming messages.
>
> The problem I'd like to solve is How to stop the second thread. Since
> that thread is blocked listening on the socket connection, I can not use
> the wait() and notify() method to stop it. Furthermore since Thread.stop
> is deprecated in Java2, I seem to be in a quandary!!
>
> Any suggestions, would be most appreciated.
>
> btw. I was thinking of using the socket time-out feature and then after
> checking for some state variable indicating a "disconnect" request,
> going back to listening on the socket again, but this approach just does
> not seem very "clean" to me.
>
> Regards
> Nader

[For Java 2, this works just fine.  See the Java code example 
ServerInterrupt on this web page. -Bil]

Yes, we had the same problem. interrupt() doesn't work reliably if the
thread is blocking on a read from a socket. Setting a variable was also
not very "clean", since you then also have to set a read timeout.

I did it this way: I opened the socket in an upper thread and passed it to
the receiving thread. When I want to stop the thread, I simply close the
socket. This causes the blocking read method to throw an Exception, which
can be caught, so the thread can end in a clean way.
This is also the method suggested by Sun. It seems that there is no
better solution.

greetings
       Charly
 
>This is also the method suggested by Sun. It seems that there is no
>better solution.


Despite being recommended by Sun (where do they recommend this?) it is not
guaranteed to work on all platforms. On some systems closing the Java socket
does not kick the blocked thread off. Such behaviour is not currently
required by the API specs.

David

=================================TOP===============================
 Q230: Emulating process shared threads   

"D. Emilio Grimaldo Tunon" wrote:

>    I was wondering if there is a way to emulate process shared
> mutexes and condition variables when the OS supports Posix
> threads but *not* process shared items? I know I can test
> for _POSIX_THREAD_PROCESS_SHARED, but if it is not defined,
> meaning that THAT is not implemented, then what are my
> alternatives? of course assuming there WILL be two processes
> wanting to share a mutex/condition variable.

Depends on your requirements, and how much work you want to do.

First, you could just use some existing cross-process mechanism to
synchronize. You could use a POSIX (or SysV) semaphore. A message queue.
You could use a pipe -- threads try to acquire the lock by reading, and
"release" the lock by writing (unblocking one reader). You could even
create a file with O_EXCL, and retry periodically until the owner
releases the lock by deleting the file.

You COULD emulate a mutex and condition variable yourself using
(completely nonportable) synchronization instructions, in the shared
memory region, and some arbitrary "blocking primitive" (a semaphore,
reading from a pipe to block and write to unblock, etc.) It can be a lot
of work, but it can be done.

There are a million alternatives. You just need to decide how important
the performance is, and how much time you're willing to spend on it.
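
As one concrete example of the first suggestion, a sketch using a POSIX named
semaphore as a cross-process lock (the name "/mylock" is made up; error handling,
including the SEM_FAILED check, is omitted):

#include <fcntl.h>
#include <semaphore.h>
#include <sys/stat.h>

/* Both processes open the same named semaphore; initial value 1 = "unlocked". */
sem_t *lock_open(void)
{
    return sem_open("/mylock", O_CREAT, S_IRUSR | S_IWUSR, 1);
}

void lock_acquire(sem_t *lock) { sem_wait(lock); }   /* "lock"   */
void lock_release(sem_t *lock) { sem_post(lock); }   /* "unlock" */

When you're done with it for good, one process can remove the name with
sem_unlink("/mylock"). There is no condition-variable equivalent here; emulating
one on top of this is the "lot of work" mentioned below.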

You might also keep in mind that a few systems already support UNIX98
(Single UNIX Specification, Version 2), and the others will as soon as
the usual morass of overloaded and conflicting product requirements
allows. UNIX98 REQUIRES implementation of the "pshared" option.

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/
 
=================================TOP===============================
 Q231: TLS in Win32 using MT run-time in dynamically loaded DLLs?  
 
In article <[email protected]>,
Mike Smith  wrote:
>That's a mouthful!  :-)
>
>Let me try again.  I'm writing a Win32 DLL that will be loaded dynamically
>(i.e. via LoadLibrary()).  This DLL will spawn multiple concurrent instances
>of the same thread, each of which must have some local variables.  I'd
>prefer if possible to use the runtime library _beginthread() or
>_beginthreadex() rather than the Win32 functions (CreateThread() etc.)
>Meanwhile, the docs for LoadLibrary() that come with VC++6 seem to indicate
>that dynamically loaded DLLs cannot have thread-local storage, at least not
>provided by the run-time library.

If you are talking about the Microsoft language extension 

    declspec(thread_local)  // or however you spell it

you should probably not be using it in the first place. It's best to get the
job done using the standard language syntax as much as possible and stay away
from compiler extensions. 

There is an alternative way to manage thread-local storage. Have a look
at TlsAlloc() and friends. This API is a pale imitation of the POSIX
thread-specific keys facility, but it gets the job done.

You CAN use thread-specific storage in DLLs if you use TlsAlloc(). Even though
TlsAlloc() lacks the cleanup facility that its POSIX counterpart offers, if
you are writing your code strictly as a DLL you can hack in your own cleanup
and destruction of thread-specific data, since your DllMain is called each
time a thread is created or destroyed in the process.

>Has anybody run across situation before?  How did you handle it?  I was
>thinking about allocating the worker thread's local storage prior to
>starting the thread, then passing a pointer to the memory in the thread
>function's (void *) parameter.  Better ideas?

I usually do this dynamically. If an object requires a thread-specific pointer
to something, I will create the index (or key, in POSIX terminology) when
that object is constructed. Then as threads use the object, they each
initialize their corresponding slot when they detect that it's null.
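
The POSIX version of that lazy-slot idiom looks roughly like this (names are
illustrative); the Win32 TlsAlloc()/TlsGetValue()/TlsSetValue() version is
structurally the same, minus the automatic destructor:

#include <pthread.h>
#include <stdlib.h>

static pthread_key_t key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

static void make_key(void)
{
    /* free() runs automatically when each thread exits: the cleanup
       facility the Win32 TLS API lacks. */
    pthread_key_create(&key, free);
}

void *get_my_slot(size_t size)
{
    void *p;

    pthread_once(&key_once, make_key);
    p = pthread_getspecific(key);
    if (p == NULL) {                      /* first use in this thread */
        p = calloc(1, size);
        pthread_setspecific(key, p);
    }
    return p;
}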


=================================TOP===============================
 Q232:  Multithreaded quicksort  

Gordon Mueller  wrote in message
<[email protected]>...
>
> I'm looking for a multi-threaded/parallel implementation
> (source or detailed description) of the famous quicksort
> algorithm. I'd like to specify a maximum number of k (moderate)
> processors/threads and I'm looking for linear speed-up, of course.


Have a look at Chap. 20 in my book, "C Interfaces and Implementations:
Techniques for Creating Reusable Software "(Addison-Wesley Professional
Computing Series, 1997, ISBN 0-201-49841-3); there's a multi-threaded
implementation of quicksort in Sec. 20.2.1. The source code is available on
line; see http://www.cs.princeton.edu/software/cii/.

dave hanson

        ================ 

     Parallel quicksort doesn't work all that well; I believe the
speedup is limited to something like 5 or 6 regardless of the number
of processors.  You should be able to find a variety of parallel
sorting algorithms using your favourite search engine.  One you may
want to look at is PSRS (Parallel Sorting by Regular Sampling), which
works well on a variety of parallel architectures and isn't really
difficult conceptually.  You can find some papers describing it at

          http://www.cs.ualberta.ca/~jonathan/Papers/par.1993.html

          http://www.cs.ualberta.ca/~jonathan/Papers/par.1992.html

Steve
--
--
Steve MacDonald, Ph.D. Candidate  | Department of Computing Science
[email protected]             | University of Alberta
http://www.cs.ualberta.ca/~stevem | Edmonton, Alberta, CANADA  T6G 2H1
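
For what it's worth, here is a bare-bones sketch of the usual depth-limited trick:
recurse on one half in a new thread and on the other half in the current thread,
and stop spawning below a depth limit. It is purely an illustration, not the code
from either reference above.

#include <pthread.h>

struct job { int *a; int lo, hi, depth; };

static void swap_ints(int *a, int i, int j)
{
    int t = a[i]; a[i] = a[j]; a[j] = t;
}

static int partition(int *a, int lo, int hi)
{
    int pivot = a[hi], i = lo, j;

    for (j = lo; j < hi; j++)
        if (a[j] < pivot)
            swap_ints(a, i++, j);
    swap_ints(a, i, hi);
    return i;
}

static void *qsort_job(void *arg);

static void par_qsort(int *a, int lo, int hi, int depth)
{
    while (lo < hi) {
        int p = partition(a, lo, hi);

        if (depth > 0) {
            /* Left half in a helper thread, right half in this thread. */
            pthread_t tid;
            struct job j;

            j.a = a; j.lo = lo; j.hi = p - 1; j.depth = depth - 1;
            pthread_create(&tid, NULL, qsort_job, &j);
            par_qsort(a, p + 1, hi, depth - 1);
            pthread_join(tid, NULL);       /* 'j' stays valid until here */
            return;
        }
        /* Below the depth limit: plain sequential quicksort. */
        par_qsort(a, lo, p - 1, 0);
        lo = p + 1;
    }
}

static void *qsort_job(void *arg)
{
    struct job *j = arg;

    par_qsort(j->a, j->lo, j->hi, j->depth);
    return NULL;
}

Calling par_qsort(a, 0, n - 1, 3) uses at most about eight threads at any one
time; as noted above, don't expect the speedup to keep growing with the number of
processors.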
 
=================================TOP===============================
 Q233: When to unlock for using pthread_cond_signal()?  

POSIX specifically allows that a condition variable may be signalled or
broadcast with the associated mutex either locked or unlocked. (Or even
locked by someone else.) It simply doesn't matter. At least, signalling
while not holding the mutex doesn't make the program in any way illegal.

A condition variable is just a communication mechanism to inform waiters of
changes in shared data "predicate" conditions. The predicate itself IS
shared data, and must be changed in a way that's thread-safe. In most cases,
this means that you must hold the mutex when you change the data. (But you
could also have a predicate like "read() returns data", so that you could
write data, signal the condition variable -- and the waiter(s) would simply
loop on the condition wait until read() returns some data.)

The signal doesn't need to be synchronized with the predicate value. What
you DO need to synchronize is SETTING the predicate and TESTING the
predicate. Given that basic and obvious requirement (it's shared data, after
all), the condition variable wait protocol (testing the predicate in a loop,
and holding the mutex until the thread is blocked on the condition variable)
removes any chance of a dangerous race.

However, your scheduling behavior may be "more predictable" if you signal a
condition variable while holding the mutex. That may reduce some of the
causes of "spurious wakeups", by ensuring that the waiter has a slightly
better chance to get onto the mutex waiter list before you release the
mutex. (That may reduce the chance that some other thread will get the
mutex, and access to the predicate, first... though there are no
guarantees.)

(There's a lot more about this in my book, information on which can be found
through the www.awl.com link in my signature way down there at the bottom.)
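
Condensed into a sketch (the names are made up): what must happen under the mutex
is setting and testing the predicate; the signal itself may come before or after
the unlock.

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int ready;                     /* the predicate, protected by 'lock' */

void producer(void)
{
    pthread_mutex_lock(&lock);
    ready = 1;                        /* SETTING the predicate: mutex required  */
    pthread_mutex_unlock(&lock);
    pthread_cond_signal(&cond);       /* signalling: legal with mutex unlocked  */
    /* Signalling before the unlock is equally legal, and may make scheduling
       slightly more predictable, as described above. */
}

void consumer(void)
{
    pthread_mutex_lock(&lock);
    while (!ready)                    /* TESTING the predicate: mutex required  */
        pthread_cond_wait(&cond, &lock);
    /* ... use the protected data ... */
    pthread_mutex_unlock(&lock);
}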

> You see, pthread_cond_signal has no effect if nobody is actually waiting
> on the condition variable. There is no ``memory'' inside a condition variable
> that keeps track of whether the variable has been signalled. Signalling
> a condition variable is like shouting. If nobody is around to hear the
> scream, nothing happens.
>
> If you don't hold the lock, your signal could be delivered just after another
> thread has decided that it must wait, but just before it has actually
> performed the wait. In this case, the signal is lost and the thread will wait
> for a signal that never comes.

This would be true if you failed to hold the lock when SETTING the
predicate. But that has nothing to do with SIGNALLING the condition
variable. Either the predicate is changed before the waiter's predicate
test, or it cannot be changed until after the waiter is "on" the condition
variable, in position to be awakened by a future signal or broadcast.

You are correct that signalling (or broadcasting) a condition variable with
no waiters "does nothing". That's good -- there's nothing FOR it to do.

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/


=================================TOP===============================
 Q234: Multi-Read One-Write Locking problem on NT  

Alok Tyagi wrote:

> We are encountering a problem with MRSW (Multi-Read Single Write) or
> SWMR/MROW Locks on Windows NT :-
>
> We have our own MRSW functionality implemented using multiple semaphores.
> We are experiencing a problem when a
> process holding a shared lock dies ungracefully and consequently, no other
> processes requesting the exclusive access succeed until the MRSW resource is
> removed and re-created. On Unix platforms, the OS SEM_UNDO mechanism can be
> used. Are you aware of any solution to this problem on NT?
>
> TIA,
>
> --alok

 
Hi,

It turns out that ntdll.dll provides undocumented MRSW support which you might
find of interest.  There is an article on it in the Jan. 1999 edition of the
Windows Developer's Journal (www.wdj.com).  I have not used it myself but it
looks interesting. If you understandably feel a bit shaky about using an
undocumented Microsoft feature, the article provides an insight into how the
MRSW lock is implemented.

Hope this is of help.

Kevin

 
=================================TOP===============================
 Q235:  Thread-safe version of flex scanner   

In article <[email protected]>,
Donald A. Thompson  wrote:
% I am looking for a version of the flex program that produces a thread-safe
% lex.yy.c.

The version on this system (2.5) has a -+ option, which produces a C++
scanner class which is said to be re-entrant.

%  Alternatively, I'd like some tips on how to make the lex.yy.c
% thread-safe.

You need to re-implement the input routines, and change the interface to
yylex so that things like state variables, yyleng and yytext, and many
of those other globals are passed on the stack.  You don't have to worry
about the character class tables, since they're read-only, but pretty
much everything else needs to be put through the call stack. You then need
to create a skeleton file with your changes and tell flex to use it
instead of it's default one.

This is a big job, so you might think about either using the scanner
class from the -+ option, or having only one thread run the scanner,
and perhaps generate a byte-code which can be run by other threads.
--

Patrick TJ McPhee
East York  Canada
[email protected]

 

=================================TOP===============================
 Q236: POSIX standards, names, etc  


Jason L Reisman   wrote:
>Hi,
>I am new to the world of POSIX and am interested in finding out all I
>can before starting to code.  
>I have a few questions regarding the standard.  Any help would be
>greatly appreciated.
>
>(1) When looking up information on POSIX, I found POSIX.1, POSIX.4, etc.
> What do the numbers mean?  Are they indexes to different libraries or
>differt versions?

Lessee...  This is complex, due to the history of the thing.

POSIX.1 is really POSIX 1003.1, which is *the* POSIX standard (i.e. for
Portable Operating System Interfaces).  POSIX 1003.1 comes in several
flavors, which are dated.  The original is 1003.1-1990.  The realtime
interface, which was known during its development as 1003.4, and
then 1003.1b were combined in to 1003.1 and the resulting spec was
1003.1-1994.  Then the threads interface, which was known during development
as 1003.4a was renamed to 1003.1c, and then combined (with a technical
corrigenda to .1b) with 1003.1-1994 to produce 1003.1-1996.

And yes, it's ugly.   Here's a lousy attempt at a picture.  Time increases
from left to right.  If you're viewing this in something that doesn't display
news articles in a fixed-pitch font, it won't make sense.

   1003.4 --+------ 1003.4a ---+
            |                  |
            +- 1003.1b- +      +- 1003.1c -+
                        |                  |
1003.1 -----------------+-- 1003.1 ----+---+-- 1003.1 --- . . . (+.1a? etc)
 1990                        1994      |        1996
                    1003.1i -----------+
                (technical corrections to .1b)

1003.1 is the base.
1003.4 was "realtime extensions", and originally included threads.  Threads
  were broken out to smooth the merges.
1003.1b is the realtime API amendment to 1003.1
1003.1c is the threads API amendment to 1003.1
1003.1a contains the amendments for symbolic links, coming very soon.

And the lettering indicates only when the projects were started, nothing
more.

>(2) Do POSIX sockets exist?  A better way to say this is there a
>standard interface (either created or supported by POSIX) to open and
>maintain a socket?

There is (yet another) set of amendments to 1003.1, known as 1003.1g, for
this.  I haven't looked at the drafts to see what the interface looks
like, though.

>(3) How compatible are pthreads between NT and Solaris (or any flavor of
>UNIX for that matter)?

If you have an NT pthreads implementation, I would hope that they're quite
similar.  Note that POSIX makes no requirements that threads be preemptive,
unless certain scheduling options are supported, and the application
requests them.  This is commonly known as "the Solaris problem."

>(3) Are there any recommended books for POSIX beginners (who already
>know how to program)?

Dave Butenhof's book, Programming with POSIX Threads, ISBN 0-201-63392-2,
is quite good.  In fact, I'd call it excellent, and that's not said lightly.
-- 
Steve Watt KD6GGD  PP-ASEL-IA              ICBM: 121W 56' 58.1" / 37N 20' 14.2"
 Internet: steve @ Watt.COM                             Whois: SW32
   Free time?  There's no such thing.  It just comes in varying prices... 

 
=================================TOP===============================
 Q237: Passing ownership of a mutex?  

[See the code example for FIFO Mutexes on this web page.
They *may* do what you want.  -Bil]
 
"Fred A. Kulack" wrote:

> You can't portably unlock a mutex in one thread that was locked by another
> thread.

Fred's absolutely correct, but, as this is a common problem, I'd like to
expand and stress this warning.

The principal attribute of a mutex is "exclusive ownership". A locked mutex
is OWNED by the thread that locked it. To attempt to unlock that mutex from
another thread is not merely "nonportable" -- it is a severe violation of
POSIX semantics. Even if it appears to work on some platforms, it does not
work. You may not be getting correct memory visibility, for example, on SMP
RISC systems.

While POSIX does not require that any implementation maintain the ID of the
owning thread, any implementation may do so, and may check for and report the
optional POSIX errors for illegal use of the mutex. The Single UNIX
Specification, Version 2 (SUSV2, or UNIX98), adds support for various "types"
of mutexes, and even POSIX provides for optional thread priority protection
through mutex use. Most of these enhanced mutex attributes require that the
implementation keep track of mutex ownership, and most implementations that
track ownership will report violations.

A program that attempts to lock a mutex in one thread and unlock it in
another is incorrect, not just "potentially nonportable". It probably doesn't
work even where the implementation fails to warn you of your error. Don't
even think about doing that to yourself!

If you really want to "hand off" a lock from one thread to another, you don't
want to use a mutex. (This is almost always a sign that there's something
wrong with your design, by the way, but to all rules there are exceptions.)
Instead, use a binary semaphore. A semaphore doesn't have any concept of
"ownership", and it's legal to "lock" a semaphore (e.g., with the POSIX
operation sem_wait()) in one thread and then "unlock" the semaphore
(sem_post()) in another thread. Legal... but is it correct? Well, remember
that the synchronization operations are also memory visibility points. If you
lock a semaphore in one thread, and then modify shared data,  you need to be
sure that something else (other than the unlock you're not doing) will ensure
that the data is visible to another thread. For example, if you lock a
semaphore, modify shared data, and then create another thread that will "own"
the semaphore, the visibility rules ensure that the created thread will have
coherent memory with its creator; and the semaphore unlock will pass on that
coherency to the next thread to lock the semaphore.

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/

=================================TOP===============================
 Q238:   NT fibers  


=================================TOP===============================
 Q239:  Linux (v.2.0.29 ? Caldera Base)/Threads/KDE   

jimmy wrote:

> I have Caldera linux (version, if I remember it right 2.0.29, or close
> to that.) My questions: can I use threads with it? (It didn't come with
> the package, that's for sure.) If so, where would I get it, and what
> else do I need in order to start using them? Is there any problem with
> using KDE with threads? Finally, I read somewhere that g++ doesn't like
> threads--is that right? Am I limited to C if I use threads?

You are certainly not going to be able to use POSIX threads with Caldera.
The "linuxthreads" package comes with the glibc library.  Caldera is still
libc5 based.

That having been said, there is no reason why you can't use threads with
KDE.  Do you plan on *writing* KDE applications?  If so, you need to
determine whether or not Qt is thread-safe.

Lastly, there is no limitation on using only C when using POSIX threads.
In fact there is already a small C++ threads package called TThreads which
can provide some easier interface to the threading library.  (I think
they've implimented it wrong, but it's still quite usable.)


Paul Braman
[email protected]

 =================================TOP===============================
 Q240: How to implement user space cooperative multithreading?  

 wrote:
>Thanks for the help!
>
>1. My goal is to find a way to implement user space cooperative
>multithreading (the reason is that it should work with a hardware description
>language, which is event driven and basically serial). The attached file
>context.c shows my basic understanding. Main() calls ping() and pong(), and
>these two functions have private stacks so they can run independently. My
>question is how to keep those stacks (they are on the heap, not on the
>process's stack, so they're not under the control of the OS kernel) from
>overflowing.

Make the stacks sufficiently large, and watch your use of auto variables and
recursion. You can also probably roll some special assertions that can be put
at the start of a function that will go off when you are getting too close to
collision. Write a function which uses some tricks to return the remaining
stack space (for example, take the address of a local variable and then
subtract the stack top, or vice versa, depending on whether stacks grow up or
down). Assert if the stack space is down to a few K.
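
A rough sketch of such a check (nonportable by nature; it assumes you recorded the
base address when you carved out the stack, and that stacks grow downward):

#include <assert.h>
#include <stddef.h>

/* 'stack_base' is the lowest address of the heap block being used as the
   current thread's stack, recorded when that stack was allocated. */
static size_t stack_headroom(const char *stack_base)
{
    char probe;                             /* lives near the current stack top */
    return (size_t)(&probe - stack_base);   /* approximate bytes still unused   */
}

/* Drop this at the top of deep or recursive functions. */
#define CHECK_STACK(base)  assert(stack_headroom(base) > 4096)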

Or you could allocate stacks in their own memory mapped regions which
dynamically allocate pages as they are accessed, and therefore can be large
without occupying physical memory until it is actually needed.

>2. I don't understand how pre-emptive multithreading is implemented,
>especially the implementation of pure user space multithreading. I understand
>preemptive multitasking at the process level -- the kernel scheduler does the job.

This requires some support from the operating system or from the bare hardware.
I don't know about Win32. In UNIX, you can do it using alarm signals as the
basis for pre-emption. Win32 doesn't implement signals properly; Microsoft
has just paid some token attention to them, because the <signal.h> header and
the signal() function are part of ANSI C.

In UNIX, signal delivery does most of the work already for implementing
pre-emptive multi-tasking within a single process. The signal mechanism already
takes care of saving the machine context. To do the remainder of a context
switch in a signal handler, all you really have to do is switch the procedure
context only---usually just the stack pointer. So you need some simple
context-switching function which takes pointers to two stack context areas.
It saves the current stack context into one area, and restores a previously
stored context from the other area. (The other area is chosen by your scheduler
routine, of course, because it determines which thread will run).

The pre-emptive context switch essentially takes place when a signal handler
occurs in one thread, saves its stack context information, restores the
stack context info of another thread and executes a return. Later on,
the thread's turn will come up again, and it will return from the same
signal handler which pre-empted it earlier.

The detailed machine context is actually stored in the signal stack; when you
resume a thread by having it execute the return from the signal handler, the
precise state of the CPU registers is restored.

Also, you still have to take care of voluntary context switches, which don't go
through the signal mechanism. The entry point to the voluntary reschedule
context switch has to ensure that it saves at least those registers that are
designated by the compiler's conventions as being ``callee saved''.  When that
thread is later restarted, it will execute a return from that voluntary switch
function, and any registers that the compiler expects not to be clobbered must
be restored. These could just be saved on the stack, or they could be in the
same context area where the aforementioned stack context is put.

(It's best to not cut corners and write an earnest routine that performs a
complete machine context switch, even though this information is already in the
signal stack, and even though some registers may be clobbered during a function
call.)

The context switch looks like this (in pseudocode) and must be written
in assembly language, since C doesn't give you access to registers:

    void context_switch(void *old_context, void *new_context)
    {
        /* save all registers into the old (outgoing) context */
        /* restore registers from the new (incoming) context */
    }

The registers do not include the instruction pointer, because that is
stored on the stack of the context_switch() function. When the stack pointer
is switched, the function will return in a stack frame different from the
one it was called from, and the return address in this stack frame will
restore the correct IP.

On some RISC platforms, this stuff is tricky to do! Not all modern computing
platforms give you the neat layout where you have simple stack frames and CPU
registers. For example, above, I have assumed that the function return address
is in the stack frame. In a RISC system, the return address might actually be
in a register. And what about the SPARC architecture with register windows that
supply much of the functionality of a stack frame?

Therefore, take this advice: even if you write your own user-level threading
package, save yourself the headache and steal someone else's context switch
code. It's easy enough to write for a things like 68000 or 80x86, but forget
about SPARC or PA-RISC. :)
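
One "someone else's context switch" you can often borrow on UNIX is the ucontext
family (getcontext/makecontext/swapcontext). A toy ping/pong along the lines of
the attached context.c might look like this (illustrative only, no error
handling, and the 64 KB stack size is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t ctx_main, ctx_ping, ctx_pong;

static void ping(void)
{
    int i;
    for (i = 0; i < 3; i++) {
        printf("ping %d\n", i);
        swapcontext(&ctx_ping, &ctx_pong);   /* yield to pong */
    }
}

static void pong(void)
{
    int i;
    for (i = 0; i < 3; i++) {
        printf("pong %d\n", i);
        swapcontext(&ctx_pong, &ctx_ping);   /* yield back to ping */
    }
}

int main(void)
{
    getcontext(&ctx_ping);
    ctx_ping.uc_stack.ss_sp = malloc(STACK_SIZE);
    ctx_ping.uc_stack.ss_size = STACK_SIZE;
    ctx_ping.uc_link = &ctx_main;            /* where to go when ping returns */
    makecontext(&ctx_ping, ping, 0);

    getcontext(&ctx_pong);
    ctx_pong.uc_stack.ss_sp = malloc(STACK_SIZE);
    ctx_pong.uc_stack.ss_size = STACK_SIZE;
    ctx_pong.uc_link = &ctx_main;
    makecontext(&ctx_pong, pong, 0);

    swapcontext(&ctx_main, &ctx_ping);       /* start the ping/pong chain */
    printf("back in main\n");
    return 0;
}

Unlike setjmp()/longjmp(), these calls are defined to switch complete machine
contexts, including the stack, so the auto-variable problem described below does
not arise.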

>??#@!, I don't have a clue, help!
>
>Jian
>
>#include <stdio.h>
>#include <setjmp.h>
>#include <stdlib.h>
>
>#define  STACK_SIZE 4096
>
>int       max_iteration;
>int       iter;
>
>jmp_buf   jmp_main;
>jmp_buf   jmp_ping;
>jmp_buf   jmp_pong;

These three buffers clearly form an extremely primitive process table.  In the
general case, you need a whole array of these, one for each thread.

>void  ping(void) {
>  int i = 0;
>  int j = 1000;
>
>  if (setjmp(jmp_ping) == 0)
>    longjmp(jmp_main, 1);

Right, here you are setting up the ping ``thread'' by saving its initial
context. Later, you will resume the thread by doing a longjmp() to this
context.

In a real threading package, you would have some code for creating a thread
which would allocate a free context, and then ``prime'' it so that the thread
is started as soon as the scheduler chooses it. This priming is often a hack
which tries to set up the thread context so that it looks like the thread
existed before and voluntarily rescheduled. In other words, you might have to
write things into the thread's stack which will fool context_switch() into
executing a ``fake'' return to the entry point of the thread!

I remember with fondness writing this type of fun code. :)

One fine afternoon I wrote a tiny pre-emptive multi-tasking kernel on a PC
using the DEBUG.COM's built in interactive assembler as my only development
tool. The whole thing occupied only 129 bytes. That same day, I wrote some
sample ``applications'' which animated things on the screen, as well as a KILL
command to terminate threads. The scheduling policy was round-robin with no
priorities. There was only one system call interrupt, ``reschedule'', which was
really just a faked clock interrupt to invoke the scheduler. :) The KILL
command worked by locating the kernel in memory by scanning through the
interrupt chain, moving the last process in the table over top of the one being
deleted and decrementing the process count---all with interrupts disabled, of
course. The argument to KILL was the process ID, which was just its position in
the table. This feature made it fun to guess which number to use to kill a
particular process, since the relocated process inherited the ID of the killed
process. :)

It was the day after exams and I suddenly had nothing to do, you see, and was
eager to combine the knowledge gleaned from a third-year operating systems
class with obfuscated programming. :)

>  while (1) {
>    i += 2;
>    j += 2;

Oops! The variables i and j will no longer have their correct values
after you come back here from the longjmp. That's because setjmp
and longjmp don't (necessarily) save enough context information to
preserve the values of auto variables.

(Say, did you try running it after compiling with lots of optimizations turned
on?)

Declaring i and j volatile might help here, because that should force the
compiler to generate code which re-loads them from memory. Of course, the ANSI
C definition of volatile is a little wishy washy. :)

In un-optimized code, volatile is often redundant, because variables are not
aggressively assigned to registers, which explains why code like this often
works until a release version of the program is built with optimizations.

Or, by chance, the i and j registers may have been assigned to those ``caller
saved'' registers that got implicitly saved and restored in the call to
setjmp(). On 80x86 trash, there is such a dearth of registers that many
compilers mandate that most of the registers must not be clobbered by a
function. With GCC, I think that only EAX and EDX may be clobbered,
though I don't recall exactly.  This allows the generated code to keep
as many temporary values in registers as is reasonably possible,
at the cost of forcing called code to save and restore.

What you really need is to forget setjmp() and longjmp() and get a real context
switch routine. It shouldn't be hard to roll your own on Intel.  (If you aren't
using floating point math, you can get away with saving just the integer
registers, so you don't have to mess with that awful floating point ``stack''
thing that some junkies at Intel dreamed up during a shared hallucination.)

Anyway, I've ranted long enough about things that are probably of no interest
to anyone, so good night.


=================================TOP===============================
 Q241: Tools for Java Programming   



In article <[email protected]>,
    Bil Lewis  writes:
> I'm in the midst of finishing "Multithreaded Programming
> with Java" and am working on a short section covering which
> companies have interesting tools for debugging, testing,
> analyzing, etc. MT programs.
> 
> Obviously, I am covering Sun's JWS, Numega's JCheck, and
> Symantec's Java Cafe.  Are the other products that I should
> know about?
> 
> -Bil

Bil - Parasoft has a Java analyzer.  I haven't used it, but if it's as 
good as their C version (Insure++), it's probably worth writing about.  I 
think they have a "free" trial period too.   Look at www.parasoft.com for 
more information.

Sue Gleeson

>    >What I was wondering was if there was a tool (a lint sorta thing)
>    >available that would go through code and flag trouble spots, like global
>    >data usage, and static local data, etc.  I of course don't expect the
> 
>    That tool is your brain! If we are talking about C, you can look for
>    static data by using grep to look for the string ``static''.  That
>    is easy enough.

Unfortunately, "brains" are notoriously poor at analyzing concurrency,
and this is exactly the kind of problem that automated analysis and
testing tools are likely to do better than people.  Not as a
substitute for reasoning, of course, but as a significant aid (just like lint).

I'm aware of at least two commercial tools that test for race
conditions and/or deadlocks:  

  AssureJ for Java (http://www.kai.com/assurej/) 
  Lock Lint for C (http://www.sun.com/workshop/threads/doc/locklint.ps)

My understanding is that there will soon be other tools in this space
as well.

>    >tool to fix any of the problems, nor really even know for sure when a
>    >problem is a problem, but just flag that there may be a problem.
> 
>    It's hard enough to automate the correctness verification of ordinary
>    single-threaded logic. The halting problem tells us that this isn't even
>    possible in the general case.

Luckily, a tool doesn't have to verify the correctness of software to be
useful.  For example, you can just check that the observed locking pattern
is consistent with a "safe" locking discipline (e.g. a particular
piece of shared data is always protected by the same lock).  Some folks at
DEC SRC (now Compaq SRC) and I built a tool like this that
was extremely effective at finding race conditions.

See http://www.cs.washington.edu/homes/savage/papers/sosp97.ps for
details.

- Stefan
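
For the curious, here is a minimal sketch of the lockset idea described
above (an illustration only, not the cited tool): each shared variable
keeps the intersection of the lock sets held at its accesses, and an empty
intersection is reported as a possible race.  Lock sets are bitmasks and
all names are made up.

#include <stdio.h>

typedef unsigned long lockset_t;           /* bit i set => lock i held */

struct shadow {
    const char *name;
    lockset_t   candidates;                /* locks that protected every access so far */
    int         accessed;
};

static void on_access(struct shadow *s, lockset_t held)
{
    if (!s->accessed) {
        s->candidates = held;              /* first access: all held locks are candidates */
        s->accessed = 1;
    } else {
        s->candidates &= held;             /* refine by intersection */
    }
    if (s->candidates == 0)
        fprintf(stderr, "warning: %s not consistently protected by any lock\n", s->name);
}

int main(void)
{
    struct shadow counter = { "counter", 0, 0 };

    on_access(&counter, 1UL << 0);         /* accessed with lock 0 held */
    on_access(&counter, 1UL << 0);         /* again with lock 0: fine */
    on_access(&counter, 1UL << 1);         /* only lock 1 held: intersection empty */
    return 0;
}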

=================================TOP===============================
 Q242:  Solaris 2.6, pthread_cond_timedwait() wakes up early  
 
This may not answer the question, but it could solve the problem !

You can change the timer resolution in Solaris 2.6 and 2.7 by putting this
in /etc/system and rebooting.

set hires_tick = 1

This sets the system hz value to 1000.

Mark
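
Independent of the tick resolution, a portable defensive pattern is to
treat ETIMEDOUT as a hint and re-check the clock, waiting again on the same
absolute deadline if the wakeup came early.  A minimal sketch (error
handling omitted; the condition variable and mutex are assumed to be
initialized elsewhere):

#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Wait on cv/mx until *deadline has really passed (or the cv is signalled). */
int wait_until(pthread_cond_t *cv, pthread_mutex_t *mx,
               const struct timespec *deadline)
{
    struct timespec now;
    int rc;

    for (;;) {
        rc = pthread_cond_timedwait(cv, mx, deadline);
        if (rc != ETIMEDOUT)
            return rc;                   /* signalled (0) or a real error */
        clock_gettime(CLOCK_REALTIME, &now);
        if (now.tv_sec > deadline->tv_sec ||
            (now.tv_sec == deadline->tv_sec &&
             now.tv_nsec >= deadline->tv_nsec))
            return ETIMEDOUT;            /* the deadline has genuinely passed */
        /* woke up early: wait again on the same absolute deadline */
    }
}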


John Garate wrote in message <[email protected]>...
>For PTHREAD_PROCESS_SHARED condition variables, pthread_cond_timedwait()
>timeouts
>occur up to about 20ms prior to the requested timeout time (sample code
>below).  I wasn't
>expecting this.  I realize clock ticks are at 10ms intervals, but I
>expected my timeout to occur at
>the soonest tick AFTER my requested timeout, not before.  Were my
>expectations out of line?
>
>cc -mt -lposix4 testwait.c
>
>/* testwait.c */
>#define _POSIX_C_SOURCE 199506L
>#include 
>#include 
>#include 
>#include 
>
>pthread_cond_t cv;
>pthread_mutex_t mutex;
>
>int main(int argc, char *argv[]) {
>  pthread_condattr_t cattr;
>  pthread_mutexattr_t mattr;
>  timespec_t  ts_now, ts_then;
>  int   timed_out;
>  int   rc;
>
>  /* condition variable: wait awakes early if PROCESS_SHARED */
>  if(pthread_condattr_init(&cattr)) exit(-1);
>  if(pthread_condattr_setpshared(&cattr, PTHREAD_PROCESS_SHARED))
>exit(-1);
>  if(pthread_cond_init(&cv, &cattr)) exit(-1);
>  if(pthread_condattr_destroy(&cattr)) exit(-1);
>
>  /* mutex: doesn't matter whether PROCESS_SHARED or not (only cv
>matters) */
>  if(pthread_mutexattr_init(&mattr)) exit(-1);
>  if(pthread_mutexattr_setpshared(&mattr, PTHREAD_PROCESS_SHARED))
>exit(-1);
>  if(pthread_mutex_init(&mutex, &mattr)) exit(-1);
>  if(pthread_mutexattr_destroy(&mattr)) exit(-1);
>
>  /* calculate future timestamp */
>  clock_gettime(CLOCK_REALTIME,&ts_then);
>  ts_then.tv_sec+=1;
>
>  /* wait for that time */
>  timed_out = 0;
>  if(pthread_mutex_lock(&mutex)) exit(-1);
>
>  while(!timed_out) {
>    rc = pthread_cond_timedwait( &cv, &mutex, &ts_then );
>    clock_gettime(CLOCK_REALTIME,&ts_now);
>
>    switch(rc) {
>    case 0:
>      printf("spurious, in my case\n");
>      break;
>    case ETIMEDOUT:
>      timed_out=1;
>      break;
>    default:
>      printf("pthread_cond_timedwait failed, rc=%d\n",rc);
>      exit(-1);
>    } /* switch */
>  } /* while (!timed_out) */
>
>  pthread_mutex_unlock(&mutex);
>
>  /* did we wake-up before we wanted to? */
>  if (ts_now.tv_sec < ts_then.tv_sec ||
>     (ts_now.tv_sec == ts_then.tv_sec &&
>      ts_now.tv_nsec < ts_then.tv_nsec)) {
>    printf("ts_now  %10ld.%09ld\n", ts_now.tv_sec, ts_now.tv_nsec);
>    printf("ts_then %10ld.%09ld\n", ts_then.tv_sec, ts_then.tv_nsec);
>  }
>  return(0);
>} /* main */
>
>

=================================TOP===============================
 Q243:  AIX4.3 and PTHREAD problem  
 
In article <[email protected]>,
Red Hat Linux User   wrote:

%     After I sent this, I was talking to an IBM'er who was trying to convince me
% to upgrade
% our machines to AIX 4.3.  He mentioned that 4.3 provides POSIX support at
% level(?) 7 and
% 4.2 at level 5.  He said that he had to change some of his code because there

Find him, and kick him.

AIX 4.1 and 4.2 provide two thread libraries: one roughly implements a draft (7
if you must know) of the posix standard, and one implements DCE threads.

AIX 4.3 implements the posix standard, and provides backwards compatibility
with the other two libraries. There are slight changes from the draft support
available in the earlier releases. If significant code changes were needed
to compile on 4.3, the original code was probably written for DCE threads.

It's not too difficult to keep track of this, and if you're selling the
stuff you really have an obligation to at least try. Kick him hard.

You _can_ run programs compiled on 4.1 without change on a 4.3 machine.

--

Patrick TJ McPhee
East York  Canada

=================================TOP===============================
 Q244: Readers-Writers Lock source for pthreads  

In the hope someone may find it useful, here's an implementation of a
readers-writers lock for PThreads. In this implementation writers are given
priority. Compile with RWLOCK_DEBUG defined for verbose debugging output to
stderr. This output can help track:

 1. Mismatches (eg rwlock_ReadLock(); ... rwlock_WriteUnlock();)
 2. Recursive locks (eg rwlock_ReadLock(); ... rwlock_ReadLock();)

Amongst other things. The debugging output also includes the line numbers of
where the lock was obtained (and released) for greater usefulness. The
debugging mode has been implemented using thread specific data.

Anyway, here's the source:

/* START: rwlock.h */
#ifndef __RWLOCK_H__
#define __RWLOCK_H__

/*
 * $Id: rwlock.h,v 1.8 1999/02/27 14:19:35 lk Exp $
 *
 * Copyright (C) 1998-99 Lee Kindness 
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 */

#include <pthread.h>

struct rwlock;
typedef struct rwlock *rwlock_t;

#define RWLOCK_DEBUG 2

void     rwlock_Init(rwlock_t rwl);
rwlock_t rwlock_InitFull(void);
void     rwlock_Destroy(rwlock_t rwl, int full);
#ifdef RWLOCK_DEBUG
void     rwlock_ReadLockD(rwlock_t rwl, char *f, int l);
void     rwlock_ReadUnlockD(rwlock_t rwl, char *f, int l);
void     rwlock_WriteLockD(rwlock_t rwl, char *f, int l);
void     rwlock_WriteUnlockD(rwlock_t rwl, char *f, int l);
# define rwlock_ReadLock(R) rwlock_ReadLockD(R, __FILE__, __LINE__)
# define rwlock_ReadUnlock(R) rwlock_ReadUnlockD(R, __FILE__, __LINE__)
# define rwlock_WriteLock(R) rwlock_WriteLockD(R, __FILE__, __LINE__)
# define rwlock_WriteUnlock(R) rwlock_WriteUnlockD(R, __FILE__, __LINE__)
#else
void     rwlock_ReadLock(rwlock_t rwl);
void     rwlock_ReadUnlock(rwlock_t rwl);
void     rwlock_WriteLock(rwlock_t rwl);
void     rwlock_WriteUnlock(rwlock_t rwl);
#endif

#endif /* __RWLOCK_H__ */
/* END: rwlock.h */

/* START rwlock.c */
/*
 * $Id: rwlock.c,v 1.9 1999/02/27 14:19:35 lk Exp $
 *
 * Routines to implement a read-write lock. Multiple readers or one writer
 * can hold the lock at once. Writers are given priority over readers.
 * When compiled with RWLOCK_DEBUG defined verbose debugging output is
 * produced which can help track problems such as mismatches and
 * recursive locks.
 *
 * Copyright (C) 1998-99 Lee Kindness 
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 */

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#include "rwlock.h"

struct rwlock
{
#ifdef RWLOCK_DEBUG
    pthread_key_t   key;
#endif
    pthread_mutex_t lock;
    pthread_cond_t  rcond;
    pthread_cond_t  wcond;
    int             lock_count;
    int             waiting_writers;
};

#ifdef RWLOCK_DEBUG
struct LockPos
{
    int           type;
    char         *file;
    int           line;
    pthread_key_t key;
};

static void rwlocki_WarnNoFree(void *arg);
#endif

static void rwlocki_WaitingReaderCleanup(void *arg);
static void rwlocki_WaitingWriterCleanup(void *arg);

/*
 * rwlock_InitFull()
 *
 * Allocate the memory for, and initialise, a read-write lock.
 */
rwlock_t rwlock_InitFull(void)
{
    rwlock_t ret;

    if( (ret = calloc(sizeof(struct rwlock), 1)) )
 rwlock_Init(ret);

    return( ret );
}

/*
 * rwlock_Init()
 *
 * Initialise a static, or otherwise allocated, read-write lock.
 */
void rwlock_Init(rwlock_t rwl)
{
#ifdef RWLOCK_DEBUG
    pthread_key_create(&rwl->key, rwlocki_WarnNoFree);
#endif
    pthread_mutex_init(&rwl->lock, NULL);
    pthread_cond_init(&rwl->wcond, NULL);
    pthread_cond_init(&rwl->rcond, NULL);
    rwl->lock_count = 0;
    rwl->waiting_writers = 0;
}

/*
 * rwlock_Destroy()
 *
 * Free all memory associated with the read-write lock.
 */
void rwlock_Destroy(rwlock_t rwl, int full)
{
#ifdef RWLOCK_DEBUG
    pthread_key_delete(rwl->key);
#endif
    pthread_mutex_destroy(&rwl->lock);
    pthread_cond_destroy(&rwl->wcond);
    pthread_cond_destroy(&rwl->rcond);
    if( full )
 free(rwl);
}

/*
 * rwlock_ReadLock()
 *
 * Obtain a read lock.
 */
#ifdef RWLOCK_DEBUG
void rwlock_ReadLockD(rwlock_t rwl, char *f, int l)
{
    struct LockPos *d;
    if( (d = (struct LockPos *)pthread_getspecific(rwl->key)) )
 {
     fprintf(stderr, "RWL %p %s:%d already has %s lock from %s:%d\n",
      rwl, f, l, d->type ? "write" : "read", d->file, d->line);
     /* but we'll carry on anyway, and muck everything up... */
 }
    if( (d = malloc(sizeof(struct LockPos))) )
 {
     /* init the TSD */
     d->type = 0; /* read */
     d->file = f;
     d->line = l;
     d->key  = rwl->key;
     /* and set it */
     pthread_setspecific(rwl->key, d);
#if RWLOCK_DEBUG == 2
     fprintf(stderr, "RWL %p %s:%d read lock pre\n", rwl, f, l);
#endif
 }
    else
 fprintf(stderr, "RWL %p %s:%d cannot alloc memory!\n", rwl, f, l);
#else
void rwlock_ReadLock(rwlock_t rwl)
{
#endif
    pthread_mutex_lock(&rwl->lock);
    pthread_cleanup_push(rwlocki_WaitingReaderCleanup, rwl);
    /* writer priority: wait while a writer holds the lock OR writers are waiting */
    while( (rwl->lock_count < 0) || (rwl->waiting_writers) )
 pthread_cond_wait(&rwl->rcond, &rwl->lock);
    rwl->lock_count++;
    /* Note that the pthread_cleanup_pop subroutine will
     * execute the rwlocki_WaitingReaderCleanup routine */
    pthread_cleanup_pop(1);
#ifdef RWLOCK_DEBUG
    fprintf(stderr, "RWL %p %s:%d read lock\n", rwl, f, l);
#endif
}

/*
 * rwlock_ReadUnlock()
 *
 * Release a read lock
 */
#ifdef RWLOCK_DEBUG
void rwlock_ReadUnlockD(rwlock_t rwl, char *f, int l)
{
    struct LockPos *d;
#else
void rwlock_ReadUnlock(rwlock_t rwl)
{
#endif
    pthread_mutex_lock(&rwl->lock);
    rwl->lock_count--;
    if( !rwl->lock_count )
 pthread_cond_signal(&rwl->wcond);
    pthread_mutex_unlock(&rwl->lock);
#ifdef RWLOCK_DEBUG
    if( (d = pthread_getspecific(rwl->key)) )
 {
     if( d->type == 0 )
  fprintf(stderr, "RWL %p %s:%d read unlock at %s:%d\n", rwl,
   d->file, d->line, f, l);
     else
  fprintf(stderr, "RWL %p %s:%d mismatch unlock %s:%d\n", rwl,
   d->file, d->line, f, l);
     free(d);
     pthread_setspecific(rwl->key, NULL);
 }
    else
 fprintf(stderr, "RWL %p %s:%d read unlock with no lock!\n", rwl, f, l);
#endif
}

/*
 * rwlock_WriteLock()
 *
 * Obtain a write lock
 */
#ifdef RWLOCK_DEBUG
void rwlock_WriteLockD(rwlock_t rwl, char *f, int l)
{
    struct LockPos *d;
    if( (d = (struct LockPos *)pthread_getspecific(rwl->key)) )
 {
     fprintf(stderr, "RWL %p %s:%d already has %s lock from %s:%d\n",
      rwl, f, l, d->type ? "write" : "read", d->file, d->line);
     /* but we'll carry on anyway, and muck everything up... */
 }
    if( (d = malloc(sizeof(struct LockPos))) )
 {
     /* init the TSD */
     d->type = 1; /* write */
     d->file = f;
     d->line = l;
     d->key  = rwl->key;
     /* and set it */
     pthread_setspecific(rwl->key, d);
#if RWLOCK_DEBUG == 2
     fprintf(stderr, "RWL %p %s:%d write lock pre\n", rwl, f, l);
#endif
 }
    else
 fprintf(stderr, "RWL %p %s:%d cannot alloc memory!\n", rwl, f, l);
#else
void rwlock_WriteLock(rwlock_t rwl)
{
#endif
    pthread_mutex_lock(&rwl->lock);
    rwl->waiting_writers++;
    pthread_cleanup_push(rwlocki_WaitingWriterCleanup, rwl);
    while( rwl->lock_count )
 pthread_cond_wait(&rwl->wcond, &rwl->lock);
    rwl->lock_count = -1;
    /* Note that the pthread_cleanup_pop subroutine will
     * execute the rwlocki_WaitingWriterCleanup routine */
    pthread_cleanup_pop(1);
#ifdef RWLOCK_DEBUG
    fprintf(stderr, "RWL %p %s:%d write lock\n", rwl, f, l);
#endif
}

/*
 * rwlock_WriteUnlock()
 *
 * Release a write lock
 */
#ifdef RWLOCK_DEBUG
void rwlock_WriteUnlockD(rwlock_t rwl, char *f, int l)
{
    struct LockPos *d;
#else
void rwlock_WriteUnlock(rwlock_t rwl)
{
#endif
    pthread_mutex_lock(&rwl->lock);
    rwl->lock_count = 0;
    if( !rwl->waiting_writers )
 pthread_cond_broadcast(&rwl->rcond);
    else
 pthread_cond_signal(&rwl->wcond);
    pthread_mutex_unlock(&rwl->lock);
#ifdef RWLOCK_DEBUG
    if( (d = pthread_getspecific(rwl->key)) )
 {
     if( d->type == 1 )
  fprintf(stderr, "RWL %p %s:%d write unlock at %s:%d\n", rwl,
   d->file, d->line, f, l);
     else
  fprintf(stderr, "RWL %p %s:%d mismatch unlock %s:%d\n", rwl,
   d->file, d->line, f, l);
     free(d);
     pthread_setspecific(rwl->key, NULL);
 }
    else
 fprintf(stderr, "RWL %p %s:%d write unlock with no lock!\n",rwl, f, l);
#endif
}

static void rwlocki_WaitingReaderCleanup(void *arg)
{
    rwlock_t rwl;

    rwl = (rwlock_t)arg;
    pthread_mutex_unlock(&rwl->lock);
}

static void rwlocki_WaitingWriterCleanup(void *arg)
{
    rwlock_t rwl;

    rwl = (rwlock_t)arg;
    rwl->waiting_writers--;
    if( (!rwl->waiting_writers) && (rwl->lock_count >= 0) )
 /*  This only happens if we have been cancelled: wake any readers
  *  that were held back by our waiting_writers count */
 pthread_cond_broadcast(&rwl->rcond);
    pthread_mutex_unlock(&rwl->lock);
}

#ifdef RWLOCK_DEBUG
static void rwlocki_WarnNoFree(void *arg)
{
    struct LockPos *d = (struct LockPos *)arg;

    fprintf(stderr, "RWL 0 %s:%d exit during lock-unlock pair\n",
     d->file, d->line);
    free(d);
    pthread_setspecific(d->key, NULL);
}
#endif
/* END rwlock.c */
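
For illustration only, here is a minimal usage sketch of the lock above,
protecting a hypothetical shared value (thread creation and error handling
omitted):

#include <stdio.h>
#include "rwlock.h"

static rwlock_t table_lock;              /* hypothetical shared value + its lock */
static int      table_value;

void *reader_thread(void *arg)
{
    rwlock_ReadLock(table_lock);
    printf("read %d\n", table_value);
    rwlock_ReadUnlock(table_lock);
    return arg;
}

void *writer_thread(void *arg)
{
    rwlock_WriteLock(table_lock);
    table_value++;
    rwlock_WriteUnlock(table_lock);
    return arg;
}

int main(void)
{
    table_lock = rwlock_InitFull();
    /* ... pthread_create() readers and writers here, then join them ... */
    rwlock_Destroy(table_lock, 1);
    return 0;
}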

=================================TOP===============================
 Q245: Signal handlers in threads   

In article <[email protected]>,

Thank you for posting an answer !

This is pretty tricky ... I'll give it a try ....


  Jeff Denham  wrote:
> Yes -- as I said recently in a post regarding a similar
> question about sigwait() -- only the faulting threads can
> catch its own synchronous signals/exceptions.
>
> You don't have to do the work strictly in a signal
> handler, though. If you have a stack-based exception
> handling package available to you, such as the
> try/catch model in C++,  you can handle the
> synchronous exceptions in the exception handler.
> This model essentially unwinds the
> stack at the point the signal is caught
> by a special handler and delivers it back
> (close to) the orignal context and outside
> of the signal-handler state. At this point,
> you're at "thread level" and can
> pretend you just returned from
> a call to sigwait() ;^)
>
> (If I'm being overoptimistic about
> actually being at thread level in
> the catch() clause, someone please
> correct me.)
>
> Here's a little example that catches
> a SIGILL instruction on Solaris,
> built using their V4.2 C++ compiler
> and runtime:
>
> #include 
> #include 
> #include 
> #include 
>
> int junk = -1;
>
> class SigIll
> {
> public:
>         SigIll(void) {};
> };
>
> void ill(int sig)
> {
>         throw SigIll();
> }
>
> main()
> {
>         typedef void (*func)(int);
>         func f, savedisp;
>
>         savedisp = signal(SIGILL, ill);
>         try {
>                 cout << "Issue illegal instruction...\n" << endl;
>                 f = (func)&junk;
>                 (*f)(1);
>         }
>         catch (SigIll &si) {
>                 cout << "Exception!!!" << endl;
>         }
>         cout << "Survived!\n" << endl;
>         (void) signal(SIGILL, savedisp);
> }
>
> I'm hardly an expert with the exception stuff, so hopefully
> Kaz and the gang will correct/fill-in for me.
>
> -- Jeff
> __________________________________________________
> Jeff Denham ([email protected])
>
> Bright Tiger Technologies:  Resource-management software
> for building and managing fast, reliable web sites
> See us at http://www.brighttiger.com
>
> 125 Nagog Park
> Acton, MA 01720
> Phone: (978) 263-5455 x177
> Fax:   (978) 263-5547
=================================TOP===============================
 Q246: Can a non-volatile C++ object be safely shared amongst POSIX threads?  
 
In message <[email protected]>, "David Holmes"
 wrote:

>I tend to agree with Kaz - I'm unconvinced that there is some global law of
>compilation that takes care of this. Whilst simple compilers would not
>optimise across function calls because of the unknown affects of the
>function, smarter compilers employing data flow analysis techniques,
>whole-program optimisation etc may indeed make such optimisations - after
>all pthread_mutex_lock() does not access the shared data and the compiler
>(without thinking about threads) may assume that the shared data is thus
>unaffected by the call and can be cached.

Please see ISO/IEC 9899-1990, section 5.1.2.3, example 1.

>Now maybe all that means is that smart compilers have to be thread-aware and
>somehow identify the use of locks and thereby imply that data protected by
>locks is shared and mustn't be optimised.

There's no way a C/C++ compiler can know what data are "protected by
locks" - there's no such thing as "locks" in either language.

> But do the compiler writers know
>this? I think perhaps the use of simple compilers allows us to currently get
>away with this.

Would you care to name a couple of "simple" compilers?

Anyway, you can take my word for it - compiler writers are usually smart
enough to know they are doing compilers for a potentially multi-threaded
environment. At least I know that the gcc/egcs and SunSoft folks are.

>David
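
A minimal sketch of the point being argued here: because the compiler cannot
see into the pthread calls, it has to assume they may read or write any
object visible to other translation units, so the shared value is reloaded
and stored around the critical section (cf. the cited ISO C passage on calls
to unknown functions).  Names below are made up for illustration.

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
int shared_counter;                      /* external linkage: any external call
                                          * might reference it */

void bump(void)
{
    pthread_mutex_lock(&lock);           /* opaque call: cached copies discarded */
    shared_counter++;
    pthread_mutex_unlock(&lock);         /* opaque call: the store must be visible */
}

int read_counter(void)
{
    int v;

    pthread_mutex_lock(&lock);
    v = shared_counter;                  /* reloaded, not assumed to sit in a register */
    pthread_mutex_unlock(&lock);
    return v;
}
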
=================================TOP===============================
 Q247:  Single UNIX Specification V2  

A web reference you may find useful is
http://www.unix-systems.org/single_unix_specification_v2/xsh/threads.html

This contains an overview of POSIX Threads (as contained in the
Single UNIX Specification V2) and links to all the pthreads functions.

You can even download a copy of the specification from
that site (see http://www.unix-systems.org/go/unix )


=================================TOP===============================
 Q248: Semantics of cancelled I/O (cf: Java)  
 
David Holmes wrote:

> In Java there is currently a problem with what is termed interruptible I/O.
> The idea is that all potentially blocking operations should be interruptible
> so that the thread does not block forever if something goes wrong. The idea
> is sound enough (though timeouts would allow an alternative solution).
> However Java VM's do not actually implement interruptible I/O except in a
> very few cases. Discussion on the Javasoft BugParade indicates that whilst
> unblocking the thread is doable on most systems, actually cancelling the I/O
> request is not - consequently the state of the I/O stream is left
> indeterminate as far as the application is concerned
>
> This leads me to wonder how POSIX defines the semantics of cancellation when
> the cancellation point is an I/O operation. Does POSIX actually specify what
> the affects of cancellation are on the underlying I/O stream (device,
> handle, whatever) or does it simply dictate that such operations must at
> some stage check the cancellation status of the current thread?
>
> Thanks.

POSIX hasn't a lot to say about the details of cancelled
I/O.  It has required and optional cancellation points.
Most, if not all, the required points are traditional
blocking system calls. Most of the optional ones
are library I/O routines. From my kernel and
library experience, it's a lot easier to cancel the
system calls than the library calls, because the
library calls can hold internal I/O mutexes (yikes)
across system calls. If that system call is canceled,
the locks must be released. That means the library
has to have cleanup handlers in stdio and elsewhere
-- doable but potentially costly in implementation
and performance. At Digital, last I knew, we were
planning to support the optional points in the
future (past V5.0?). Don't know the current status.
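
The same issue shows up in application code: if a thread holding a resource
can be cancelled at a blocking call such as read(), it needs a cleanup
handler of its own.  A minimal sketch (the buffer and its mutex are made-up
names):

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t buf_lock = PTHREAD_MUTEX_INITIALIZER;
static char buf[4096];

static void unlock_buf(void *arg)
{
    pthread_mutex_unlock((pthread_mutex_t *)arg);
}

ssize_t guarded_read(int fd)
{
    ssize_t n;

    pthread_mutex_lock(&buf_lock);
    pthread_cleanup_push(unlock_buf, &buf_lock);
    n = read(fd, buf, sizeof buf);       /* cancellation point */
    pthread_cleanup_pop(1);              /* unlock whether cancelled or not */
    return n;
}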

In practice, the semantics of syscall cancellation are pretty
much those of signals (and in a number of implementations I
know of, pthread_cancel() involves some kind of
specialized signal).  In other words, if you're blocked in a
read() system call, and a SIGXXX arrives, you'll be broken
out of the sleep, and, if a signal handler is present, take
the signal. If SA_RESTART is not on for the signal, the
read() call returns with status -1/errno EINTR. The outcome
of the I/O operation is undefined. In the case of cancellation,
the error return path from the system call is redirected to
a special cancellation handler in the threads library, which
starts the process of propagating the cancel down the calling
thread's stack.

When I implemented system call cancellation on Digital
UNIX, I followed this signal model, which applies only
to *interruptible* sleeps in the kernel. If there's actual
physical I/O in progress, the blocking in the kernel will be
*uninterruptible*. This is the case when a physio()
operation is in progress, meaning that the I/O buffer
is temporarily wired into memory and that the thread
calling read() cannot leave the kernel until the I/O
completes and the memory is unwired. In these cases,
the cancellation pends, just like a blocked signal, until
the read() thread is about to exit the kernel, at which point
the pending cancel is noticed and raised in the usual
way.

So, in the case of both an EINTR return and a cancel,
the calling thread never has a chance to examine the
outcome of the I/O operation. For a cancellation,
the I/O may be complete, but the canceled thread
will never see that fact directly, because its stack
will be unwound by the cancellation handler
past the point where the read() was called.

I'm not sure whether this ramble is at all on point
for you... There's probably nothing here you don't already
know, but maybe there's a few useful hints.
The bottom line is that most OSs offer very
little in the way of canceling I/O that has already
been launched. If you look at the AIO section
of POSIX.1b, specifically at aio_cancel(),
you'll notice that the implementation is
not required to do anything in response
to the cancellation request. The only real
requirement that I recall is to return
an AIO_CANCELED status on successful
cancellation. But you may never get that
back. (On Digital UNIX, you can cancel
AIO that's still queued to libaio, but
for kernel based AIO, you'll never
successfully cancel -- the request
is gone into the bowels of the I/O
subsystem.)

So, FWIW, sounds to me like you should map this
Java I/O cancel thing right onto pthread
cancellation...

-- Jeff
__________________________________________________
Jeff Denham ([email protected])

Bright Tiger Technologies:  Resource-management software
for building and managing fast, reliable web sites
See us at http://www.brighttiger.com

125 Nagog Park
Acton, MA 01720
Phone: (978) 263-5455 x177
Fax:   (978) 263-5547

 
Jeff Denham wrote in message <[email protected]>...
>So, FWIW, sounds to me like you should map this
>Java I/O cancel thing right onto pthread cancellation...


Thanks Jeff. You seemed to confirm basically what I thought.

With the java situation there are problems both with implementing
interruptions on different platforms and establishing what the semantics of
interruptions are and how they can be used. Perhaps part of the problem is
that in Java they have to both deal with the semantics at the lowest level
of the API's (similar to the level POSIX works at) and at a higher level
too. I was just curious how POSIX dealt with the issue - maybe the Java folk
are worrying too much. FYI here's a snip from the relevant bug parade entry
(4154947):

Besides the above implementation issues, we also need to consider the usage
of interruptable semantics. Considering when one user (Java) thread need to
wake up another thread, (let me name it "Foo") which is blocked on the
DataInputStream, which wraps SocketInputStream which wraps recv(). When the
interrupt exception is thrown, the exception will be propagated all the way
up to the user level. However the state of DataInputStream,
SocketInputStream, recv() are possibly in unknown state. If the user ever
want to resume the io operation later, he may get unknown data from stream,
and get totally lost. So Foo has to remember to close the stream if he get
interrupted. But in this way, the usability of interruptable is largely
lost. It is much like the close() semantics of windows. When I use grep to
search the entire build tree, the IOException appear at about 1600 places.
There are 67 places catch IOException, but only 9 places catch
InterruptedIOException in PrintStream
and PrintWriter class. Generally, the InterruptedIOException is considered
as IOException, treated as fatal error. Making InterruptedIOException to
have resumption semantics will be extremely difficult on any platform, and
will be
against the semantics of Java language exception. But if we choose
termination semantics, the interruptable io is very similar to the close()
semantics.


Thanks again,
David

 =================================TOP===============================
 Q249: Advice on using multithreading in C++?  

On Tue, 30 Mar 1999 09:55:30 +0100, Ian Collins  wrote:
>Paul Black wrote:
>> 
>> Does anyone have any advice on using multithreading in C++? Searching around,
>> I've noticed a book "OO multithreading in C++". The book seemed to get a
>> mixed reaction on the online bookstores, is it a recommended read? Are there
>> any other books or resources to be recommended?
>> 
>A few guides:
>
>Use a static member function as the thread run function, pass it 'this'
>in pthread_create and cast the thread void* back to the class to use it.
>
>Make sure you understand the relationship between key data and class
>data.
>
>Take care with class destruction and thread termination.  I tend to use 
>joinable threads, so the class destructor can join with the thread.

This is not good. By the time you are in the destructor, the object should no
longer be shared; the threads should be already joined. When the destructor is
executing, the object is no longer considered to be a complete object.

It's not that calling the join operations is bad; what's bad is that there are
still threads running. A particularly bad thing is to be in the destructor of
the base class sub-object, with threads still roaming over the derived object!

>Make sure your thread does not run before the containing class is
>constructed!  This can cause wierd problems on MP systems.

Actually it can cause weird problems in non-MP systems too. It is simply
verboten to use an object before it is constructed.  Therefore, it's
a bad idea to launch the internal threads of active objects from within
the constructors of those objects. Such threads may be scheduled to run before
construction completes, which may happen in non-MP systems too.

The best practice with respect to destruction and construction is this:
an object may be shared by multiple threads only after construction
completes and before destruction commences.

One way to do this is to write your active objects such that they have a
Start() method that is not called from the constructor, and a Join() method
that is separate from the destructor. The caller who created the object and
called its constructor calls Start() immediately after construction, or perhaps
after synchronizing.  The Join() method simply joins on all of the threads
associated with the active object.  Usually, I also implement a method called
Cancel() which triggers shutdown of all the threads. 
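
The same discipline carries over to plain C with pthreads: fully construct
the object, then start its thread, and join before freeing.  A minimal
sketch with made-up names (a real version would protect the shutdown flag
with a mutex or similar):

#include <pthread.h>
#include <stdlib.h>

struct active {
    pthread_t tid;
    int       shutdown;                   /* set by active_cancel() */
    /* ... other state, fully initialized before active_start() ... */
};

static void *active_main(void *arg)
{
    struct active *a = arg;

    while (!a->shutdown) {
        /* ... do the object's work ... */
    }
    return NULL;
}

struct active *active_create(void)        /* construction only: no thread yet */
{
    return calloc(1, sizeof(struct active));
}

int active_start(struct active *a)        /* called after construction completes */
{
    return pthread_create(&a->tid, NULL, active_main, a);
}

void active_cancel(struct active *a)      /* ask the thread to shut down */
{
    a->shutdown = 1;
}

void active_join(struct active *a)        /* wait for the thread to finish */
{
    pthread_join(a->tid, NULL);
}

void active_destroy(struct active *a)     /* only after active_join() */
{
    free(a);
}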

Having a separate Start() method is useful not only from a safety point of
view, but it has practical uses as well. Here is an example.

In one project I'm working on, I have a protocol driver object which has two
sub-objects: a protocol object, and a device driver object.  Both of these
invoke callbacks in the driver, which sometimes passes control back---for
example, the device driver object may hit a callback that passes received data,
which is then shunted to the protocol object, which then may invoke a callback
again to pass up processed data.

The protocol object doesn't have any threads, but it does register a timer,
which is practically as good as having a thread. The driver has two threads,
for handling input and output.

If I registered the timer immediately after constructing the protocol object,
and started the I/O threads immediately after constructing the driver, it would
be a very bad thing, because either object might start hitting callbacks, which
would end up inside the other object before it is constructed.

Because I have separate start-up methods, I can implement a construction phase
that builds both objects, and then a start phase which starts their threads or
timers. 

Similarly, when I'm shutting down the arrangement, it would be terrible to stop
the threads of the driver and destroy the driver, because the protocol timer is
still running! Having Cancel() and Join() separate from destruction lets me
kill all of the timer and thread activities for both objects, and then release
the memory.


=================================TOP===============================
 Q250:  Semaphores on Solaris 7 with GCC 2.8.1   
 

I am writing a multiprocess application that will utilize a circular
buffer in a shared memory segment.  I am using two semaphores to
represent the # of available slots, and the # of slots to consume by
the server (consumer).

The apps follow this simple model.

Producer:
    decrement the available_slots semaphore
    do something...
    increment the to_consume semaphore.

Consumer:
    decrement the to_consume semaphore.
    do something...
    increment the available_slots semaphore.

The problem is that when I run my test programs and watch the semaphore
values, I see the available_slots semaphore continually increase.  The
program will run for a while if I remove the last increment in the
consumer program, but will eventually fail with errno 34, Result too
large.  Studying the output, it does not appear to me that the value of
the two semaphores ever reaches a critical point.

This simple example has been almost copied line for line from two
different books on this subject, both yielding the same results.  I have
included the source to both of my test apps.  If anyone can see, or
knows of something that I am just overlooking, I would very much like to
hear from you.

Thanks
Nicholas Twerdochlib

Platform info:
    Sun Sparc 20 dual ROSS 125Mhz CPUs 64MB RAM
    Solaris 7/2.7
    GCC v2.8.1

Server/consumer source:
*****************************************************************
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

union semun {
 int val;
 struct semid_ds *buf;
 ushort *array;
};

static ushort start_val[2] = {6,0};

union semun arg;

struct sembuf acquire = {0, -1, SEM_UNDO};
struct sembuf release = {0, 1, SEM_UNDO};

int main( void ) {
  int semid;
  key_t SemKey = ftok( "/tmp/loggerd.sem", 'S' );

  if( (semid = semget( SemKey, 2, IPC_CREAT|0666 )) != -1 ) {
    arg.array = start_val;
    if( semctl( semid, 0, SETALL, arg ) < 0 ) {
      printf( "Failed to set semaphore initial states.\n" );
      perror( "SEMCTL: " );

      return -1;
    }
  }

  while( 1 ) {
    printf( "A Ready to consume: SEM %d Value: %d\n", 0, semctl(semid,
0, GETVAL, 0) );
    printf( "A Ready to consume: SEM %d Value: %d\n", 1, semctl(semid,
1, GETVAL, 0) );

    acquire.sem_num = 1;
    if( semop( semid, &acquire, 1 ) == -1 ) {
      perror( "server:main: acquire: " );
      exit( 2 );
    }

    printf( "B Ready to consume: SEM %d Value: %d\n", 0, semctl(semid,
0, GETVAL, 0) );
    printf( "B Ready to consume: SEM %d Value: %d\n", 1, semctl(semid,
1, GETVAL, 0) );
    /*
    release.sem_num = 0;
    if( semop( semid, &release, 1 ) == -1 ) {
      perror( "server:main: release: " );
      exit( 2 );
    }
    */
  }
}
**************************************************************************

Client/producer source
**************************************************************************

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

union semun {
 int val;
 struct semid_ds *buf;
 ushort *array;
};

static ushort start_val[2] = {6,0};

union semun arg;

struct sembuf acquire = {0, -1, SEM_UNDO};
struct sembuf release = {0, 1, SEM_UNDO};

int main( void ) {
  int semid;
  key_t SemKey = ftok( "/tmp/loggerd.sem", 'S' );

  if( (semid = semget( SemKey, 2, 0)) == -1 ) {
    perror( "client:main: semget: " );
    exit( 2 );
  }

  printf( "A Ready to consume: SEM %d Value: %d\n", 0, semctl(semid, 0,
GETVAL, 0) );
  printf( "A Ready to consume: SEM %d Value: %d\n", 1, semctl(semid, 1,
GETVAL, 0) );

  acquire.sem_num = 0;
  if( semop( semid, &acquire, 1 ) == -1 ) {
    perror( "client:main: release: " );
    exit( 2 );
  }

  printf( "B Ready to consume: SEM %d Value: %d\n", 0, semctl(semid, 0,
GETVAL, 0) );
  printf( "B Ready to consume: SEM %d Value: %d\n", 1, semctl(semid, 1,
GETVAL, 0) );

  release.sem_num = 1;
  if( semop( semid, &release, 1 ) == -1 ) {
    perror( "client:main: acquire: " );
    exit( 2 );
  }
}
 

>buffer in a shared memory segment.  I am using two semaphores to
>represent the # of available slots, and the # of slots to consume by
>the server (consumer).

>The apps follow this simple model.

>Producer:
>    decrement the available_slots semaphore
>    do something...
>    increment the to_consume semaphore.

>Consumer:
>    decrement the to_consume semaphore.
>    do something...
>    increment the available_slots semaphore.

>struct sembuf acquire = {0, -1, SEM_UNDO};
>struct sembuf release = {0, 1, SEM_UNDO};


The error is quite simple; you shouldn't specify SEM_UNDO for
semaphores that are not incremented and decremented by the same process.

SEM_UNDO should be used for a single process that increments and
decrements the semaphore.  When the process is killed, the net
effect of the process on the semaphore will be NIL because of the
adjust value.

With SEM_UNDO, each decrement in the producer will cause the "semadj"
value associated with the "available_slots" semaphore to be increased
by one.  When the producer exits, the semaphore will be incremented by N,
not what you want in this case.

Solaris also puts a bound on the semadj value; there is no good reason for
this bound, except that it catches programming errors like yours.

Casper
--
Expressed in this posting are my opinions.  They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.


=================================TOP===============================
 Q251:  Draft-4 condition variables (HELP)   

"D. Emilio Grimaldo Tunon" wrote:

>      Could anybody comment on the condition variable differences
> between the latest Posix standard (draft 10?) and the old
> draft 4 (DCE threads?) found in HP-UX 10.20?

There's no "draft 10". There was, once, a draft of the document that
would become the basis of the POSIX 1003.1-1996 standard, that was
labelled draft 10. That document is not the same as POSIX 1003.1-1996,
and the differences are more than a matter of "formalese". Some problems
were found during the editing to integrate the draft 10 text into
1003.1b-1993. In addition, the 1003.1i-1995 (corrections to the realtime
specification, some of which overlapped 1003.1c text) were integrated at
the same time. The standard is POSIX 1003.1-1996. There is no draft 10.

Also, the implementation of "draft 4" that you'll find in HP-UX 10.20
isn't really draft 4. It was a very loose adaptation of most of the text
of the draft, plus a number of extensions and other changes. I prefer to
call it "DCE threads" to make it clear that it's a distinct entity.

Now. There are no differences in condition variables from DCE threads to
POSIX threads. However, many of the names were changed "clean up" the
namespace and better reflect various persons' opinions regarding exactly
what the interfaces ought to be assumed to do.

One of the differences, stemming from the draft 5 addition of static
initialization of synchronization objects, is that they are now
"initialized" (i.e., assumed to be pre-existing storage of unspecified
value) rather than "created" (where the pthread_cond_t type, et al, were
assumed to be pointers or "handles" to dynamically created storage).

> In particular I have run into the 'problem' that neither
> pthread_condattr_init() nor pthread_mutexattr_init() seem
> to be present, I did find:

If you're moving between POSIX threads and DCE threads, you've got many
worse problems. While much of the interface appears similar, every
function (except pthread_self()) has changed in at least one
incompatible way. Be very, very careful about such a move! Do not
consider it "a weird variety of POSIX threads". It's not. It's "DCE
threads", as different a beast from POSIX threads as is UI threads. Many
of the names are similar, and they do something that's often even more
similar -- but porting requires a lot more thought than you might infer
from those similarities. (For example, DCE threads report errors by
setting errno and returning -1, while POSIX threads takes the much more
reasonable and efficient approach of returning an error code directly.)

HP-UX 10.30 (or, more realistically, 11.0) has POSIX threads. Your best
option is to ignore HP-UX 10.20 entirely and require 11.0. But, if you
can't or won't do that, be really careful, and assume nothing.

/---------------------------[ Dave Butenhof ]--------------------------\
| Compaq Computer Corporation                     [email protected] |
| 110 Spit Brook Rd ZKO2-3/Q18       http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698  http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/

Try this document.

Porting DCE Threads Programs to HP-UX 11.0 POSIX Threads 
http://info.fc.hp.com/hpux/content/d8.html

You will also find the following book useful.

Threadtime by Scott Norton and Mark Dipasquale. HP Press, Prentice Hall.
ISBN 0-13-190067-6

Discusses programming using POSIX threads in general
and also about HP-UX specific features.

Vijay
 

=================================TOP===============================
 Q252:  gdb + linuxthreads + kernel 2.2.x = fixed :)   

After two solid days of differential testing, I found the problem that was
preventing me from debugging threads under gdb.  It isn't kernel version
related, but it is rather strange so I thought I would share it for the
common curiosity...

It appears that if you are trying to debug a program that links to
libpthread.so, and the symbols for that library are not loaded, the
debugging doesn't work.  In my case, I was doing a "set auto-solib-add 0",
to avoid wading through all the libc and other system library stuff, and/or
getting messages from ddd about no source files, ending up in "space" etc.
Apparently, because the symbols for libpthread weren't loaded, the debugging
was not working properly.

Doing a manual load on the library using "sharedlibrary libpthread" solves
the problem.  Threads are then detected and debuggable (?!).

Does anyone know if this behavior is "by design" or "by accident" ?

Thank you very much to the people who responded to my original post.

regards,

Paul Archard
-------------
parch      // get rid of the
@          // comments to
workfire // reveal the
.com      // email address!


 

On Thu, 08 Apr 1999 19:55:24 GMT, Paul Archard  wrote:
>Doing a manual load on the library using "sharedlibrary libpthread" solves
>the problem.  Threads are then detected and debuggable (?!).
>
>Does anyone know if this behavior is "by design" or "by accident" ?

It's probably by design. The GDB patch adds LinuxThreads debugging ability by
making GDB peek at internal LinuxThreads data structures. Indeed, LinuxThreads
itself had to be modified to allow the hack to work by providing some extra
debugging info.

Presumably, without the symbols, GDB can't find the addresses of LinuxThreads
objects that it needs to access. 

=================================TOP===============================
 Q253: Real-time input thread question  

On Mon, 12 Apr 1999 13:54:28 GMT, JFCyr  wrote:
>We want our input thread to read some device values at an exact frequency.
>What is the best way? 

Depending on the frequency, you may need a hard real-time kernel which can
schedule your thread to execute periodically with great accuracy. In such
operating systems, the kernel is typically preemptible and takes care not to
disable interrupts for long periods of time.

>- A loop (within the input thread) working on the exact frequency with an 
>RTDSC test.

In POSIX threads, you could use pthread_cond_timedwait to suspend the thread
until a specified time. However, the accuracy of this will be restricted by the
degree to which your OS supports real-time processing. 

>- A WM_TIMER message to the primary window thread

Under windows? You can't get anything better than 10 ms resolution, if that,
and it's not real time by any measure. Too many ``guru meditations''.  If the
frequency is faster than, say, 20-30 Hz, forget it.  On Intel machines, the
Windows clock tick is 100Hz; even though interfaces like WaitForSingleObject()
and Sleep() have parameters expressed in milliseconds, the granularity of any
timed wait is ten milliseconds.  The Win32 API sucks for programming timers,
too.  The various sleep and wait functions express their timeout parameter as a
displacement from the current time rather than as an absolute wake time.  Also,
there is no signal handling; you can't install a periodic interrupt timer.
What you can do is sleep for shorter periods and poll the tick count in
between sleeps.
 
Another thing you could do, in principle, is write a device driver that
performs the data acquisition at the specified time intervals. Inside the
driver, chances are that you have access to more accurate timing, since you are
in the kernel; also, faster access to the device. Thus you can approximate
real-time processing. The driver can queue the results for collection by the
application, which can take its sweet time.
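
To make the pthread_cond_timedwait() suggestion above concrete, here is a
minimal sketch of a periodic loop that steps an absolute deadline each
cycle, so scheduling jitter does not accumulate as drift (sample_device()
and the 10 ms period are made up for illustration):

#include <errno.h>
#include <pthread.h>
#include <time.h>

#define PERIOD_NS 10000000L                 /* 10 ms */

static pthread_mutex_t tick_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  tick_cv   = PTHREAD_COND_INITIALIZER;

extern void sample_device(void);            /* hypothetical data-acquisition call */

void *sampler(void *arg)
{
    struct timespec next;

    clock_gettime(CLOCK_REALTIME, &next);
    pthread_mutex_lock(&tick_lock);
    for (;;) {
        next.tv_nsec += PERIOD_NS;          /* advance the absolute deadline */
        if (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec++;
        }
        /* nobody signals tick_cv, so this returns at (roughly) the deadline */
        while (pthread_cond_timedwait(&tick_cv, &tick_lock, &next) != ETIMEDOUT)
            ;
        sample_device();
    }
    return arg;                             /* not reached */
}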

=================================TOP===============================
 Q254: How does Solaris implement nice()?  
> Kelvin,
> 
>   Thanks!  That helps.  One related question: How does NICE work?
> I mean if it just raises/lowers the LWP's priority level once, then
> after a couple of quantum it would become meaningless.
> 
> -Bil

Nice is there to maintain the compatibility of Solaris with the
older Unix and it works in a funny way.

First, when a user sets a nice value, a priority value is calculated
based on this nice value using some formula. This priority value
is then passed on to the kernel using priocntl, which becomes the
user portion (ts_upri) of the LWP priority. The priority table that I talked
about in my message contributes the system portion (ts_cpupri) of the LWP priority.
Therefore, we have

pri = ts_cpupri  + ts_upri + ts_boost

ts_boost is the boosting value for the IA class. The CPU picks the LWP
with the highest pri to execute next.

When a user sets nice -19 on an LWP, ts_upri is -59. Since the largest
ts_cpupri in the table is 59, pri is always 0, unless it is in IA
and has a boost value. If a user wants finer control of the priority,
instead of using nice, he/she can use priocntl to set ts_upri directly.

Hope this helps,

Kelvin



 
=================================TOP===============================
 Q255: Re: destructors and pthread cancelation...   
Hi Bil,

I noticed that you responded to a fellow indicating that the Sun C++
compiler will execute local object destructors upon pthread_exit() and also
if canceled.

Do you know what version of the compiler does this?

As you may recall, I sent you a very long winded email last year complaining
about how UNIX signal handling, C++ exception handling, and pthread
cancellation don't work together.

This bit of information about compiler support on pthread_exit and
cancellation would help solve most of my problems.

ie) if a SIGSEGV occurs, or some fatal FPE, my signal handler could simply
call pthread_exit and I'd get stack based object destructors invoked for
free (yay!).



Do you know if these semantics of pthread_exit and cancellation will be
adopted by the POSIX committee at some point????

I've also heard rumblings that there is a new PThreads standard draft...  I
haven't seen anything though... word of mouth...

Cheers,

John.

John Bossom
[email protected]


=================================TOP===============================
 Q256: A slight inaccuracy WRT OS/2 in Threads Primer  
 
From: Julien Pierre  

Thanks for a most excellent book.

I have been doing multithreaded programming under OS/2 for about 5 years;
yet I would never have thought I could learn so much from a threads book.
How wrong I was!

Now, that said, there is a slight inaccuracy WRT OS/2 on page 102 : there
is SMP support in OS/2 version 2.11 SMP ; and OS/2 Warp Server Advanced
SMP.

These versions of OS/2 have special kernels modified for SMP, and not all
device drivers work properly with it ; but all 32-bit OS/2 apps that I
have ever tried on it worked. I have found problems with some older 16-bit
OS/2 apps that didn't behave, because they were relying on an old
Microsoft C runtime library which used "fast RAM" semaphores that aren't
SMP safe. The problem was fixed by marking the executable as uniprocessor
with a utility provided with OS/2 SMP - so that its threads would always
run on the same CPU.

These problems with device drivers and many 16-bit apps are probably part
of the reason why IBM hasn't been shipping the SMP kernel in the regular
version of OS/2 (Warp 4). Warp Server SMP does make a very nice operating
system though (I run it at home on one system - see
http://strange.thetaband.com/).

-- 
--------------------------------------------------------------------
Julien Pierre               http://www.madbrain.com
Theta Band Software LLC     http://www.thetaband.com
--------------------------------------------------------------------  
  

=================================TOP===============================
 Q257: Searching for an idea  
 
Eloy,

  Sure... Let's see what we can think up here...

  How about one of these:

o  The majority of client/server applications are limited more
   by disk I/O than by CPU performance, thus Java's slower computing
   power is less of an issue than in other areas.  (A) is this really
   true?  (B) What configuration of threads do you need to match the
   performance of C programs for a simple, well-defined problem?  (C)
   What do you need to do with java to obtain this level of performance?

o  One problem with Java's wait/notify architecture is that, for problems
   like consumer/producer, many redundant wakeups may be required in 
   order to ensure that the RIGHT threads get woken up (by use of notifyAll()).
   For an optimally configured program, what is the reality of this problem?
   (See my article in Aug. Java report for a lengthy description of this.)

o  Java native threads (on Solaris 2.6, JDK 1.2) use "unbound" threads.  These
   are *supposed* to provide adequate LWPs for any I/O bound problem so that
   the programmer doesn't need to call the native methods for thr_setconcurrency().
   How well does this work for "realistic" programs?  (Can you find any 
   commercial applications that face this issue?)

-Bil

> 
>         I'm a Spanish computer science student, searching for an idea
> for my final project, before getting my degree; but as of today, I
> haven't found it.
> 
>         Can you give me any ideas? I'm interested in JAVA, especially in
> multithread programming.
> 
>         If you would like to help me, please send an e-mail to:
> [email protected]
>                                                     Thank you very much.
>            Eloy Salamanca. Tlf: 954 360 392  (Spain). E-mail:
>                       [email protected]

-- 
===============
Bil LambdaCS.com

=================================TOP===============================
 Q258: Benchmark timings from "Multithreaded Programming with Pthreads"  

I ran some of the benchmark timings from Bil Lewis's book "Multithreaded
Programming with Pthreads" to get a rough idea how LinuxThreads compares
with PMPthreads. I only have a uniprocessor (Intel 200 MHz MMX w/Red Hat
Linux 5.1) to test with, but the results are interesting anyway.

In case you have an interest in running the benchmarks yourself, I have
attached the performance programs distribution that compiles on
LinuxThreads and PMPthreads. You need to recompile the tools for each
library. Use "make -f Makefile.linuxthreads" to build the LinuxThreads
version, and "make -f Makefile.pmpthreads to build the PMPthreads
version. Use "make -f Makefile.pmpthreads clean" between recompiles. The
"test_timings.sh" script runs the tests. I'd be interested in the
results others get.

Here are the results I got:

                                 (second to complete)
                              PMPthreads      LinuxThreads
                              ----------      ------------
lock                            8.09             10.15
try lock                        3.77              8.69
reader                          8.91              6.24
writer                          9.15              6.52
context switch (bound)         10.82             49.18
context switch (unbound)       10.82             49.17
sigmask                        19.19              6.15
cancel enable                   9.65              4.54
test cancel                     2.06              3.94
create (bound)                  1.61             44.84
create (unbound)                1.61             45.64
create process                 13.25             15.08
global                          4.23              4.20
getspecific                    10.31              2.53

Looks like LinuxThreads pays a big price for thread creation and context
switching. The raw data for these results is included in the attached
file.

With some verifications of the results and some commentary, this might
be worth a page on the Programming Pthreads website.

$cott
=================================TOP===============================
 Q259: Standard designs for a multithreaded applications?  

> Hi  All,
>   I want to know whether there are any standard design techniques for
> developing a  multithreaded application.   Are there any
> books/documents/websites which discuss multithreaded design issues?  I
> am mainly interested in design issues which help different programmers
> working on the same project to coordinate in developing a multithreaded
> application.
>   Any suggestion or experience in this regard is welcome.
>    I am interested in designing or reverse engineering multithreaded
> server applications using C and not C++ or Java.

There are a great many books that cover parallel programming: 
algorithms, programming models, library APIs, etc.  However, 
few cover the design and construction of parallel software.  
The following texts may be more relevant than most:


Multithreading Programming Techniques 
By Prasad, Shashi

Online Price: $39.95
Softcover; 410 Pages
Published by McGraw-Hill Companies
Date Published: 01/1997
ISBN: 0079122507

http://www.amazon.com/exec/obidos/ASIN/0079122507/qid%3D916076814/002-8883088-6545834
--------

Structured Development of Parallel Programs 
By Pelagatti, Susanna

Online Price: $44.95
Softcover; 600 Pages
Published by Taylor and Francis
Date Published: 11/1997
ISBN: 0748407596

http://www.amazon.com/exec/obidos/ASIN/0748407596/qid=916076864/sr=1-1/002-8883088-6545834
--------

Designing and Building Parallel Programs : 
    Concepts and Tools for Parallel Engineering 
By Foster, Ian T.

Online Price: $50.95
Hardcover; 600 Pages
Published by Addison-Wesley Publishing Company
Date Published: 12/1994
ISBN: 0201575949

http://www.amazon.com/exec/obidos/ASIN/0201575949/o/qid=916076740/sr=2-1/002-8883088-6545834
--------

Foundations of Parallel Programming 
By Skillicorn, David

Online Price: $39.95
Hardcover; 
Published by Cambridge University Press
Date Published: 12/1994
ISBN: 0521455111

http://www.amazon.com/exec/obidos/ASIN/0521455111/qid=916076568/sr=1-3/002-8883088-6545834

--
Randy Crawford
[email protected]
[email protected]


=================================TOP===============================
 Q260: Threads and sockets: Stopping asynchroniously  

Neil--

I've found that the cleanest way to do this (with regards to portability)
is to set up a unique pipe for every thread that you might want to 
interrupt.  Instead of doing a read() in your reader thread you'd use
select():

FD_ZERO(&fds);
FD_SET(reader_fd, &fds);
FD_SET(msgpipe_fd, &fds);

ready = select(highestfd+1, &fds, NULL, NULL, NULL);
if (FD_ISSET(msgpipe_fd, &fds)) {  /* We've been interrupted */
    .. drain the pipe ..
    .. handle the event gracefully ..
}

if (FD_ISSET(reader_fd, &fds)) {   /* We've received data */
    .. grok the data ..

}


Now, from your controlling thread (the one which is interrupting the blocked
thread) you could write 1 byte to the pipe, which
would wake that thread up from its select().
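
The corresponding wakeup is just a one-byte write on the pipe's write end
(the descriptor name below is assumed, not taken from the code above):

#include <unistd.h>

/* Called from the controlling thread to wake the select() above. */
void interrupt_reader(int msgpipe_write_fd)
{
    char c = 1;
    (void) write(msgpipe_write_fd, &c, 1);
}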

This seems like a lot of work, but it's probably the only portable way
of accomplishing this task.  Trying to do this with signals is ugly and
potentially unreliable.

Hope this helps,
-S


=================================TOP===============================
 Q261: Casting integers to pointers, etc.  

> Oh Lord!  Is that true?  "casting integers to pointers..." ?  Who the
> !@$!@$ came up with this idea that casting is allowed to change bits?
> If *I* were King...

    'Tis true.  A cast doesn't mean "pretend these bits are of type X,"
it is an operator meaning "convert this type Y value to the type X
representation of (approximately) the same value."

    For example:

    double trouble = 3.14;
    double stubble = (int)trouble;

Surely the `(int)' operator is allowed to "change bits," is it not?

> I do not know of any machines where this does not work, however.  DEC,
> Sun, HP, SGI, IBM all cast back and forth as expected.  Are there any?

    There are certainly machines where `int' and `void*' are not even
the same size, which means convertibility between `int' and `void*' cannot
possibly work for all values.  I believe DEC's Alpha (depending on compiler
options) uses a 32-bit `int' and a 64-bit `void*'; there are also ugly
rumors about various "memory models" on 80x86 machines.

    In any case, it's not a crippling restriction.  You want to pass an
`int' (or a `double' or a `struct foobar' or ...) to a thread?  No problem,
just a slight clumsiness:

    struct foobar x;
    pthread_create (..., func, &x);    /* or `(void*)&x' if there are no
                     * prototypes in scope */

    ...
    void func(void *arg) {
        struct foobar *xptr = arg;
        struct foobar xcpy = *(struct foobar*)arg;  /* alternative */

----
[email protected]
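
For anyone who wants to see the whole thing in one piece, here is a minimal
compilable sketch of passing a struct to a thread by address; the field names
are made up for the example:

#include <pthread.h>
#include <stdio.h>

struct foobar { int id; double value; };

static void *func(void *arg)
{
    struct foobar *xptr = arg;              /* void* converts implicitly in C */
    printf("id=%d value=%g\n", xptr->id, xptr->value);
    return NULL;
}

int main(void)
{
    struct foobar x = { 42, 3.14 };
    pthread_t tid;

    pthread_create(&tid, NULL, func, &x);   /* pass by address, not by cast */
    pthread_join(tid, NULL);                /* x must outlive the thread */
    return 0;
}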


 Eric,

>     'Tis true.  A cast doesn't mean "pretend these bits are of type X,"
> it is an operator meaning "convert this type Y value to the type X
> representation of (approximately) the same value."

  Grumble.  Oh well.

> > I do not know of any machines where this does not work, however.  DEC,
> > Sun, HP, SGI, IBM all cast back and forth as expected.  Are there any?
> 
>     There are certainly machines where `int' and `void*' are not even
> the same size, which means convertibility between `int' and `void*' cannot

  Are there?  I don't doubt that there WERE, but any today?


>     In any case, it's not a crippling restriction.  You want to pass an
> `int' (or a `double' or a `struct foobar' or ...) to a thread?  No problem,
> just a slight clumsiness:

  BIG clumsiness.

  It is also true that every one of us who has written on the subject has
completely ignored this little detail.

  Thanks for the insight.
 

=================================TOP===============================
 Q262: Thread models, scalability and performance   

In reference to your comment below on mutex context switching, you are wrong.

When thread A releases the lock, time slicing may not switch in quickly enough
for threads B and C to grab the lock, and thread A's execution may be quick
enough to grab the lock again before the OS allows B or C to attempt to grab
it. This scenario usually occurs in highly threaded applications, where some
threads seem to be starved. You can actually see this if you use a real-time
thread profiler such as Rational Quantify or Purify, or Failsafe on AIX.


In reference to data sinks and streams: remember that large string objects
do not perform well, and byte arrays are the mechanism of choice. But streams
can also be buffered, so writing to a stream performs better when sending
large amounts of data from one thread to another.

In my case I have an MQ channel which does messaging to the mainframe on
AIX. I have a socket server which takes in requests and places them on the
queue to be processed on the other side of the receiving channel.

What I do is divide the socket itself into two streams on two separate
threads, one talking to the client while the other is sending data onto the
channel and getting data from the channel. The data (or message) is huge in
size and the queue manager usually breaks it up.  But after I get all the
segments back, I need to reformat it slightly and send it back out on the
socket.

Using a stream to talk to the threads provides the fastest way to send raw
string data. Remember this is an async operation, and an async operation is
faster than waiting for the queue manager to reply and then sending out the
message to the user. Streams are a better design.

But if you want to send simple messages then objects are easier, just send
into each thread the object reference and have the threads talk to the
object to bridge the threads in communication.

I prefer the streams mechanism overall.

I think the statement you made misses the application example I gave,
so I will recap it.

I have a socket based server, where I create the master worker object first
- then send the sock into it when I get the accept().

I then spawn two smaller worker threads which communicate with the socket
(one reading, the other writing), and the two threads communicate via
streams. The one that is reading from the socket is getting the message
that needs to be relayed to the MQ channel, while the other thread is
writing to the socket and getting the data from the MQ channel.

I use this same mechanism for another server that does database work on DB2
on AIX as well.

Currently I am revamping the whole server and implementing the JGL
containers into the model.

Thanks
Sean






Bil Lewis wrote:

> Sean,
>
> Thanks 10^6!  That helps a lot.  A couple  of thoughts:
>
> > Thread A hits the method, gets a lock on the
> > method and begins to write to the file system. Threads B and C are attempting to write
> > to the logs too, but can not get a lock since thread A has it and is not finished. So
> > threads B and C wait. Now the operating system has time slicing and places those threads
> > in a suspend state. Thread A completes writing to the logs, and releases the lock. Now
> > thread A does some processing. Thread B and C are still in suspend state. Meanwhile
> > thread A attempts to write to the logs again. It gets a lock and does so. Meanwhile
> > thread B and C come out of suspend (due to the operating system time slicing the
> > threads) and they try to write to the logs but can not again. They suspend, and the
> > cycle repeats over and over again.
>
> It better not repeat!  The very next time A unlocks the mutex, A will be
> context switched (it now has lower priority than B & C), and the lock
> will be available.  Certainly this is the behavior I (think!) I see on
> Solaris. ??
>
> > >
> > >   This surprises me.  Using a stream to communicate between threads?  This
> > > would certainly be 10x slower than using simple wait/notify.  (?!)
> > >  Streams are the basis for talking on the
> > socket which is interprocess communication, right? You have a client on a process who is
> > communicating to a remote process, the socket is the communication, but the streams off
> > the socket provide the fine grain communication.
>
> I can certainly see that you can do this, and in IPC it makes some sense.
> But I don't see it in threads.  It would be both limiting and slow.  (Let's
> see...  I have a Java program where I pump ~100 strings/sec across a socket.
> Versus 10,000+ communications via synchronized methods.) ?
>
> Best Regards,
>
> -Bil

=================================TOP===============================
 Q263: Write threaded programs while studying Japanese!   

Yes, indeed!  You too can now learn the subtle beauty of
the Japanese language while advancing your programming
skills in Pthreads!


         Hurry!  Hurry!  Hurry!


I just got a copy of the Japanese translation of both
Dave Butenhof's book and my own.  It's great to see all 
this Kanji and then "MUTEX" in English.  I used to live
in Kenya, where this happened all the time.  It was
pretty funny.  "Mimi, sijui engine block iko wapa."  (Roughly: "Me, I
don't know where the engine block is.")
 

=================================TOP===============================
 Q264: Catching SIGTERM - Linux v Solaris  
 
Lee,

  I didn't notice you declaring the signal handler. You need to
have a signal handler (even tho it'll never get called!) set up
for many signals in order to use sigwait().  The handler turns
off the default signal action.

-Bil

> 
> I wonder if anyone could shed light on the following problem I
> am having. Early in my server's execution I create a thread
> to wait on SIGTERM (and SIGINT) and shut down the server cleanly.
> The shutdown thread works as expected when compiled on Linux
> (libc5 + LinuxThreads, SuSE 5.2) but it doesn't seem to catch
> the signals on Solaris (only tried 2.6). The shutdown thread
> is as follows:
> ...
> Thanks for any information.
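
A minimal sketch of such a shutdown thread, assuming a modern POSIX
implementation where the key step is blocking the signals (with
pthread_sigmask()) in main() before any threads are created, then picking
them up with sigwait():

#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static void *shutdown_thread(void *arg)
{
    sigset_t *set = arg;
    int sig;

    if (sigwait(set, &sig) == 0) {
        fprintf(stderr, "caught signal %d, shutting down\n", sig);
        /* ... clean up server state here ... */
        exit(0);
    }
    return NULL;
}

int main(void)
{
    sigset_t set;
    pthread_t tid;

    sigemptyset(&set);
    sigaddset(&set, SIGTERM);
    sigaddset(&set, SIGINT);
    pthread_sigmask(SIG_BLOCK, &set, NULL);   /* mask inherited by new threads */

    pthread_create(&tid, NULL, shutdown_thread, &set);

    /* ... rest of the server ... */
    pthread_join(tid, NULL);
    return 0;
}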
 
=================================TOP===============================
 Q265: pthread_kill() used to direct async signals to thread?  

Darryl,

  Yes, you can.  But no, you don't want to.

  What were you using your signal handlers for?  To provide
some version of asynchronous processing.  But now you can do
that processing synchronously in another thread!  (This is
a *good* thing.)

  For backwards-compatibility you may wish to retain the basic
model of sending signals to the process, but you can do that with
sigwait() in one thread, blocking it out from all other threads.

  So look at your app carefully.  It is very likely that you can
make it simpler and more robust with less effort.

-Bil

 
> I'm porting a multi-process based application into a thread
> environment.  So I've read about the traditional signal model using
> sigaction() and sigprocmask() etc,  and the "new" signal model using
> sigthreadmask() and sigwait() etc ....  But, can't I just redirect my
> old SIGABRT, SIGUSR signals (previously between processes) now to a
> specific thread with pthread_kill()?    Sure if someone issues a command
> line kill() with a SIGUSR then that delivery will be indeterminate since
> it is enabled for all the threads but with enough global state data the
> handler can probably manage to ignore that one.  Have I missed something
> here?
 

=================================TOP===============================
 Q266: Don't create a thread per client  
David Snearline wrote:
> 
> Bil Lewis wrote:
> 
> > Nilesh,
> >
> >   While it's certainly interesting to experiment with lots of threads, I
> > don't think you want to do anything serious like this.  I think you'll be much
> > happier doing a select() call and creating only a few dozen threads.
> >
> > -Bil
> >
> > >
> > > My application runs on a Sun SPARCstation with Solaris 2.6 and I am using the
> > > POSIX library.
> > >
> > > The application is a server and creates a thread for each connection accepted
> > > from a client,
> > > potentially the server is expected to handle up to 1000 connections. Therefore
> > > the server is expected
> > > to create up to 1000 threads.
> > > Nilesh
> 
> Greetings,
> 
> I was somewhat intrigued by your comment here, and was wondering what the rationale
> was behind it.  I've done many servers under Solaris using an accepting thread plus a
> thread per connection, and so far, I've been pretty happy with results.  Then again,
> this usually involves a hundred threads or so max -- not a thousand.
> 
> Since most of the threads end up being blocked in the kernel on I/O, the only
> drawback I can see are the per-thread resources of the (mostly) idle threads.
> Provided that these resources are kept small, running a thousand threads shouldn't be
> a problem.  Is there more here that I'm missing?

Oh, it's just that you're using up all this memory for the threads and you don't 
need to.  Might as well have one thread block on 1000 fds as have 1000 threads
each blocking on one.  For large numbers of fds, I'd expect to see some real
performance gains.

-Bil 
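
To make the suggestion concrete, here is a rough sketch of one thread
multiplexing many client sockets with poll(); the constants and the buffer
handling are illustrative only:

#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_CLIENTS 1000

void serve(int listen_fd)
{
    struct pollfd fds[MAX_CLIENTS + 1];
    int nfds = 1;

    fds[0].fd = listen_fd;
    fds[0].events = POLLIN;

    for (;;) {
        if (poll(fds, nfds, -1) < 0)
            break;

        if (fds[0].revents & POLLIN) {           /* new connection */
            int c = accept(listen_fd, NULL, NULL);
            if (c >= 0 && nfds <= MAX_CLIENTS) {
                fds[nfds].fd = c;
                fds[nfds].events = POLLIN;
                nfds++;
            }
        }
        for (int i = 1; i < nfds; i++) {
            if (fds[i].revents & POLLIN) {
                char buf[512];
                ssize_t n = read(fds[i].fd, buf, sizeof buf);
                if (n <= 0) {                    /* client went away */
                    close(fds[i].fd);
                    fds[i] = fds[--nfds];        /* swap-remove this slot */
                    i--;
                } else {
                    /* hand buf off to a small pool of worker threads,
                       or process it inline if the work is cheap */
                }
            }
        }
    }
}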


=================================TOP===============================
 Q267: More thoughts on RWlocks  

>     As many of you know, the first and second readers-writers problems
> always starve either the readers (first) or the writers (second) .. I
> learned this in my operating systems textbook.....  I was reading along
> anticipating the solution which would not starve either the readers or
> the writers, but he then referred me to the bibliography page.....  And
> it wasn't much help..  Does anyone know of a solution which does not
> starve either one...

I'll stick my neck out on this one...

If starvation is a problem then RWlocks are not the answer.

RWlocks are only useful when writers are rare.  That's why
writer-preference makes sense.

If writers are so common that they can occupy the lock for 
significant amounts of time, then you shouldn't be using
RWlocks.  Actually, if they are so common, what you should be
doing is buying faster hardware! Or lowering the number of 
incoming requests.

Sure, you can always find an exceptional case where a special
version of RWlocks will give good results, but we're not trying
to solve imaginary problems here. For real problems the correct
answer is "Don't do that!"

-Bil


-- 
===============
Bil LambdaCS.com

=================================TOP===============================
 Q268: Is there a way to 'store' a reference to a Java thread?  

> Is there a way to 'store' a reference to a thread at a certain point and
> run a command in that thread at a later point in time?

  Of course there is!  (Just depends on what you really mean.)

RunnableTask task = new RunnableTask();
Thread t = new Thread(task);
       ^ reference
t.start();
...


task.addCommandToTaskQueue(new Command());
     (this puts the command on the task's queue and wakes up the thread if
      it is sleeping.)


  This may not be what you were THINKING of, but it's probably
what you REALLY want.

-Bil
=================================TOP===============================

 Q269: Java's pthread_exit() equivalent?   

[Simple question, I thought.  LOTS of answers!  For my own use and
for my Java Threads book I implemented InterruptibleThread.exit(),
but there is a lot of logic to insisting that the run() method
be the one to simply return. -Bil]


Bil Lewis writes:
 > Doug,
 >  
 >     A question for you.
 >  
 >    In POSIX, we have pthread_exit() to exit a thread.  In Java we
 >  *had* Thread.stop(), but now that's gone.  Q: What's the best way
 >  to accomplish this?
 >  
 >    I can (a) arrange for all the functions on the call stack to
 >  return, all the way up to the top, finally returning from the
 >  top-level function.  I can (b) throw some special exception I
 >  build for the purpose, TimeForThreadToExitException, up to the
 >  top-level function.  I can throw ThreadDeath.
 >  
 >    But what I really want is thread.exit().
 >  
 >    Thoughts?
 >  
 >  -Bil
 > -- 
 > ================
 > Bil LambdaCS.com
 > 
 > http://www.LambdaCS.com
 > Lambda Computer Science 
 > 555 Bryant St. #194 
 > Palo Alto, CA,
 > 94301 
 > 
 > Phone/FAX: (650) 328-8952
 > 

Here's a real quick reply (from a slow connection from
Sydney AU (yes, visiting David among other things)). I'll
send something more thorough later....

Throwing ThreadDeath yourself is a pretty good way to force current
thread to exit if you are sure it is in a state where it makes sense
to do this.

But if you mean, how to stop other threads: This is one reason why
they are extremely unlikely to actually remove Thread.stop(). The next
best thing to do is to take some action that is guaranteed to cause
the thread to hit a runtime exception. Possibilities range from the
well-reasoned -- write a special SecurityManager that denies all
resource-checked actions, to the sleazy -- like nulling out a pointer
or closing a stream that you know the thread needs. See
  http://gee.cs.oswego.edu/dl/cpj/cancel.html
for a discussion of some other alternatives.


-Doug
 

Hi Bil,
Here's the replies I got to your question.

     Peter

------------------------------------------------------------------------
---
Check out the following url's. They give a good description of the
problem and implementation details for better ways to stop a thread
gracefully.


http://java.sun.com/products/jdk/1.2/docs/guide/misc/threadPrimitiveDeprecation.
html
http://gee.cs.oswego.edu/dl/cpj/cancel.html

Brian

---------------------------------------------------------------------
From: Jeff Kutcher - Sun Houston 
Subject: Re: A threadbare question
To: Peter.Vanderlinden@Eng


Here's a suggestion:

    private Thread thread = null;

    public void start() {
        if (thread == null) {
            thread = new Thread(this);
            thread.start();
        }
    }

    public void stop() {
        thread = null;
    }
    
    public void run() {
        while (thread != null) {
            try {
                ...
            } catch (InterruptedException e) {
                thread = null;
                // using stop() may cause side effects if the class is extended
            }
        }
    }
    

--------------------------------------------------------------------------
-------

From Lee.Worrall@UK  Tue Aug 18 09:03:12 1998

I believe the recommended way to exit a thread is to have it drop out of the bottom 
of its run() method (probably in response to some externally triggered event).

lee

> Date: Tue, 18 Aug 1998 08:53:45 -0700 (PDT)
> From: Peter van der Linden 
> Subject: A threadbare question
> To: [email protected]
> 
> A thread knowledgeable colleague asks...
> 
> ------------------
> 
> In POSIX, we have pthread_exit() to exit a thread.  In Java we
> *had* Thread.stop(), but now that's gone.  Q: What's the best way
> to accomplish this?
> 
>   I can (a) arrange for all the functions on the call stack to
> return, all the way up to the top, finally returning from the
> top-level function.  I can (b) throw some special exception I
> build for the purpose, TimeForThreadToExitException, up to the
> top-level function.
> 
>   But what I really want is thread.exit().
> 
> -----------------
> 
> Anyone have any ideas?
> 

 
Seek, and ye shall find.

    Peter
    
------------- Begin Forwarded Message -------------

ThreadDeath is an Error (not an Exception, since app's routinely
catch all Exceptions) which has just the semantics you are talking
about: it is a Throwable that means "this thread should die".  If
you catch it (because you have cleanup to do), you are SUPPOSED to
rethrow it.  1.2 only, though, I think.  Thread.stop() uses it, but
although stop() is deprecated, it appears that ThreadDeath is not.

I think.  :^)

Nicholas
 
 
>   I was feeling much more sure of myself before you asked the question.
> Now I need to think.  There are certainly situations where you do want
> to exit a thread.  The options would seem to be a Thread.exit() method,
> or an explicit throw of an exception.  What else?  (You sure wouldn't
> want to have "special values" to return from functions to get to the
> run() method.)
> 
>   If Java had named code blocks, you could do a direct goto.  But I
> don't see why that would be good.  Not in general.
> 
>   I don't see any logic for insisting on having an explicit exception.
> (Is there?)

You mean as in throwing an exception in another thread to indicate
termination status of the dying thread? If so: no, there isn't,
although it is always possible to hand-craft this kind of effect.

> 
>   No, I think that a Thread.exit() method defined to throw ThreadDeath
> is the way to go.

In which case, there is no real need for a method; just `throw new
ThreadDeath()' would do. 

When thread.stop() was in the process of being deprecated I argued
that there should be a Thread.cancel() method that is defined as

  setCancelledBit();
  interrupt()

along with a method isCancelled(), and an associated bit in the Thread
class. The idea is that interrupts can be cleared, but the cancel bit
is sticky, so reliably indicates that a thread is being asked to shut
down. But apparently some people (I think database folks) really want
the freedom to do retries -- in which case they must clear interrupts,
catch ThreadDeaths, and so on, and don't want anything standing in the
way of this.

> (PH is talking to me about modifying my PThreads book into Java.  I'm
> sort of mixed on the idea.)

I think it would be great to have something a lot better than Oaks and
Wong as the `lighter', gentler, more traditionally MT-flavored
alternative to my CPJ book. I think you could do a lot of good in
helping people write better code. (Tom Cargill has been threatening to
write such a book for years, but I don't think he will.)

(People would then complain that your book is insufficiently OO, making
a perfect complement to complaints that my book is insufficiently MT :-)

BTW, Have you seen my util.concurrent package? (see
http://gee.cs.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent).
I'd be very interested in your reactions. I'm trying to standardize
some of the more common utility classes people use in concurrent
programming.




Doug,

> >   If Java had named code blocks, you could do a direct goto.  But I
> > don't see why that would be good.  Not in general.
> >
> >   I don't see any logic for insisting on having an explicit exception.
> > (Is there?)
> 
> You mean as in throwing an exception in another thread to indicate
> termination status of the dying thread? If so: no, there isn't,
> although it is always possible to hand-craft this kind of effect.

  No.  "Named blocks" isn't a familiar term?  It's just a clean way of
doing longjmp().  Java *should* have it.

> >   No, I think that a Thread.exit() method defined to throw ThreadDeath
> > is the way to go.
> 
> In which case, there is no real need for a method; just `throw new
> ThreadDeath()' would do.
> 
> When thread.stop() was in the process of being deprecated I argued
> that there should be a Thread.cancel() method that is defined as
> 
>   setCancelledBit();
>   interrupt()
> 
> along with a method isCancelled(), and an associated bit in the Thread
> class. The idea is that interrupts can be cleared, but the cancel bit
> is sticky, so reliably indicates that a thread is being asked to shut
> down. But apparently some people (I think database folks) really want
> the freedom to do retries -- in which case they must clear interrupts,
> catch ThreadDeaths, and so on, and don't want anything standing in the
> way of this.

  I was a bit leery of interrupts until I looked at them more closely.
I think now that they're pretty reasonable.

  So the last remaining question for me is: "Should I do an explicit
throw?  Or just call stop() anyway?"  (I don't want to write my own
subclass BilsThread that implements a java_exit() method.)
 
> > (PH is talking to me about modifying my PThreads book into Java.  I'm
> > sort of mixed on the idea.)
> 
> I think it would be great to have something a lot better than Oaks and
> Wong as the `lighter', gentler, more traditionally MT-flavored
> alternative to my CPJ book. I think you could do a lot of good in
> helping people write better code. (Tom Cargill has been threatening to
> write such a book for years, but I don't think he will.)
> 
> (People would then complain that your book is insufficiently OO, making
> a perfect complement to complaints that my book is insufficiently MT :-)

  Touche'!
 
> BTW, Have you seen my util.concurrent package? (see
> http://gee.cs.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent).
> I'd be very interested in your reactions. I'm trying to standardize
> some of the more common utility classes people use in concurrent
> programming.

  As soon as I get back from Utah...




Doug Lea wrote:
> 
> > > But I would only do this if for some reason using interrupt() had to
> > > be ruled out.
> >
> >   ?  interrupt() is unrelated.  I assume my thread has already gotten the
> > interrupt and has decided to exit.  I've got to check with one of the
> > Java guys, just to get their story on it.  (I'm surprised this has been
> > asked 6k times already.  Something's odd...)
> 
> I think interrupt IS related.  It seems best to propagate the
> interrupt all the way back the call chain in case you have something
> in the middle of the call chain that also needs to do something
> important upon interruption. Don't you think?

  I was thinking in terms of situations where you KNOW your data is
consistent and you've determined that it's time to exit and there's
nothing else to do.  An event which does occur...  Often?  Sometimes?
Only in programs that I write??

  But I see your point.

-Bil


Hi Bil,

Just a comment on the stop(), ThreadDeath issue in Java. The comment you 
include from Nicholas is inaccurate and misleading.

"ThreadDeath is an Error (not an Exception, since app's routinely
catch all Exceptions) which has just the semantics you are talking
about: it is a Throwable that means "this thread should die".  If
you catch it (because you have cleanup to do), you are SUPPOSED to
rethrow it.  1.2 only, though, I think.  Thread.stop() uses it, but
although stop() is deprecated, it appears that ThreadDeath is not."

Yes ThreadDeath is derived from Error. The term "exception" means 
anything that can be thrown. "exceptions" which are derived from 
Exception are checked "exceptions" and must be caught by the caller or 
declared in the throws clause of the caller. But this is not important.

There is *nothing* special about a ThreadDeath object. It does not mean 
"this thread should die" but rather it indicates that "this thread has 
been asked to die". The only reason it "should" be rethrown is that if 
you don't then the thread doesn't actually terminate. This has always 
been documented as such and is not specific to 1.2.

If a thread decides that for some reason it cannot continue with its work,
it can simply throw new ThreadDeath() rather than calling stop()
on itself. The only difference is that with stop() the Thread is 
immediately marked as no longer alive - which is a bug in itself.

Cheers,
David
 
Doug Lea wrote:
> 
> >   I see your point.  Still seems ugly to me though.  (Now, if *I* were
> > king...)
> 
> I'd be interested in your thoughts about this, or what you would like
> to see. I used to think I knew what would be better, but I am not so
> sure any more.
> 
> -Doug

  I was feeling much more sure of myself before you asked the question.
Now I need to think.  There are certainly situations where you do want
to exit a thread.  The options would seem to be a Thread.exit() method,
or an explicit throw of an exception.  What else?  (You sure wouldn't
want to have "special values" to return from functions to get to the
run() method.)

  If Java had named code blocks, you could do a direct goto.  But I
don't see why that would be good.  Not in general.

  I don't see any logic for insisting on having an explicit exception.
(Is there?)

  There's plenty to be said about how to ensure consistent data in
such situations.  But I don't think that has to determine the exit
method.

  No, I think that a Thread.exit() method defined to throw ThreadDeath
is the way to go.

-Bil

(PH is talking to me about modifying my PThreads book into Java.  I'm
sort of mixed on the idea.)
-- 
===============
Bil LambdaCS.com

=================================TOP===============================
 Q270: What is a "Thread Pool"?  

 
> So I want to allocate a pool of threads and while the program
> is executing I want to use the threads to service all the different
> modules in the program. This means that there are going to be
> times where I want to change the addresses of procedures that
> threads are using.
> 
> So I create the thread pool with all threads having NULL function
> pointers, and all threads created are put to sleep.
> Some time later different modules want to be serviced, so I look and
> see if a thread is available, and if one is then I assign the function
> to that thread and make the thread active, which starts execution
> of the assigned function. After the thread finishes it would be put to
> sleep...and available for use by another module....
> 
> is this possible or have I been smoking too much crack?



Rex asks a question which we've seen here a dozen times.
It's a reasonable question and kinda-sorta the right idea
for a solution, but the angle, the conceptual approach, the
metaphor is wrong.

The term "thread pool" conjures up a temp agency where you
wake up typists when you need them and give them a job to do.

This is a lousy way to think of programs. You shouldn't be
thinking about "giving the threads work to do". You should
be thinking about "announcing that there is work to do" and
letting threads pick up that work when they are ready.

The Producer/Consumer model is the way to go here. A request
comes in off the net, the producer puts it on a queue, and a
consumer takes it off that queue and processes it. Consumer
threads block when there's nothing to do, and they wake up and
work when jobs come along.

Some will argue that "Thread Pool" is the same thing. Yes, but.
We've seen SO many questions about "stopping and starting" 
threads in a pool, "giving a job" to a specific thread etc.
People try to implement something along these lines and totally
mess up.

Read my book. Read Dave's book. Read pretty much any of the 
books. We all say (more or less) the same thing.

So, don't think "Thread Pool", think "Producer/Consumer". You'll
be happier.

A good example of a Producer/Consumer problem can be found in
the code examples on www.LambdaCS.com.

-Bil
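
A bare-bones sketch of the Producer/Consumer queue being described -- work is
announced on a condition variable and any free worker picks it up (error
handling omitted, names invented for the example):

#include <pthread.h>
#include <stdlib.h>

typedef struct job { struct job *next; void (*fn)(void *); void *arg; } job_t;

static job_t *head, *tail;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work_ready = PTHREAD_COND_INITIALIZER;

void submit(void (*fn)(void *), void *arg)       /* producer side */
{
    job_t *j = malloc(sizeof *j);
    j->next = NULL; j->fn = fn; j->arg = arg;

    pthread_mutex_lock(&lock);
    if (tail) tail->next = j; else head = j;
    tail = j;
    pthread_cond_signal(&work_ready);            /* announce there is work */
    pthread_mutex_unlock(&lock);
}

void *worker(void *unused)                       /* consumer side */
{
    for (;;) {
        pthread_mutex_lock(&lock);
        while (head == NULL)                     /* sleep until work arrives */
            pthread_cond_wait(&work_ready, &lock);
        job_t *j = head;
        head = j->next;
        if (head == NULL) tail = NULL;
        pthread_mutex_unlock(&lock);

        j->fn(j->arg);                           /* do the work, unlocked */
        free(j);
    }
    return NULL;
}

Start a few dozen workers with pthread_create(&tid, NULL, worker, NULL); they
all sleep in pthread_cond_wait() until submit() announces work.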


=================================TOP===============================
 Q271: Where did "Thread" come from?  

I just picked up on your post to comp.programming.threads, and I noticed
your (?) concerning the term "thread."  I first heard this term used in the
late '60s in a commercial IBM S/360 shop here in Dallas, TX.  One of the
"heavy weights" (Jim Broyles, still at BC/BS of TX) was writing/developing a
general purpose TP monitor: the "host" was a 360-40 running DOS (DOS
supported 3 application "partitions": BG, FG1, FG2); the
lines/controllers/terminals managed were IBM 2260s (or "look-alikes").  I do
not know how many threads Jim's TP monitor used, but this system was used at
BC for almost 10 years.  The system was written in assembler.  All of this
was "pre" CICS, TSO, etc.

Jim Broyles went on to become manager of System Programming for BC/BS ..  I
worked for him for maybe 5-6 years in the mid '70's.  Support for
application threading in S/360 DOS was likely pretty "limited", but big "OZ"
... S/360 OS-MFT/MVT, SVS, MVS provided good facilities for
multi-programming, and, IBM was pushing MP and AP (smp) systems.  We had a
S/370 AP system installed when I left BC/BS (1979).

Net/net, the term has "been around a while."

=================================TOP===============================
 Q272: How do I create threads in a Solaris driver?  

Kernel space is a different beast from user space threading and I
don't deal with that myself.  BUT I know that Solaris kernel threads
are very similar to POSIX threads. You can run them from device drivers.
The DDI should have the interface for threads, but like I said, I've
never paid it much attention.

I would think that a call to your support line should tell you where to
look.


> 
> Hi, I found your threads FAQ web page and wondered if you'd mind answering
> a question for me.  I'm writing a miscellaneous driver for Solaris (that is
> it isn't tied to hardware) and would like to know how to create my own
> threads in kernel space.  At first glance, there appears to be no support
> for this through the usual DDI/DDK means.  Is this the truth ?  Is there
> a way around this ?  Or is the best way to fake it by doing something like
> using a soft interrupt or timeout to start a function that never returns ?
> 
> Darren

=================================TOP===============================
 Q273: Synchronous signal behavior inconsistent?  


Antonio,

Yes, it *seems* weird, but it's not. (Well, maybe it is still weird,
but at least there's some logic to it.)

If a program accesses unmapped memory, it will trap into the kernel,
which will say to itself something like "What a stupid programmer!"
and then arrange for a SIGSEGV for that program. Basically it will
pick up the program counter right then and there and move it to the
signal handler (if any). That's how synchronous signals work.

If you send a signal, any signal, to the process yourself, that will
be an asynchronous signal. EVEN if it's SIGSEGV or SIGBUS. And the
sigwaiter will then be able to receive it.

-Bil

> So, I guess things are not working quite right in that sometimes a
> blocked signal is not delivered to the - only -  thread which is waiting
> for it.
> I coded an example in which SIGBUS is blocked and a thread is on
> sigwait. I arranged the code so that SIGBUS is "internally" generated,
> i.e. I coded a thread that is causing it on purpose. The process goes
> into a spin.
> If I kill the process with -10 from another shell, the result is as
> expected (the thread on sigwait catches it).
> I find that a little weird.
> 
> Thanks for your suggestions,
> Antonio
> 
> Sent via Deja.com http://www.deja.com/
> Before you buy.

-- 

=================================TOP===============================
 Q274: Making FORTRAN libraries thread-safe?  

"James D. Clippard" wrote:

> I have a need to use several libraries originally written in FORTRAN as part
> of a numerically intensive multithreaded application.  The libraries are
> currently "wrapped" with a C/C++ interface.
>
> -----
> My question is: How might one safely accomplish such a task, given FORTRAN's
> non-reentrant static memory layout?
> -----

The answer is really "it depends".  Firstly, with multi-threading you are going
beyond the bounds of what is possible in standard C/C++, so any solution is
by definition system dependent.  I'm not sure off hand if any version of
FORTRAN (eg 90 or 95) has standardised support for threading, but
somehow doubt it.  F77 never had any standardised support for multi-threading.

Second, "FORTRAN's non-reentrant static memory layout" is not strictly true.
It is definitely not true with F90 or F95.  With F77 (and before) things
are a little ambiguous --- eg lots of vendor specific extensions --- so you
will need to look at documentation for your compiler, or try a couple
of test cases like

            CALL TEST
            CALL TEST
            END

            SUBROUTINE TEST
            INTEGER I
            WRITE (*,*) i
             i = i+1
            RETURN
            END

to see what happens.  I recall (from F77 days) some keywords like
AUTO and SAVE that control whether a variable is static or
auto.  I don't know how widespread they were (or whether or
not they were standard), as my coding practice rarely relied
on that sort of thing.

If your FORTRAN code uses things like common blocks, then you
essentially have a set of static variables that you need to control
access to.  Much the same as you would need for accessing
static variables in C/C++.

In general, you are probably safest using some of the following
schemes.  None of these are really specific to FORTRAN.

1)  Provide a set of wrapper functions in C/C++, as sketched below.  Have the
wrapper functions use mutexes or similar to prevent multiple threads invoking
particular sets of FORTRAN functions.  For example, if
FORTRAN SUBROUTINE A calls B calls C, and you have
a wrapper for each, ensure that a call to C prevents a call to A
on another thread UNLESS you know that all variables in
C are auto.

2)  Control access to common blocks, as hinted above.
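
A sketch of scheme 1, assuming a hypothetical Fortran routine (the symbol
name slatec_solve_ and its argument list are invented for the example; real
name mangling and calling conventions vary by compiler):

#include <pthread.h>

/* Fortran routines typically get a trailing underscore and take all
   arguments by reference; check your compiler's conventions. */
extern void slatec_solve_(double *a, double *b, int *n);

static pthread_mutex_t fortran_lock = PTHREAD_MUTEX_INITIALIZER;

void safe_solve(double *a, double *b, int n)
{
    pthread_mutex_lock(&fortran_lock);   /* only one thread at a time inside
                                            the Fortran code and its COMMON
                                            blocks / SAVEd variables */
    slatec_solve_(a, b, &n);
    pthread_mutex_unlock(&fortran_lock);
}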


>
>
> BTW, one of libraries I am using is SLATEC.  Given that all NRL's base much
> of their research code on SLATEC, I suspect that someone has elegantly
> surmounted this problem.
>

=================================TOP===============================
 Q275: What is the wakeup order for sleeping threads?  

Raghu Angadi wrote:

> A. Hirche wrote:
> >
> > Is it
> > (a) the first thread in the queue (assuming there is an ordered list of
> > waiting threads)
> > (b) any thread (nondeterministic choice)
> > (c) a thread chosen by some other scheme
>
> Threads are queued in priority order.
>
> So the thread with the maximum priority will get the mutex.
>
> If there is more than one thread with max priority, then it is
> implementation dependent.

Not quite!

Actually, POSIX places mutex (and condition variable) wakeup ordering
requirements only when:

  1. The implementation supports the _POSIX_THREAD_PRIORITY_SCHEDULING
     option.
  2. The threads waiting are scheduled using the SCHED_FIFO or SCHED_RR
     policies defined by POSIX.

If these conditions are true, then POSIX requires that threads be awakened
in priority order. Multiple threads of identical priority must be awakened
"first in first out".

For threads that don't use SCHED_FIFO or SCHED_RR (e.g., the default
SCHED_OTHER on many UNIX systems, which behaves more like traditional UNIX
timeshare scheduling), the wakeup order is implementation defined. Most
likely it's still priority ordered, but it need not be, and there's no
definition of how they may interact with POSIX policies.

And, in any case, except on a uniprocessor, saying that "the highest
priority thread gets awakened" is not the same thing as "the highest
priority thread gets the mutex". Some other thread on another processor
might lock the mutex first, and the high priority thread will just go back
to sleep. This can happen even on a uniprocessor if the thread that unlocked
the mutex has a priority equal to that of the highest priority waiter (it
won't be preempted), and it locks the mutex again before the waiter can run.

/---------------------------[ Dave Butenhof ]--------------------------\
=================================TOP===============================
 Q276: Upcalls in VMS?  


Eugene Zharkov wrote:

> I am somewhat confused by the OpenVMS Linker documentation, by the part
> which describes the /threads_enable qualifier. Here is an extract from
> there:
>
> [...]
>
> What confuses me is the following. A section about two-level scheduling
> and upcalls in the Guide to DECthreads explicitly states that "this
> section applies to OpenVMS ALPHA only". The above description of the
> MULTIPLE_KERNEL_THREADS option states that the option is applicable only
> to ALPHA systems. The above description of the UPCALLS options does not
> mention a system it applies to. Does that means that the upcalls
> mechanism is actually implemented on OpenVMS VAX?

(Apologies to Dan Sugalski for apparently ignoring his answers, but since
he wasn't completely sure, I figured it was best to go back to the
beginning for a definitive resolution.)

Despite the cleverly leading hint in the documentation, you should assume
that neither upcalls nor kernel threads exist, nor will they ever exist,
on OpenVMS VAX. While most of the infrastructure for upcalls has been
implemented, there were some "issues" that were never resolved due to lack
of resources, and it has been officially deemed low priority.

Nevertheless, it is theoretically possible that, given enough signs of
interest, the implementation of upcalls on OpenVMS VAX could be completed.
There will never be kernel threads on OpenVMS VAX. (We all know that one
should never say "never".)

/---------------------------[ Dave Butenhof ]--------------------------\
=================================TOP===============================
 Q277: How to design synchronization variables?  


"Kostas Kostiadis"  writes:

> Are there any rules or techniques to build and test
> synchronisation protocols, or is it a "do what you think
> will work best" thing?

Look up the following paper:

Paper Title: ``Selecting Locking Primitives for Parallel Programming''

Paper Author: Paul E. McKenney

Where Published: Communications of the ACM, 
         Vol 39, No 10, 75--82, October 1996.

It is exactly what you need. In the paper, McKenney describes a pattern
language that helps in the selection of synchronization primitives for
parallel programs ...

cheers,

Ramanan
=================================TOP===============================
 Q278: Thread local storage in DLL?  


> I've written some memory allocation routines that may work when my DLL is called
> from multiple threads. The routines use thread local storage to store a table of the
> memory objects that have been allocated. I'm concerned that the code will not work
> properly when the dll is loaded explicitly using LoadLibrary.
> 
> Has anyone experienced that problem?
> Is there a simple solution for a DLL(Win32)?
> Can I allocate memory somehow in my main dllentry point routine?
> 
> Do I need to put a mutex around the calls to malloc to ensure that the code is
> thread-safe?
> 
> You can email me at [email protected]
> Rob

    Microsoft has explicitly stated that what you are doing will not work
when your DLL is loaded with LoadLibrary. The correct solution is to
explicitly use the Tls* functions. Read up on TlsAlloc, TlsFree,
TlsGetValue, and TlsSetValue.

    Of course, you can always roll your own by using GetCurrentThreadId()
to get an index into a sparse array protected with a CRITICAL_SECTION.
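
A minimal sketch of the TlsAlloc()/TlsGetValue() approach (the helper name
get_thread_table is invented for the example; note that DLL_THREAD_DETACH is
not delivered for threads that already existed when LoadLibrary was called):

#include <windows.h>

static DWORD tls_index = TLS_OUT_OF_INDEXES;

BOOL WINAPI DllMain(HINSTANCE hinst, DWORD reason, LPVOID reserved)
{
    switch (reason) {
    case DLL_PROCESS_ATTACH:
        tls_index = TlsAlloc();                  /* one slot for the whole DLL */
        return tls_index != TLS_OUT_OF_INDEXES;
    case DLL_THREAD_DETACH:
        {
            void *p = TlsGetValue(tls_index);    /* free this thread's table */
            if (p) HeapFree(GetProcessHeap(), 0, p);
        }
        break;
    case DLL_PROCESS_DETACH:
        TlsFree(tls_index);
        break;
    }
    return TRUE;
}

/* Called by the DLL's allocation routines to get the calling thread's table,
   creating it lazily on first use. */
void *get_thread_table(SIZE_T size)
{
    void *p = TlsGetValue(tls_index);
    if (p == NULL) {
        p = HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, size);
        TlsSetValue(tls_index, p);
    }
    return p;
}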


=================================TOP===============================
 Q279:  How can I tell what version of linux threads I've got?  


>>>>> "Phil" == Phil McRevis  writes:

    Phil> How can I tell what version of linux threads I've got on my
    Phil> system?  I browsed through the debian bug database and
    Phil> didn't find anything with "threads" in the list of packages
    Phil> or even what version of pthreads is included in the debian
    Phil> distribution.

executing glibc gives you useful information, as in:

levanti@meta:~$ /lib/libc-2.1.2.so
GNU C Library stable release version 2.1.2, by Roland McGrath et al.
Copyright (C) 1992, 93, 94, 95, 96, 97, 98, 99 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 2.95.2 19991109 (Debian GNU/Linux).
Compiled on a Linux 2.2.12 system on 1999-12-25.
Available extensions:
    GNU libio by Per Bothner
    crypt add-on version 2.1 by Michael Glad and others
    linuxthreads-0.8 by Xavier Leroy
    BIND-4.9.7-REL
    NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
    NSS V1 modules 2.0.2
    libthread_db work sponsored by Alpha Processor Inc
Report bugs using the `glibcbug' script to .

    Phil> What version of linux pthreads is considered to be the most
    Phil> stable and bug-free?

No idea.  Hopefully the most recent version.

    Phil> If I need to upgrade the pthreads package on debian, what's
    Phil> involved in doing that?

apt-get update ; apt-get upgrade

=================================TOP===============================
 Q280: C++ exceptions in a POSIX multithreaded application?  

On Sun, 16 Jan 2000 21:38:29 -0800, Bil Lewis  wrote:
>Jasper Spit wrote:
>> 
>> Hi,
>> 
>> Is it possible to use c++ exceptions in a POSIX multithreaded application,
>> without problems ?
>
>No. Using C++ exceptions is always a problem. (ho, ho, ho).
>
>But seriously... the interaction between exceptions & Pthreads
>is not defined in the spec. Individual C++ compilers do (or don't)
>implement them correctly in MT code. EG, Sun's 1993 C++ compiler
>did it wrong, Sun's current C++ compiler does it right.

Could you expand on that? What does it mean for a C++ compiler to do it right?  If we
can put together a set of requirements to patch POSIX thread cancellation and
C++ together, I can hack something up for Linux.

The questions are:

- What exception is the cancellation request turned into in the target thread?
  What is the exception's type? What header should it be defined in?

- Upon catching the exception, what steps does the target thread take to
  terminate itself? Just re-enter the threads code by calling pthread_exit()?

- Are the handlers for unhandled and unexpected exceptions global or
  thread specific?

- Do unhandled cancellation exceptions terminate the entire process?

- By what interface does the process arrange for cancellations to turn into
  C++ exceptions?

- What is the interaction between POSIX cleanup handlers and exception
  handling? Do the handlers get executed first and then exception processing
  takes place? Or are they somehow nested together? 

- Does POSIX cleanup handling play any role in converting cancellation
  to a C++ exception?

In article , [email protected] suggested:
>On Thu, 16 Dec 1999 22:00:37 -0500, John D. Hickin 
>wrote:
>>David Butenhof wrote:
>>
>>> by. It would still be wrong. You need to use 'extern "C"' to ensure that the
>>> C++ compiler will generate a function with C calling linkage.
>>> 
>>
>>Also this:
>>
>>extern "C" void* threafFunc( void* arg ) {
>>  try {
>>     ...
>>  }
>>  catch( ... ) {
>>    return static_cast(1); // say;
>>  }
>>  return 0;
>>}
>>
>>It is manifestly unsafe to let a C++ exception unwind the stack of a
>>function compiled by the C compiler (in this case, the function that
>>invokes your thread function).
>
>To clarify; what you appear to be saying is that it's a bad idea to allow
>unhandled exceptions to percolate out of a thread function.

Actually, I think what he's saying is stronger than that, and I'd like to 
clarify it, since I'm finally updating a lot of my C++/DCE code to use C++ 
exceptions. He's saying not to *return* from inside a try or catch block, 
since it will force the C++-compiled code to unwind the stack past the C++ 
boundary and back into the C code.

Personally, one of the things that's kept me from using exception handling 
where I could avoid it was that I couldn't find a definitive answer as to 
whether it's safe and efficient to return like that. According to 
Stroustrup, it is, but this points out that it can be tricky in mixed 
environments.

--------
  Scott Cantor              If houses were built the way software
  [email protected]          is built, the first woodpecker would
  Univ Tech Services        bring down civilization.
  The Ohio State Univ            - Anon.
=================================TOP===============================
 Q281: Problems with Solaris pthread_cond_timedwait()?  


In article <[email protected]>,
John Garate   wrote:

> I can't explain why, but I can say that if you call
> pthread_cond_timedwait() with a timeout
> less than 10ms in the future that you'll get return code zero.  Since I
> call it in a loop, the
> loop spins until finally the timeout time is in the past and
> pthread_cond_timedwait() returns
> ETIMEDOUT.  This happens for me on Solaris 2.6.  If you call it with a
> timeout greater than 10ms in the future, it'll return ETIMEDOUT after
> waiting awhile, but it does so slightly BEFORE the requested time, which
> conflicts with the man-page.

This has nothing to do with spurious wakeups from pthread_cond_wait().
It is just a bug in Solaris 2.6 (and Solaris 7 and anytime before):

 BugID: 4188573
 Synopsis: the lwp_cond_wait system call is broken at small timeout values

This bug was fixed in Solaris 8 and is being patched back to Solaris 7.
There are no plans for patching it back to Solaris 2.6 or earlier.

True, the ETIMEDOUT timeout occurs up to a clock tick (10ms) before
the requested time.  This is also a bug, but has not been fixed.
Of course, expecting time granularity better than a clock tick
is not a reasonable expectation.

Roger Faulkner
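
One way to make application code robust against both spurious wakeups and this
kind of bug is the usual predicate loop around pthread_cond_timedwait(); a
sketch (wait_for_flag and its arguments are illustrative, not library API):

#include <errno.h>
#include <pthread.h>
#include <time.h>

int wait_for_flag(pthread_mutex_t *mtx, pthread_cond_t *cv,
                  int *flag, const struct timespec *deadline)
{
    int rc = 0;

    pthread_mutex_lock(mtx);
    while (!*flag && rc == 0)                        /* re-check predicate on */
        rc = pthread_cond_timedwait(cv, mtx, deadline);  /* every wakeup      */
    int done = *flag;
    pthread_mutex_unlock(mtx);

    return done ? 0 : ETIMEDOUT;
}

Because the deadline is absolute, re-waiting after an early or spurious
return costs nothing: the thread simply goes back to sleep until the
predicate is set or the deadline really has passed.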



=================================TOP===============================

 Q282: Benefits of threading on uni-processor PC?

>Can someone please tell me what the benefits
>of threading are when the potential environment
>is a main-stream uni-processor PC?

The benefits are that you can interrupt the execution of some low priority
task to quickly respond to something more important that needs immediate
attention.  That is the real time event processing benefit.

Another benefit is that your program can do something while it is waiting for
the completion of I/O.  For example, if one of your threads hits a page fault,
your program can nevertheless continue computing something else using another
thread.

Those are the two main benefits: decrease the overall running time by
overlapping input/output operations with computation, and to control the
response times to events through scheduling, prioritizing and preemption.

The secondary benefit is that some problems are easy to express using
concurrency which leads to elegant designs.

>Concurrent execution and parallel execution are
>2 different things.  Adding the overhead that you
>get by using multiple threads, looks like a decrease
>in performance...

That depends. Sometimes it is acceptable to eat the overhead. If you have to
respond to an event *now*, it may be acceptable to swallow a predictably long
context switch in order to begin that processing.

>What is all this business about "better utilisation of
>resources" even on uni-processor hardware?

Multitasking was invented for this reason. If you run jobs on a machine in a
serial fashion, it will squander computing resources, by keeping the processor
idle while waiting for input and output to complete. This is how computers
were initially used, until it was realized that by running mixes of concurrent
jobs, the computer could be better utilized.

>All the above is based on NON-network based
>applications.  How does this change when you application
>is related with I/O operations on sockets?

In a networked server application, you have requests arriving from multiple
users.  This is just a modern variant of programmers lining up at the data
centre window to submit punched cards. If the jobs are run one by one, you
waste the resources of the machine. Moreover, even if the resources of the
machine are not significantly wasted, when some programmer submits a very large
processing job, everyone behind has to wait for that big job to complete, even
if they just have little jobs. Moreover, they have to wait even if their
jobs are more important; there is no way to interrupt the big job to run
these more important jobs, and then resume the big job.

The same observations still hold true in a networked server. If you handle
all of the requests serially, you don't make good use of the resources.  You
don't juggle enough concurrent I/O requests to keep the available peripherals
busy, and idle the processor.  Moreover, if a big request comes in that takes a
long time to complete, the processing of additional requests grinds to a halt.

=================================TOP===============================
 Q283: What if two threads attempt to join the same thread?  

On Fri, 18 Feb 2000 22:45:17 GMT, Jason Nye  wrote:
>Hello, all
>
>If a thread, say tX, is running (joinable) and both thread tY and tZ attempt
>to join it, what is the correct behaviour of pthread_join:
 
Any behavior is correct, because the behavior is undefined. A thread may be
joined by only one other thread.  Among acceptable behaviors would be
that of your program terminating with a diagnostic message, or behaving
unpredictably.

=================================TOP===============================
 Q284: Questions with regards to Linux OS?  



>    I have some basic questions with regards to Linux OS
>    1) What types of threads (kernel/user space) and Bottom-Handler can
>exist inside a task-list??

Both kernel and user space threads can go into a *wait queue*.

A *task queue*, though unfortunately named, is something else. A task queue
basically has lists of callbacks that are called at various times. These are
not threads. 

>    2) Can I add a user space thread to a task-list?

You cannot add threads to task queues. You can use a wait queue to block a
thread. This is done by adding a thread to the wait queue, changing its state
to something like TASK_INTERRUPTIBLE (interruptible sleep) and calling
schedule() or schedule_timeout().

>    3) I would like to change a thread's priority within a task-list
>from a bottom handler. How can I do it?

With great difficulty, I suspect. You might be better off just having the
thread adjust its priority just before sleeping on that queue.

=================================TOP===============================
 Q285: I need to create about 5000 threads?  

Efremov Stanislav wrote:
> 
> I need to create about 5000 threads simultaneously (it'll be a powerful
> server on an NT machine; each client invokes its own thread)
> 
> Did anybody write programs like this? Is it a good idea to invoke a thread
> for each connection?

    No, it's really bad.

> I should write this program in Java. Can you also say whether it can be implemented
> with so many threads?

    Not likely.

> I really appreciate your opinion. Thanks in advance.

    Use a more rational design approach. For NT, completion ports would be
a good idea.

    DS
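
For what it's worth, a rough sketch of the completion-port worker loop being
suggested (the per-connection bookkeeping and the overlapped I/O postings are
elided; names are illustrative):

#include <windows.h>

DWORD WINAPI worker(LPVOID arg)
{
    HANDLE port = (HANDLE)arg;
    DWORD nbytes;
    ULONG_PTR key;
    OVERLAPPED *ov;

    for (;;) {
        if (!GetQueuedCompletionStatus(port, &nbytes, &key, &ov, INFINITE)) {
            if (ov == NULL)
                break;                /* the wait itself failed */
            continue;                 /* some connection's I/O failed; clean it up */
        }
        if (ov == NULL)
            break;                    /* NULL packet posted as a shutdown signal */
        /* 'key' identifies the connection; process 'nbytes' of completed I/O,
           then post the next overlapped read/write on that socket. */
    }
    return 0;
}

/* Setup:  port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
   associate each accepted socket with
   CreateIoCompletionPort((HANDLE)sock, port, (ULONG_PTR)conn, 0);
   and create a small, fixed number of threads running worker(port). */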
=================================TOP===============================
 Q286:  Can I catch an exception thrown by a slave thread?

Jan Koehnlein wrote:
> 
> Hi,
> 
> does anyone know if it's possible to catch an exception thrown by a
> slave thread in the master thread using C++?
> 

Yes. But you need some extra infrastructure.

You can do this with RogueWave Threads.h++ using what is called an IOU;
I believe also that the ACE Toolkit may implement something similar that
is called a future (but I havn't looked into that aspect of ACE).

Basically an IOU is a placeholder for a result that is computed
asynchronously. To get the result you redeem the IOU. Then you may:

1) get the result that was previously computed,
2) block, if the result isn't yet available.
3) see an exception raised, if the async thread threw one.

The implementation catches the exception in the async thread and copies
it into a location where the IOU can see it. On redemption it is thrown.

Regards, John.
=================================TOP===============================
 Q287: _beginthread() versus CreateThread()?  

In article <38be0349$0$18568@proctor>, lee  wrote:

% 1.   Why should I use _beginthread() instead of CreateThread() when using
% the c runtime libs ?

Because the people who wrote the library said so. The documented effect
of not using _beginthread() is that you can have per-thread memory leaks,
but I always like to think that the next release will have some catastrophic
problem if you don't do things their way.

% 2.    What can i use the saved thread_id for ? (as opposed to using the
% handle to the thread)

Some functions take a thread id (postthreadmessage comes to mind),
so sometimes you need that. I like to close handles as soon as possible,
so I don't have to keep track of them.

As I recall, you were taking some steps to set up a variable to hold
the thread ID, but you weren't setting it up correctly. That would be
the only reason I mentioned it. If you want to pass NULL, then just
pass NULL.

--

Patrick TJ McPhee
East York  Canada
[email protected]

>thanks - still got a few questions though
>... (forgive me if these questions are stupid - still very much a newbie)
>1.   Why should I use _beginthread() instead of CreateThread() when using
>the c runtime libs ?

First of all, you should not use _beginthread() but _beginthreadex().
The _beginthread() function is completely brain-damaged and should
not be used.

The answer to your question is somewhat involved. 

Some standard C functions have interface semantics that are inherently
non-reentrant, and require thread local storage in order to work reasonably
in a threaded environment.

The _beginthread() function is a wrapper around CreateThread which diverts the
newly created thread to a startup function within the C library. This startup
function allocates thread local resources before calling your start function.
More importantly, the function cleans up these thread local resources when your
start function returns.

If you use CreateThread, control is not diverted through the special start
function, so that if your thread also uses the standard library, causing it to
acquire thread local storage, that storage will not be cleaned up when the
thread terminates, resulting in a storage leak.

At the heart of the problem is Microsoft's brain damaged interface for managing
thread local storage, which doesn't permit a destructor function to be
associated with a thread local object.  Thus if a library needs to be notified
of a terminating thread so it can clean up its thread local resources, it needs
to either provide its own thread creating function that the client must use for
all threads that enter that library; or it must provide explicit thread attach
and detach functions (like COM's CoInitialize and CoUninitialize); or it must
be dynamically linked and use the thread destruction notifications passed
through DllMain.

A related problem is that Microsoft does not view the standard C library as
being an integral component of the Win32 system interface, but it is rather an
add on for Visual C. Thus the Win32 core is not ``aware'' of the C library. 

>2.    What can i use the saved thread_id for ? (as opposed to using the
>handle to the thread)

The handle is much more important; for one thing, it lets you wait on the
thread termination. The _beginthread function calls CreateThread internally
and then immediately closes the handle. The _beginthreadex function
casts the handle to an unsigned long and returns it.



=================================TOP===============================
 Q288: Is there a select() call in Java??  

Not in 1.2. Maybe in the future? 

> Yes, I was just looking at that JSR. It isn't clear if that
> includes select/poll or not.

I submitted a (pretty idiotic) comment on the JSR, and got a very polite
reply from Mark Reinhold that, while it wasn't heavy on detail, was enough
to convince me, when I though about it, that they've got a pretty neat
design that can efficiently subsume select/poll, asynch I/O, SIGIO et al,
and does so much more tidily than anything that I had dreamed up.

> Do you have any pointers to poll() & Java?

The Developers Guide (PostScript document) that comes with JDK 1.2.1_04 (for 
all I know, possibly other versions too, e.g. 1.2.1_03 or 1.2.2_05) talks 
about it, and one of the four packages in that release is SUNWj2dem, which
contains the source code (Java and C) for the demo poller code.  Be warned
that it is Solaris-specific and the antithesis of pure Java...




Hi Bil,

Hope you enjoyed the tip, I really enjoyed your book.  After spending a fair
amount of time playing with InterruptedException, I view interruption as
just another kind of signal.  I almost never use it for interruption per se,
but I have wondered about using it as a "notify with memory," so that even
if the thread isn't waiting right now it can still get the message.

Are you involved with agitating for select() and better-defined
InterruptedIOInterruptions in a future version of Java?  I'll sign the
petition. :-)

-Stu
http://staff.develop.com/halloway




=================================TOP===============================
 Q289: Comment on use of VOLATILE in the JLS.?  

>It is my opinion and my experience that inclusion
>of VOLATILE in Java has lead to nothing but confusion
>and bad coding practices.

*Personally*, I completely agree.


>ANYWAY, I think it would be of some value to include a
>statement of where VOLATILE is and isn't useful and
>examples of same. In particular, that VOLATILE is almost
>always the WRONG thing to use, and that programmers
>should avoid it unless they have a VERY clear understanding
>of the consequences.

We might be able to squeeze something in. However, bear in mind that the 
primary purpose of the JLS is to specify the language semantics, not to 
teach people how to use it.

[Some time after this exchange, the issue of the description of the
memory model required by Java popped up again, led by Lea & Pugh. The gist
of this is that a pile of details will be fixed AND VOLATILE will be
given more adequate semantics, making it POSSIBLE to use correctly.
It will still be *VERY* difficult and should still be avoided.]

*********************************************
Gilad Bracha
Computational Theologist
Sun Java Software
http://java.sun.com/people/gbracha/



=================================TOP===============================
 Q290:  Should I try to avoid GC by pooling objects myself??  

[From a discussion in Java Report]

Dear Dwight, Dr. Kolawa,

In his article, Dr. Kolawa made a number of good points, but he also
said one thing that drives me crazy. I've seen different people give
different versions of this time and time again and it's just not good.
We'll call this "Fear of Garbage."

This is the preoccupation that so many C and C++ programmers
have with memory usage. In those languages, this concern is well-
founded, but with garbage collectors this concern becomes moot.
Dr. K suggests setting a variable to NULL to allow the GC to collect
garbage earlier. While it is certainly true that eliminating one
pointer to an object will make it more likely to be collected earlier,
it's the wrong thing to do.

The whole idea of GCs is that you DON'T spend your time worrying
about temporary usage of memory. Yes, you may indeed increase the
footprint of your program a bit by not worrying about every reference,
and you can always invent a worst case such as in his article, but
normal programming practices will avoid even these. If he had written
his example like this:

displayMomentarily(makeSplashImage());

that monster splash image would have been GC'd as it went out of
scope naturally.

Now it is possible to write programs that stuff more and more data
onto a list, data that will never be used again. And that IS a memory
leak and you do have to avoid doing that. Infinite history lists for
example. But that's a different problem. Dr. K is referring to an
issue concerning singletons which just isn't a problem.

In brief, "If it ain't broke, don't fix it."

-Bil
-- 
================



=================================TOP===============================
 Q291:  Does thr_X return errno values? What's errno set to???  

> 
> Some of the thr_X() man pages seem to say the functions return the errno
> value.
> 
> Is this really correct (instead of returning, say, -1 and then setting
> errno)?
> 
> If it is correct, is errno also set correctly?
> 

Yes, that is correct. They return the error value. And errno is "set
correctly" -- meaning that it has no defined value because it isn't
involved in this API. NB: As a side-effect of some functions, errno
WILL have a value on some systems. But it ain't defined, so don't use it.
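
The POSIX pthread_* functions use the same convention -- the error number
comes back as the return value and errno is left alone -- so a check looks
like this (a minimal sketch; the mutex and function here are invented):

    #include <stdio.h>
    #include <string.h>
    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void locked_work(void)
    {
        int err = pthread_mutex_lock(&lock);

        if (err != 0) {                 /* error number returned, errno untouched */
            fprintf(stderr, "pthread_mutex_lock: %s\n", strerror(err));
            return;
        }
        /* ... do the protected work ... */
        pthread_mutex_unlock(&lock);
    }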



After responding to your message I went to your website. I am worried that
if this is the message you are giving your students, you are doing them a
disservice. They will never be able to write real applications in Java.

For the classes I teach in threading (both POSIX and Java),
I found it useful to have some example programs for the students
to build upon. One of these programs is a server program which
accepts requests on sockets.

I have just finished polishing up the robust, select() version
of the server (both POSIX and Java) and would love to have folks
take a look at it.

-Bil


            POSIX

There are four server programs, each accepts clients on a port 
and handles any number of requests (each request is a byte string
of 70 characters).

> There is the simple server, which is single-threaded.

> There is the master/slave server, which is multithreaded and spawns
  one thread to handle each request.

> There is the producer/consumer server, which is multithreaded, but
  spawns off a new thread to receive requests from each new client
  (replies are done by the pool of consumer threads).

> Finally, there is the select server, which is multithreaded and 
  which has only a single thread doing all of the receiving. The 
  producer thread does a select() on all current clients AND the
  port AND an "interruptor" pipe. When select() returns:

  o Requests from clients go onto a queue for the consumer threads to
    handle. 

  o New connections to the port are accept()'d and new clients are
    created. 

  o Finally, if it is "shutdown" time, a message is sent on the interruptor
    pipe and everyone finishes up and stops.

  This program handles 1k clients, survives client failure, and reliably
  shuts down. (At least it works when *I* test it!)
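
For readers who just want the shape of that producer loop, here is a
compressed sketch (this is not Bil's code; accept_new_client() and
enqueue_request() are hypothetical helpers, and the client bookkeeping
is elided):

    #include <sys/select.h>

    void accept_new_client(int listen_fd);   /* hypothetical helpers */
    void enqueue_request(int fd);

    void producer_loop(int listen_fd, int interruptor_fd,
                       int *client_fd, int n_clients)
    {
        fd_set rfds;
        int i, maxfd, done = 0;

        while (!done) {
            FD_ZERO(&rfds);
            FD_SET(listen_fd, &rfds);
            FD_SET(interruptor_fd, &rfds);
            maxfd = (listen_fd > interruptor_fd) ? listen_fd : interruptor_fd;
            for (i = 0; i < n_clients; i++) {
                FD_SET(client_fd[i], &rfds);
                if (client_fd[i] > maxfd) maxfd = client_fd[i];
            }

            if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0)
                continue;                          /* e.g. EINTR */

            if (FD_ISSET(interruptor_fd, &rfds))
                done = 1;                          /* shutdown message */
            if (FD_ISSET(listen_fd, &rfds))
                accept_new_client(listen_fd);      /* accept() a new client */
            for (i = 0; i < n_clients; i++)
                if (FD_ISSET(client_fd[i], &rfds))
                    enqueue_request(client_fd[i]); /* consumers reply later */
        }
    }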


All these programs have been tested only on Solaris 2.6 on an SS4, but
should run on any UNIX system. The code (along with a pile of other
programs) is located at: 

http://www.lambdacs.com/code/programs_21_Mar_00.tar.gz

in the directory programs/PThreads, there is a Makefile for Solaris
which gets regular use, also Makefiles for DEC, HP, and SGI, which
get a lot less use. Hence:

bil@cloudbase[182]: make

bil@cloudbase[183]: setenv DEBUG    (If you want LOTS of output)

bil@cloudbase[184]:  server_select 6500 100 0 10 30 &
Server_9206(TCP_PORT 6500 SLEEP 100ms SPIN 0us N_CONSUMERS 10 STOPPER 30s KILLER
-1s)
Starting up interation 0.
Server bound to port: 6500
Server up on port 6500. Processed 669 requests. 41 currently connected clients.
Time to stop!
Shutdown successful. 676 replies sent.

...

bil@cloudbase[185]: client 6500 1 1 50      (Better in different window!)
Client_9207(PORT 6500 SLEEP 1ms SPIN 1us N_SOCKETS 50 N_REQUESTS 10000)
Connected to server on port 6500
Connected to server on port 6500
Connected to server on port 6500
Connected to server on port 6500
Client_9207[T@9]    Receiving segments on fd#6...
Client_9207[T@8]    Sending 10000 requests on fd#6...
Client_9207[T@7]    Receiving segments on fd#5...
Client_9207[T@6]    Sending 10000 requests on fd#5...
Client_9207[T@5]    Receiving segments on fd#4...
Client_9207[T@4]    Sending 10000 requests on fd#4...




            Java

The Java program is quite similar and even uses much of the same
C code for select(). (Java does not have an API for select(), so
we have to use the native select() via JNI.) The Java server is
happy to run with the C client & vice-versa.

bil@cloudbase[192]: cd programs/Java/ServerSelect
bil@cloudbase[193]: setenv THREADS_FLAG native
bil@cloudbase[194]: setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:.
bil@cloudbase[195]: setenv CLASSPATH ../Extensions/classes:.

bil@cloudbase[198]: java Server 6500 100 0 10 30
Server(port: 6500 delay: 100ms spin: 0us nConsumers: 10 stopperTimeout 30s)
Server now listening on port 6501
Server up on port 6501. Processed 2113 requests. 10 clients.

Stopping...
Everything stopped.
...



bil@cloudbase[303]: java Client 6500 100 0 10
Client(port: 6500 sDelay: 100 (ms) rDelay: 0 (ms) nClients: 10)

Actual Port: 6501
Client[Thread-0]    Started new sender.
Client[Thread-1]    Started new receiver.
Client[Thread-2]    Started new sender.
Client[Thread-3]    Started new receiver.
...
Client[Thread-0]    Sent: 100 requests.
Client[Thread-6]    Sent: 100 requests.
Client[Thread-8]    Sent: 100 requests.
Client[Thread-14]   Sent: 100 requests.
Client[Thread-4]    Sent: 100 requests.
...


=================================TOP===============================
 Q292: How can I wait on more than one condition variable in one place?  



Condition variables are merely devices to delay the execution of a thread.  You
don't need more than one of these in order to accomplish that task.

Threads that use condition variables actually wait for a predicate to become
true; the evaluation of that predicate is done explicitly in the code, while
holding a mutex, e.g.


    lock mutex

    ...

    while not predicate()
        wait(condition, mutex)

    ...

    unlock mutex

You can easily wait on an array of predicates, provided that they are
protected by the same mutex.  Simply loop over all of them, and if
they are all false, wait on the condition. If some of them are true,
then perform the associated servicing.

>I need pthread analog  for WaitForMultipleObjects (WIN32)  or
>DosWaitMuxWaitSem (OS2)

You should not need this, unless you (or someone else) made the design mistake
of allowing operating system synchronization objects to be used as an interface
mechanism between program components.

If two unrelated objects, say A and B, are both to generate independent events
which must give rise to some processing in a third object C, which has
its own thread, then there is a need for C to be able to wait for a signal
from either A or B.   

The programmer who is oblivious to issues of portability might arrange for
object C to expose two operating system objects, such as Win32 events; have A
and B signal these; and have C do a wait on both objects.

A technique which is easier to port is to use only the programming language
alone to make an interface between A, B and C.  When A and B want to signal C,
they call some appropriate interface methods on C rather than invoking
functions in the operating system specific library. These methods can be
implemented in a number of ways. For example, on a POSIX thread platform, they
can lock a mutex, set one of two flags, unlock a mutex and hit a condition
variable.  The thread inside C can lock the mutex, check both flags and use a
single condition variable to suspend itself if both flags are false.  Even on
Windows, you could handle this case without two events: use a critical section
to protect the flags, and have the thread wait on a single auto-reset event.
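
A minimal POSIX sketch of that interface (all the names here are invented
for illustration):

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
    static int a_event = 0, b_event = 0;

    void signal_from_A(void)            /* called by object A */
    {
        pthread_mutex_lock(&lock);
        a_event = 1;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&lock);
    }
    /* signal_from_B() is identical except that it sets b_event */

    void *c_thread(void *arg)           /* the thread inside object C */
    {
        for (;;) {
            pthread_mutex_lock(&lock);
            while (!a_event && !b_event)
                pthread_cond_wait(&cv, &lock);
            if (a_event) { a_event = 0; /* ... process A's event ... */ }
            if (b_event) { b_event = 0; /* ... process B's event ... */ }
            pthread_mutex_unlock(&lock);
        }
        return NULL;                    /* not reached */
    }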

There is really no need for WaitForMultipleObjects in a reasonable design.
I've never needed to use it in Win32 programming despite having written plenty
of code that must respond to multiple stimuli coming from different sources. 


On Tue, 11 Jul 2000 14:05:40 -0700, Emmanuel Mogenet 
wrote:
>This seems to be a favorite topic (shouldn't be in the FAQ's or something),
>but could someone please elaborate on the following questions:
>
>    1. If I had to implement WaitForMultipleObjects on top of pthreads
>conditions, how would I go about it

One easy way would be to write a miniature kernel which manages the objects
to be waited on. This kernel would protect itself with a mutex. A condition
variable would be used to delay the execution of each incoming thread until
the wait condition is satisfied (either all of the desired objects are in
the ``signalled'' state, or just one or more of them is in that state,
depending on the arguments to the wait):

    lock(mutex)

    while ( none of the objects are signalled )
    wait(mutex, condition)

    unlock(mutex)

To do a more efficient job, you need more condition variables, so you don't
wake up too many threads. You could assign one condition variable to each
thread, or you could take some intermediate approach: have a pool of condition
variables. The wait grabs a condition from the pool and registers it to wait.
When an object is signalled, the implementation then hunts down all of the
condition variables that are engaged in a wait on that object, and signals
them, thereby waking up just the threads that are on that object.

>    2. People seem to consider WaitForMultipleObjects to be an
>ill-designed API, however it looks to me like the semantics is very
>close to that of select, which is to my knowledge considered pretty
>useful by most UNIX programmers.

Yes, however select is for multiplexing I/O, not for thread synchronization.
Also note that select and poll ensure fairness; they are capable of reporting
exactly which objects reported true. Whereas WaitForMultipleObjects returns the
identity of just one object, so the app has to interrogate the state of each
object with wasteful system calls. The WaitForMultipleObjects function is also
restricted to 64 handles, whereas select and poll implementations can take
thousands.


%     1. If I had to implement WaitForMultipleObjects on top of pthreads
% conditions, how would I go about it

In general, you can wait for N items by kicking off N threads to do the
waiting and signalling from those waiters.

If you're waiting for events which will be signaled through CVs, and you
control the code for these things, have them all signal the same CV. You
can still test for a lot of things:
 pthread_mutex_lock(&mux);
 while (!a && !b && !c && !d)
   pthread_cond_wait(&cv, &mux);
 pthread_mutex_unlock(&mux);

If you're waiting for things which can be treated as file handles, you
can use poll() or select().


%     2. People seem to consider WaitForMultipleObjects to be an
% ill-designed API, however it looks to me like the semantics is very
% close to that of select, which is to my knowledge considered pretty
% useful by most UNIX programmers.

Like select(), it places an arbitrary limit on the number of things you can
wait for, so it can be useful as long as your needs don't go beyond those
limits. I think some people don't see the point of this in a multi-threaded
program.




=================================TOP===============================
 Q293: Details on MT_hot malloc()?  

There are a number of malloc() implementations which scale better
than the simple, globally locked version used in Solaris 2.5 and
earlier. A good reference is:

http://www.ddj.com/articles/2001/0107/0107toc.htm

Some comments by the author:

If you quote them in the FAQ, make sure to make a note that these
opinions are my personal ones, not my employer's.

As I tried to describe in my DDJ article, there is no best malloc
for all cases.

To get the best version, I advise the application developers
to try different versions of mt-hot malloc with their specific
app and typical usage patterns and then select the version working
best for their case.

There are many mt-hot malloc implementations available now. Here
are my comments about some of them.

* My mt-hot malloc as described in the DDJ article and the patent.

It was developed first chronologically (as far as I know). It works
well when the malloc-hungry threads mostly use their own memory.
It also uses a few other assumptions described in my DDJ paper.

The main malloc algorithm in my mt-hot malloc is the same binary
search tree algorithm used in the default Solaris libc malloc(3C).

* mtmalloc(3t) introduced in Solaris 7.

I can't comment on this version, other than to say that it's
totally different from my mt-hot malloc implementation.

* Hoard malloc

It's famous, but my test (described in the DDJ article) did not
scale with Hoard malloc at all. It appears that their realloc()
implementation is broken; at least it was in the version available
at the time of my testing (spring 2001). I've heard reports from
some Performance Computing people (who use Fortran 90 and no
realloc()) that Hoard malloc has helped their scalability very well.

Also, IMHO the Hoard malloc is too complicated, at least for the
simple cases using the assumptions described in my DDJ article.

* ptmalloc (a part of GNU libc)

I have not tested ptmalloc, so I can't comment on it.

* Smart Heap/SMP from MicroQuill

My tests of Smart Heap/SMP were not successful.

-Greg Nakhimovsky
=================================TOP===============================
 Q294: Bug in Bil's condWait()?  

In my Java Threads book I talk about how you can create
explicit mutexes and condition variables that behave like
POSIX. I note that you'll probably never use these, but it's
useful to think about how to build them and how they work.

Later on, I talk about the complexities of InterruptedException
and how to handle it. It's a tricky little devil. One of the
possible approaches I mention is to refuse to handle it at
all, but rather catch it, then re-interrupt yourself when
leaving your method. Hence allowing another method to see it.

A fine idea. And most of my code for this is correct.  Richard Carver
(George Mason University in Fairfax, VA, (where I grew up!))
pointed out a cute little bug in one bit of my code.

Here it is:

I wrote condWait like this:


public void condWait(Mutex mutex) {
  boolean       interrupted = false;


  synchronized (this) {
    mutex.unlock();
    while (true) {
      try {
        wait();
        break;
      }
      catch (InterruptedException ie) {interrupted=true;}
    }
  }

  mutex.lock();
  if (interrupted) Thread.currentThread().interrupt();
}


which is BASICALLY correct. If you get interrupted, you set a
flag, then go back and wait again. When you get signaled, you
call interrupt() on yourself and all is well.

UNLESS... You get interrupted, you wait to get the synchronization
lock AND just at that moment, someone calls condSignal() on the
CV. Guess what! You're not on the CV's sleep queue anymore and
you miss the signal! Bummer!

Of course, not only is this unlikely to happen, it WON'T happen at
all on JDK 1.1.7. But it could if JDK 1.1.7 had been written
differently, and other JVMs are written differently.

ANYWAY, if you followed all that (and if you find this interesting)
here's the solution. The first interrupt will be treated as a
spurious wakeup, but it won't repeat. (Unless I've missed
something else!)


public void condWait(Mutex mutex) {
  boolean       interrupted = false;

  if (Thread.interrupted()) interrupted=true;

  synchronized (this) {
    mutex.unlock();
    try {
      wait();
    }
    catch (InterruptedException ie) {interrupted=true;}
  }

  mutex.lock();
  if (interrupted) Thread.currentThread().interrupt();
}



--

=================================TOP===============================
 Q295:  Is STL considered thread safe??  

This should probably be a FAQ entry.  Here's the answer I gave 2 months ago
to a similar question:

In general, the desired behavior is that it's up to you to make your
explicit operations on containers, iterators, etc thread safe.  This is good
because you might have several containers all synchronized using the same
locking construct, so if the implementation used individual locks underneath
it would be both wrong and expensive.  On the other hand, implicit
operations on the containers should be thread safe since you can't control
them.  Typically these involve memory allocation.  Some versions of the STL
follow these guidelines.

Look at the design notes at
http://www.sgi.com/Technology/STL/thread_safety.html

Jeff

---------

There is no such thing as *the* STL library. It is an abstract interface
defined in the C++ standard. 

There is no mention of threads in the C++ standard, so the STL is not required
to be thread safe.

To find out whether your local implementation of the STL is thread safe,
consult your compiler documentation.

For an STL implementation to be useful in a multithreaded programming
environment, it simply has to ensure that accesses to distinct containers
do not interfere. The application can ensure that if two or more threads
want to access the same container, they use a lock.

I believe that the SGI and recent versions of the Plauger STL (used
by VC++) are safe in this way.


Hi Cheng,

I'm going to post yet another answer: the term 'Thread-safe' is usually a very
difficult term to understand completely. There is absolutely no way to guarantee
that a given library/software package is 100% thread safe because it all depends
on how you use it.

An example of what I mean is shown below:

class Point2D {
public:
    ...

    void setX(double value)
    {
        lock.acquire();
        _x = value;
        lock.release();
    }

    void setY(double value)
    {
        lock.acquire();
        _y = value;
        lock.release();
    }

    double x() const
    {
        double tmp;
        lock.acquire();
        tmp = _x;
        lock.release();
        return tmp;
    }

    double y() const
    {
        double tmp;
        lock.acquire();
        tmp = _y;
        lock.release();
        return tmp;
    }

private:
    mutable Mutex lock;
    double _x, _y;
};

While the above code can be considered 'thread-safe' to a certain extent, it is
possible for it to be used incorrectly. An example is if one thread wants to move
the point (we'll call it 'pt' here):

    pt.setX(100.0);
    pt.setY(20.0);

The Point2D code guarantees that if another thread happens to look at pt's values
that it will receive well defined values and if another thread modifies the
values that it will be blocked appropriately and the two threads will not clobber
the pt object. BUT.... The above two lines do NOT guarantee that the update of
the point is atomic, which in the case of the above example is more important
than the Point2D being thread-safe. We can change Point2D to have a set(double x,
double y) and a get(double & x, double & y), but these are awkward and they make
the Point2D aware of threads when it should not be aware of them at all.
Therefore, in my opinion, the best design to overcome all the above problems is
to use a Point2D class that contains no locks and we use an externally associated
lock to guard the Point2D object. This way, Point2D is useful in all types of
applications -- including non-threaded applications AND we have the ability to
lock a set of operations on the object to make them appear atomic
(transaction-like, if you will).

That being said, here is an example of how I would use the Point2D object (we'll
use the same class declaration as above, minus the lock):

class Point2D {
public:
    ...

    void setX(double value)
    {
        _x = value;
    }

    void setY(double value)
    {
        _y = value;
    }

    double x() const
    {
        return _x;
    }

    double y() const
    {
        return _y;
    }

private:
    double _x, _y;
};

Now, for the usage:

    // This thread (Thread A) updates the object:
    ....
    ptLock.acquire();
    pt.setX(100.0);
    pt.setY(20.0);
    ptLock.release();
    ....

    // This thread (Thread B) reads the information:
    ...
    ptLock.acquire();
    if (pt.x() > 10.0)
        // Do something rather uninteresting...
    if (pt.y() < 10.0)
        // Do something else rather uninteresting...
    ptLock.release();
    ...

Now, both the lock and the Point2D object are shared between two threads and the
above modification of the pt instance is seen as atomic -- there is no chance
for a thread to view that x has been updated but y has not.

*PHEW*. All that being said, it may be clear now that when writing an
implementation of the STL, it is a good idea to consider threading as little as
possible. Usually, the only considerations that should be made are to ensure that
all functions are reentrant and that threads working on different instances of
containers will not clobber each other since they are not sharing any data --
this is usually achieved by making sure there are no static members for STL
containers. Some poor implementations of the commonly used rb_tree use static
members and a 'large' mutex, which causes no end of problems (anywhere
from link errors to much unnecessary overhead). A good implementation of the STL
should use 0 locks. Remember that the STL is a set of containers and algorithms.
It was correctly left up to the user of the STL to implement locking so they can
do it in the way they see fit for the problem they are solving. BTW, SGI has an
excellent implementation of the STL and they explain their design decisions on
their STL page (you can find it on their site).

Hope this provides some insight,
Jason

=================================TOP===============================
 Q296:  To mutex or not to mutex an int global variable ??  

Frank,

Nice *idea*, but no marbles. :-(

Lots of variations of this idea are proposed regularly, starting with
Dekker's algorithm and going on and on. The problem is that you
are assuming that writes from the CPU are arriving in main memory
in order. They may not.

On all high performance CPUs today (SPARC, Alpha, PA-RISC, MIPS,
x86, etc.) out-of-order writes are allowed. Hence in your example below,
it is POSSIBLE that only one of the values will be updated before the
pointer is swapped.

Bummer, eh?

-Bil

"Use the mutex, Luke!"


> To avoid locks you might try the following trick:
> Replace the two ints by a struct that contains total and fraction. The global
> variable would then be a pointer to the struct. To modify the variables, the writer
> would use a temporary struct, update the value(s) and then swap the pointers of the
> global pointer and the temporary global pointer. This works assuming that a pointer
> write is an atomic operation (which is the case in all architectures I know).
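
For contrast, a minimal sketch of the mutex version Bil is recommending
(the struct and function names here are invented):

    #include <pthread.h>

    struct totals { int total; int fraction; };

    static struct totals   shared;
    static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

    void writer_update(int t, int f)
    {
        pthread_mutex_lock(&shared_lock);
        shared.total    = t;
        shared.fraction = f;
        pthread_mutex_unlock(&shared_lock);
    }

    void reader_snapshot(struct totals *out)
    {
        pthread_mutex_lock(&shared_lock);
        *out = shared;          /* both fields are observed consistently */
        pthread_mutex_unlock(&shared_lock);
    }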


Hi,
    I don't know if you are still the maintainer of
the comp.programming.threads FAQ but I was reading Q63
trying to find a good way of using threads in C++ and
the suggestions were really good, but seemed to be from
a windows perspective.  It took me a little while to
translate what was there to something that I could use
and it is pretty much the same thing but here it is
just in case you wanted to include it in the FAQ:

class PThread {
public:
  PThread() { pthread_create(&thr, NULL, (void *(*)(void *))thread, this); }
  static void thread(PThread *threadptr) { threadptr->entrypoint(); }
  void join() { pthread_join(thr, NULL); }
  virtual void entrypoint() = 0;
private:
  pthread_t thr;
};


--------------------------
David F. Newman
[email protected]
--------------------------
If you think C++ is not overly complicated, just what is a protected
abstract virtual base pure virtual private destructor, and when
was the last time you needed one?
                -- Tom Cargil, C++ Journal.


=================================TOP===============================
 Q297: Stack overflow problem ?  

BL> Yes, as far as I know EVERY OS has a guard page. [...]
BL> Be aware that if you have a stack frame which is larger than a 
BL> page (typically 8k), it is POSSIBLE to jump right over the guard
BL> page and not see a SEGV right away.

KK> The solution to that is to initialize your locals before calling
KK> lower-level functions. [...]

This is an inadequate solution for the simple reason that one isn't
guaranteed that variables with automatic storage duration are in any
particular order on the stack.  The initialisations themselves could
cause out of order references to areas beyond the guard page, depending
on how the compiler chooses to order the storage for the variables.

The correct solution is to use a compiler that generates code to perform
stack probes in function prologues.

For example: All of the commercial C/C++ compilers for 32-bit OS/2
except for Borland's (i.e. Watcom's, IBM's, and MetaWare's) generate
code that does stack probes, since 32-bit OS/2 uses guard pages and a
commit-on-demand stack.  In the function prologue, the generated code
will probe the stack frame area at intervals of 4KiB starting from the
top, in order to ensure that the guard pages are faulted in in the
correct order.  (Such code is only generated for functions where the
automatic storage requirements exceed 4KiB, of course.)

Microsoft Visual C/C++ for Win32 generates stack probe code too.  This
is controlled with the /Ge and /Gs options.  

As far as GCC is concerned, I know that EMX C/C++ for OS/2, the original
GCC port for OS/2, had the -mprobe option, although I gather that in
later ports such as PGCC this has been superseded by the
-mstack-arg-probe option.  (See http://goof.com/pcg/os2/differences.html
.) Whether there is the same or an equivalent option in the GCC ports to
other platforms I don't know. 



=================================TOP===============================
 Q298: How would you allow the other threads to continue using a "forgotten" lock?  

Bhavin Shah wrote:
> 
> Hi,
> 
> Sorry if this is a faq, but I haven't found a solution yet:
> In a multi-threaded app, say you have a thread acquire a lock,
> update global variables, then release the lock.  What if for
> some reason (granted that updating a couple variables shouldn't
> take much time), the thread crashes before releasing the lock?
> I don't see a way to set a timeout on acquiring locks.  How would
> you allow the other threads to continue using that same lock?

    You wouldn't.  What's more, you shouldn't!  The dead thread
holding the lock may have left the lock-protected data structures
in an inconsistent or incorrect state, in effect planting a
poison pill for any other thread which might attempt to use
them later on.  "Overriding" a lock is the same as "ignoring" a
lock -- you know that the latter is dangerous, so you should
also understand that the former is equally dangerous.

    There's another peculiar notion in your question, too: that
of "a thread crashing."  The fact that Thread A takes a bus
error or something does *not* mean that Thread A was "at fault,"
and does *not* mean that Threads B-Z are "healthy."  If any
thread at all gets into trouble, you should start with the
supposition that the entire program is in trouble; you shouldn't
think in terms of getting rid of the "offending" thread and
trying to carry on as usual.  Poor but possibly helpful analogy:
if a single-threaded program crashes in Function F(), would it
be appropriate to replace F() with a no-op and keep on going?


=================================TOP===============================
 Q299: How unfair are mutexes allowed to be?  

> Hi
>
>   If several threads are trying to gain access through a mutex,
> how unfair are mutexes allowed to be?  Is there any requirement
> for fairness at all (ie that no threads will be left unluckily starving
> while others get access).
>
> The code needs to work on many posix platforms, are any of them
> unfair enough to merit building an explicit queueing construct
> like ACE's Token?
>
> thanks,
>
> Jeff

Assume the worst and you'll be safe (and probably correct). If your
program looks like this:

       T1
while (true) {
  lock()
  i++
  unlock()
}

     T2
while (true) {
  do stuff for 10ms()

  lock()
  if (i == N) do other stuff()
  unlock()
}

You can be fairly certain T2 ain't never gonna get that lock more
than once every 1,000+ iterations. But this is pretty fake code and
T1 shouldn't look like that. T1 should also "do stuff for Xms" outside
of the critical section, in which case you're safe.

Think about it: The only time there's a problem is when a thread
keeps the mutex for all but a tiny fraction of the time (like T1). And
that would be odd.

IMHO

-Bil
 

=================================TOP===============================
 Q300: Additionally, what is the difference between -lpthread and -pthread? ?  

On 01 Aug 2000 15:01:23 -0500, Aseem Asthana  wrote:
>
>
>Hi,
>
>>That is incorrect. With modern gcc, all you need is the -pthread option.
>>If your gcc doesn't recognize that, then you need:
>
>>    gcc -o hello -D_REENTRANT hello.c -lpthread
>
>>Without the _REENTRANT, certain library features may work incorrectly
>>under threads.
>
>the -D_REENTRANT option, is it to link your programs with the thread safe
>standard libraries, or something else?

No, the -D_REENTRANT option makes the preprocessor symbol _REENTRANT known
to your system headers. Some code in the system headers may behave differently
when _REENTRANT is defined. For example, quite typically, the errno macro
has a different definition in threaded programs in order to give each thread
its own private errno. Without _REENTRANT, accesses to errno might go to one
global one, resulting in race conditions.

>Additionally, what is the difference between -lpthread and -pthread?

-lpthread is the directive to link in the library libpthread.a or a shared
version thereof.

-pthread is a special command line option supported by modern gcc, and
some other compilers as well, which sets up all the correct options for
multithreaded building, like -D_REENTRANT and -lpthread.

Subject: Re: problems with usleep()
Kauser Ali Karim wrote:
> 
> Hi,
> 
> I'm using the usleep() function to delay threads and I get a message:
> 
> Alarm clock
> 
> after which my program exits prematurely since the pthread_join that I
> call at the end is not executed.
> 
usleep requests that the process gets woken up after a time.  This wakeup is
a SIGALRM, which is causing your program to exit.

Use nanosleep; it is preferred for threaded apps.
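
For example, a 50 ms delay with nanosleep() looks like this (a sketch; the
interval chosen here is arbitrary):

    #include <time.h>

    /* sleep for 50 ms without involving SIGALRM */
    void short_delay(void)
    {
        struct timespec delay = { 0, 50 * 1000 * 1000 };

        nanosleep(&delay, NULL);
    }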

    Ian


=================================TOP===============================
 Q301: Handling C++ exceptions in a multithreaded environment?  

On Fri, 4 Aug 2000 09:52:50 -0400, Bruce T  wrote:
>Hello,
>
>I am writing code in C++ in a multithreaded system.  Can anyone point me to
>any
>links or articles on any special strategies/concerns or examples of handling
>C++ exceptions in a multithreaded environment.

Assuming that your compiler supports thread-safe exception handling, you can
simply use the language feature as you normally would.

There aren't really any special concerns. (You wouldn't worry whether function
calls, returns or gotos are a problem under threading, right? So why fuss
over exceptions).

Avoid common misconceptions, like wanting to throw an exception from one thread
to another. This simply isn't a valid concept, since an exception is a
branching mechanism. Branching occurs within the thread of control, by
definition: it's a change in one thread's instruction pointer, so to speak.

(That's not strictly true: function calls *can* take place between threads,
processes or machines through remote procedure calling. An analogous mechanism
can be devised to pass exceptions; e.g. you catch the exception on the 
RPC server side, package it up, send a reply message to the client, which
unpacks it and rethrows.)



=================================TOP===============================
 Q302:  Pthreads on IRIX 6.4 question?   

[email protected] wrote:
> 
> Hello,
> 
>    I am having problems with Pthreads on IRIX 6.4. I have two
> threads: the initial thread plus one that has been pthread_created.
> The pthread_created pthread does an ioctl and sits in a driver waiting
> for an event. While this is happening the "initial" thread should be
> eligible to run, but it is put to sleep, i.e. doesn't run. Why? On IRIX,
> what kind of LWP notion is there?

Default scheduling scope is process on IRIX 6.4, and the number of
execution vehicles is determined by the pthread library -- typically,
you'll start with one execution vehicle unless the library detects all
your threads can run in parallel and consume CPU resources. 

But the latest pthread patches for IRIX 6.4 would, in my experience,
create an extra execution vehicle on the fly in the case you describe,
so I'd certainly recommend you to get the *LATEST* set of POSIX
recommended patches.

You can, in 6.4, use pthread_setconcurrency to give hints as to how many
kernel execution vehicles you want. You can also run with system scope
threads using pthread_attr_setscope (giving you one kernel execution
vehicle per thread), but on IRIX this requires CAP_SCHED_MGT
capabilities, as process scope threads in IRIX can schedule themselves
at higher priorities than some kernel threads (see man capabilities).

In 6.5.8, you have PTHREADS_SCOPE_BOUND_NP (incorrectly referred to as
PTHREADS_SCOPE_BOUND in the headers) scope, which gives you what Solaris
and Linux system scope threads are -- one execution vehicle per thread,
but no extra scheduling capabilities (hence no need to do fancy stuff
with capabilities to run this as non-root user); blocking in one thread
is guaranteed not to interfere with immediate availability of kernel
execution vehicles for other threads.
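
For reference, the concurrency hint mentioned above is a single call (a
sketch; the level chosen here is arbitrary):

    #include <pthread.h>

    pthread_setconcurrency(4);    /* hint: about 4 kernel execution vehicles */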
 

Frank Gerlach wrote:
> 
> I also had the problem of pseudo-parallelity on Solaris 2.6. Only after
> calling  pthread_attr_setscope() the threads would *really* execute in
> parallel. Maybe that helps with your problem..
> 
>              pthread_attr_init(&attr);
>              pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
>              pthread_attr_setschedpolicy(&attr, SCHED_OTHER);
>              int retv = pthread_create(&tids[i], NULL, threadfunc, &ta[i]);
> 

As I also said, on IRIX, the closest scope to this one is
PTHREAD_SCOPE_BOUND (actually, ...BOUND_NP, but the header is wrong in
6.5.8). PTHREAD_SCOPE_SYSTEM threads can do much more wrt scheduling in
IRIX, and as a result require CAP_SCHED_MGT capabilities.

*Latest* pthread patch sets, though, usually don't have too many
problems with making extra kernel execution vehicles to avoid deadlocks
(for normal process scope threads) -- in the words of the man page for
pthread_setconcurrency:


    Conversely the library will not permit changes to the concurrency level
to create starvation.  Should the application set the concurrency level to n
and then cause n threads to block in the kernel the library will activate
additional execution vehicles as needed to enable other threads to run.  In
this case the concurrency level is temporarily raised and will eventually
return to the requested level.


Earlier flavours of the pthread library may have more problems to
actually guess whether extra execution vehicles are needed.





=================================TOP===============================
 Q303: Threading library design question ?  

Some people say semaphores are the concurrent-processing equivalent of
GOTO....
Still, semaphores are very useful and sometimes even indispensable.
(IMO goto is sometimes also a good construct, e.g. in state machines)

A useful construct might be ReaderBlock and WriterBlock classes, which take
a ReadWriteMutex as a constructor argument and can be used similarly to the
synchronized construct of Java. Those classes lock the mutex in the
constructor and unlock it in their destructor, avoiding explicit unlocking
AND making exception handling easy. The latter is especially important, as I cannot
think of an elegant way to unlock a mutex in case an exception is thrown,
which will be handled in a calling method.

In general, one could provide a list of synchronization constructs in
ascending order of complextity/danger for novice users. The
Reader/Writerblock is quite harmless, even if you do not think about its
consequences.
Still, you can easily deadlock your program by using two mutexes and
acquiring them in opposite order.

My feeling is that concurrent programming contains inherent complexity,
which cannot be eliminated.

As a final input, automatic deadlock detection in debug mode would be a
simple, but great feature for both C++ libs and Java VMs (unfortunately SUN
does not provide this in their VMs).

Beman Dawes wrote:

> There is discussion on the boost mailing list (www.boost.org) of design
> issues for a possible C++ threading library suitable for eventual
> inclusion in the C++ standard library.
>
> Some suggest starting the design with very low-level primitives, and
> then using these to build higher-level features.  But like a goto
> statement in programming languages, some low-level features can be
> error-prone and so should not always be exposed to users even if present
> in the underlying implementation.
>
> So here is a question where comp.programming.threads readers probably
> have valuable insights:
>
> What features should be excluded from a threading library because they
> are known to be error-prone or otherwise dangerous?  What are the
> threading equivalents of goto statements?
>
> --Beman Dawes 


=================================TOP===============================
 Q304:  Lock Free Queues?   

On Thu, 10 Aug 2000 03:55:25 GMT, J Wendel  wrote:
>
>
>
>I wonder if any of you smart guys would care to enlighten me
>about "lock free" algorithms. I've found several papers on
>the Web, but to be honest, I'm having a little trouble
>following the logic.

Lock free algorithms do actually rely on atomic instructions provided by the
hardware. So they are not exactly lock free.

For example, a lock-free queue can be implemented using an atomic compare-swap
instruction to do the pointer swizzling.

The idea is that the hardware provides you with a miniature critical region in
the form of a special instruction which allows you to examine a memory
location, compare it to a value that you supply, and then store a new value if
the comparison matches.  The instruction produces a result which tells you
whether or not the store took place.  The instruction cannot be interrupted,
and special hardware takes care that the memory can't be accessed by other
processors.

Here is an illustration. Suppose you want to push a new node onto the lock-free
list. How do you do that? Well, you set your new node's next pointer to point
to the current head node. Then you use the compare-swap to switch the head node
to point to your new node! If it succeeds, you are done. If it fails, it means
that someone else succeeded in pushing or popping before you were able to
execute the instruction. So you must simply loop around and try again.  The
subject of the comparison is simply to test whether the head node still has the
original value.

Pseudo code:

    node *head_copy;

    do {
        head_copy = head;
        newnode->next = head_copy;
    } while (!compare_and_swap(&head, head_copy, newnode));

The compare_and_swap simply behaves like this, except that it's
implicitly atomic:

    int compare_and_swap(node **location, node *compare, node *newval)
    {
        /* lock whole system */

        if (*location == compare) {
            *location = newval;

            /* unlock whole system */
            return 1;   
        }

        /* unlock whole system */
        return 0;
    }
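
For a modern comparison (this is not from the original post), the same push
can be written with C11 atomics, which map onto the hardware compare-and-swap
where one exists:

    #include <stdatomic.h>

    struct node { struct node *next; /* ... payload ... */ };

    static _Atomic(struct node *) head;

    void push(struct node *newnode)
    {
        struct node *old = atomic_load(&head);
        do {
            newnode->next = old;
            /* on failure, atomic_compare_exchange_weak reloads 'old'
               with the current head, so we just loop and retry */
        } while (!atomic_compare_exchange_weak(&head, &old, newnode));
    }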

>Can someone explain why "lock free" algorithms don't seem to
>be in widespread use? I've got a work queue based server
>that would benefit from less locking overhead.

They are probably in more widespread use than you might suspect.  However,
there is no portable, standard interface for constructing these things. They
rely on support from the hardware which is not found on all architectures!

These kinds of techniques are more in the domain of the developers of operating
systems and system interface libraries who can use them to construct the
higher level synchronization primitives.

You might find these algorithms used in the implementation of mutexes and
other kinds of objects.
 


=================================TOP===============================
 Q305: Threading library design question ?  
[ OK, so I'm reading a little behind... ]

In article <[email protected]>, Beman Dawes   wrote:
>What features should be excluded from a threading library because they
>are known to be error-prone or otherwise dangerous?  What are the
>threading equivalents of goto statements?

I think the most commonly asked-for feature that shouldn't be in a thread
library is suspend/resume.  It's amazing how many people believe they
want to arbitrarily suspend another thread at some random place in its
execution.

Yes, there are things you can do with suspend/resume that are Very
Difficult without, but it's one of those places where the bugs that
can be introduced are very subtle.

For a quick example, suppose I suspend a thread that's inside a library
(say, stdio) and has a mutex locked (say, stdout or stderr).  Now nobody
can say anything without blocking on the mutex, and the mutex won't
come back until the thread is resumed.

Permute the above with a few dozen thread-aware libraries, and you get
into *serious* trouble.

The other thing that I hear a bunch (this may be unique to the embedded
real-time market that I play in) is to disable context switching from
user space.  It implodes instantly in the face of multi-processor
systems, and effectively elevates the calling thread to ultimate
priority.

But those are just my favorites.
-- 
Steve Watt KD6GGD  PP-ASEL-IA          ICBM: 121W 56' 57.8" / 37N 20' 14.9"
 Internet: steve @ Watt.COM                         Whois: SW32
   Free time?  There's no such thing.  It just comes in varying prices... 



=================================TOP===============================
 Q306:  Stack size/overflow using threads ?  

In article <[email protected]>, Jason Jesso   wrote:
% -=-=-=-=-=-

% I just began writing a threaded program using pthreads on AIX 4.3 in C.
% 
% In a particular thread I create two 60K arrays as local variables.
% My program crashes in irregular places within this thread
% and I do believe in the "Principle of Proximity".
% 
% My hunch is stack corruption, since when I place "any" one of these two
% arrays as global my program runs fine.
% 
% Could it be possible that I am overflowing the stack space for this
% thread?

Yes. Threads typically have fixed-size stacks. If thread A has a stack
that starts at 0x400000 and thread B has a stack that starts at 0x500000,
then A's stack can't be any bigger than 0x100000, or else it would over-
write B's. POSIX doesn't specify a default stack size, and it varies
from system to system. You can set the stack size by calling
pthread_attr_setstacksize. You can find out the default stack size by
calling pthread_attr_getstacksize on a freshly initialised attr structure.

From memory, AIX gives about 90k of stack by default, so you probably need
to knock it up a bit. Other systems have different limits. Solaris gives
1M, HP-UX 64k, TRU64 (sic) Unix gives ~20k, Linux gives 1M, and FreeBSD
gives 64k (again, this is working from memory, so don't rely on it).

--

Patrick TJ McPhee
East York  Canada 

Patrick TJ McPhee wrote:

> In article <[email protected]>, Jason Jesso   wrote:
> % -=-=-=-=-=-
>
> % I just began writing a threaded program using pthreads on AIX 4.3 in C.
> %
> % In a particular thread I create two 60K arrays as local variables.
> % My program crashes in irregular places within this thread
> % and I do believe in the "Principle of Proximity".
> %
> % My hunch is stack corruption, since when I place "any" one of these two
> % arrays as global my program runs fine.
> %
> % Could it be possible that I am overflowing the stack space for this
> % thread?
>
> Yes. Threads typically have fixed-size stacks. If thread A has a stack
> that starts at 0x400000 and thread B has a stack that starts at 0x500000,
> then A's stack can't be any bigger than 0x100000, or else it would over-
> write B's. POSIX doesn't specify a default stack size, and it varies
> from system to system. You can set the stack size by calling
> pthread_attr_setstacksize. You can find out the default stack size by
> calling pthread_attr_getstacksize on a freshly initialised attr structure.

The pthread_attr_getstacksize() is a nice trick, but it's not portable.
Unfortunately, while you're correct that POSIX doesn't specify a default stack
size, you're underestimating the true extent of the lack of specification. Far
beyond not specifying a default size, it doesn't even specify what the default
size MEANS. That is, POSIX never says that the default value of the stacksize
attribute is the number of bytes of stack that will be allocated to a thread
created using the default attributes. It says that there IS a default, and that,
if you ask, you'll get back a size_t integer. Because you're not allowed to set
any value smaller than PTHREAD_STACK_MIN, one can play tricks. Solaris, for
example, has a default value for the stacksize attribute of "0". But that doesn't
mean 0 bytes, it means "default". Nobody can actually ask for 0 bytes, so there's
no ambiguity when pthread_create() is called. One can call this "creative" or
"devious", but it's perfectly legal.

> From memory, AIX gives about 90k of stack by default, so you probably need
> to knock it up a bit. Other systems have different limits. Solaris gives
> 1M, HP-UX 64k, TRU64 (sic) Unix gives ~20k, Linux gives 1M, and FreeBSD
> gives 64k (again, this is working from memory, so don't rely on it).

Tru64 UNIX V5.0 and later gives 5Mb by default. Earlier versions, stuck without
kernel support for uncommitted memory, were forced to compromise with far smaller
defaults to avoid throwing away bushels of swap space. [It was actually more like
24Kb, I think, (which was actually just fine for the vast majority of threads),
but that's hardly relevant.]

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation   http://members.aol.com/drbutenhof |
| 110 Spit Brook Rd ZKO2-3/Q18, Nashua NH 03062-2698              |
\--------[ http://www.awl.com/cseng/titles/0-201-63392-2/ ]-------/
 

[email protected] wrote:

> Try adjusting the stack size for the thread:
>
>         static pthread_attr_t thread_stack_size;
>
>         pthread_attr_init (&thread_stack_size);
>         pthread_attr_setstacksize (&thread_stack_size, (size_t)81920);

I recommend NEVER using an absolute size for the stack. It's not portable, it's
not upwards compatible. It's just a number that really means practically nothing
-- and even less on anything except the exact software configuration you used to measure.
(And even then only as good as the accuracy and thoroughness of your
measurements... and measuring runtime stack depth is not easy unless your program
has only one straightline code path.)

Of course, you may not have much choice...

>         pthread_create (&thread, &thread_stack_size, thread_func, 0);
>
> You can check the default size with:
>         static pthread_attr_t thread_stack_size;
>
>         pthread_attr_init (&thread_stack_size);
>         pthread_attr_getstacksize (&thread_stack_size, &ssize);
>
> I ran into this problem on Digital when I had a thread call a deeply nested
> function.  All auto-variables will be allocated from the thread's stack, and
> so it is a good idea to know how much memory your thread function will consume
> beforehand.  Hope this helps.

That works fine on Tru64 UNIX (or the older Digital UNIX and DEC OSF/1 releases),
but it's not portable, or "strictly conforming" POSIX. There's no definition of
what the default value of stacksize means. (An annoying loophole, but some
implementations have exploited it fully.)

In fact, on Tru64 UNIX, or on any implementation where you can get the default
stack size from pthread_attr_getstacksize, I recommend that you make any
adjustments (if you really need to make adjustments) based on that value. Not
quite big enough? Double it. Triple it. Square it. Whatever. If the thread
library suddenly starts using an extra page at the base of each stack, the
default stack size will probably be increased to keep pace -- your arbitrary
hardcoded constant won't change, and you'll be in trouble.
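
On an implementation where pthread_attr_getstacksize() does report a
meaningful default, that advice looks roughly like this (a sketch;
error checking is omitted):

    #include <pthread.h>

    void start_big_stack_thread(void *(*thread_func)(void *))
    {
        pthread_attr_t attr;
        size_t         dflt;
        pthread_t      tid;

        pthread_attr_init(&attr);
        pthread_attr_getstacksize(&attr, &dflt);
        pthread_attr_setstacksize(&attr, dflt * 2);  /* "not big enough? double it" */
        pthread_create(&tid, &attr, thread_func, NULL);
        pthread_attr_destroy(&attr);
    }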

Furthermore, on Tru64 UNIX 5.0 and later, you'll be doing yourself a disservice
by setting a stack size. The default is 5Mb... and if THAT's not enough for you,
you need to have your algorithms examined. Solaris (and I believe Linux) use 1Mb,
which ought to be sufficient for most needs.

In general, be really, really careful about adjusting stack size. If you need to
increase any thread from the default, you should consider making it "as big as
you can stand". Recompilation of ANY code (yours or something in a system library
you use) could expand the size of your call stack any time you change the
configuration. (Installing a patch, for example.) Runtime timing variations could
also affect the depth of your call stack. Cutting it too close is a recipe for
disaster... now or, more likely, (because it's more "fun" for the computer that
way), sometime later.

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation   http://members.aol.com/drbutenhof |
| 110 Spit Brook Rd ZKO2-3/Q18, Nashua NH 03062-2698              |
\--------[ http://www.awl.com/cseng/titles/0-201-63392-2/ ]-------/
 


=================================TOP===============================
 Q307:  correct pthread termination?   

sr wrote:

> I'm writing a multithreaded program under AIX 4.3.2, and noticed that
> all threads, whether or not created in detached state, once terminated
> correctly via a pthread_exit() call (after having freed any own
> resources), are still displayed by the ps -efml command until the main
> process terminates.

There's no requirement in POSIX or UNIX 98 that resources be freed at
any particular time. There's no way to force resources to be freed.

On the contrary, POSIX only places a requirement on you, the programmer,
to release your references to the resources (by detaching the thread) so
that the implementation is ABLE to free the resources (at some
unspecified and unbounded future time).
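
Releasing your reference means either creating the thread detached or
detaching it later; a sketch of both alternatives (the wrapper functions
here are invented, and error checking is omitted):

    #include <pthread.h>

    /* alternative 1: create the thread detached from the start */
    void start_detached(void *(*thread_func)(void *))
    {
        pthread_attr_t attr;
        pthread_t      tid;

        pthread_attr_init(&attr);
        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
        pthread_create(&tid, &attr, thread_func, NULL);
        pthread_attr_destroy(&attr);
    }

    /* alternative 2: create it normally, then give up the reference */
    void start_and_detach(void *(*thread_func)(void *))
    {
        pthread_t tid;

        pthread_create(&tid, NULL, thread_func, NULL);
        pthread_detach(tid);
    }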

> Is this normal, or there are still resources allocated I'm not aware
> of?

Many implementations cache terminated threads so that it can create new
threads more quickly. A more useful test than the program you show would
be, after the "first round" of threads have terminated, to create a new
round of threads. Does AIX create yet more kernel threads, or does it
reuse the previously terminated threads?

> In the latter case, how do I make sure that a thread, when
> (gracefully) terminated, gets completely freed?

Why would you care? The answer is, there's no way to do this. More
importantly, it should make absolutely no difference to your application
(and very little difference to the system) unless AIX is failing to
reuse those terminated threads. (I would also expect that unused cached
threads would eventually time out, but there's no rule that they must.)

In general, my advice would be "don't worry about things you don't need
to worry about". If you're really sure you do need to worry, please
explain why. What you have described is just "a behavior"; not "a
problem". If you're sure that behavior represents a problem for you,
you'll need to explain why it's a problem. (And while we're all curious
out here, you might keep in mind that it'll do you more good to explain
the problem to IBM support channels.)

Oh, and just a few comments about your program:

While it's probably "OK" for your limited purposes, I can't look at a
threaded program where main() does a sleep() [or any kind of a timed
wait] and then calls exit() without cringing. If you don't need main()
to hang around for some real purpose, then it should terminate with
pthread_exit(). If you do need it to hang around for some reason, a
timed wait is nearly always the wrong way to make it hang around.

Secondly, while I understand the desire to provide thread identification
in your printout, you should be aware that this "(unsigned
long)pthread_self()" construct is bad practice, and definitely
unportable. The pthread_t type is opaque. While many implementations
make this either a small integer or a pointer, it could as easily be a
structure. Unfortunately, POSIX lacks any mechanism to portably identify
individual threads to humans (that's "debugging", which is out of
scope). I'm not saying "don't do it"; I just want to make sure you know
it's a platform dependent hack, not a legal or portable POSIX construct.

[[ This is another re-post of a response that was lost on the bad new
server. ]]

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation   http://members.aol.com/drbutenhof |
| 110 Spit Brook Rd ZKO2-3/Q18, Nashua NH 03062-2698              |
\--------[ http://www.awl.com/cseng/titles/0-201-63392-2/ ]-------/
 


=================================TOP===============================
 Q308:  volatile guarantees??   

On Wed, 30 Aug 2000 11:28:46 -0700, David Schwartz  wrote:
>
>Joerg Faschingbauer wrote:
>> 
>> [email protected] (Kaz Kylheku) writes:
>> 
>> > Under POSIX threads, you don't need volatile so long as you use the
>> > locking mechanism supplied by the interface.
>> 
>> How does pthread_mutex_(un)lock manage to get the registers flushed?
>
>    Who cares, it just does. POSIX requires it.

Everyone is saying that, but I've never seen a chapter and verse quote.
I'm not saying that I don't believe it or that it's not existing practice;
but just that maybe it's not adequately codified in the document.

To answer the question: how can it manage to get the registers flushed?
Whether or not the requirement is codified in the standard, it can be met
in a number of ways. An easy way to meet the requirement is to spill
registers at each external function call.

Barring that, the pthread_mutex_lock functions could be specially recognized by
the compiler. They could be, for instance, implemented as inline functions
that contain special compiler directives telling the compiler to avoid
caching.

The GNU compiler has such a directive, for instance:

    __asm__ __volatile__ ("" : : : "memory");

The "memory" part takes care of defeating caching, and the __volatile__
prevents code motion of the inlined code itself.

Of course, GCC doesn't need this in the context we are discussing, because
it will do ``the right thing'' with external function calls.

I've only used the above as a workaround to GCC optimization bugs.
It can also be used as the basis for inserting a memory barrier instruction:

    #define mb() __asm__ __volatile__ ("" : : : "memory")

It's a good idea to do it like this so that the compiler's optimizations do
not make the memory barrier useless, by squirreling away data in registers
or moving the instruction around in the generated code.

This would be typically used in the implementation of a mutex function,
not in its interface, to ensure that internal accesses to the mutex object
itself are conducted properly. 

------
Hi,

I'm having trouble with the meaning of C/C++ keyword volatile. I know you
declare a variable volatile wherever it may be changed externally to the
flow of logic that the compiler is processing and optimising. This makes the
compiler read from the ultimate reserved storage when it is accessed (or so
I believed).

I have seen a discussion in one of the comp.lang.c* groups where it is
suggested that the compiler does not always have to avoid optimising away
memory accesses. This seems logical - since a thread which alters the value
of a variable might not get scheduled, the value of the variable may not
change for some time (many times round a busy loop), so the compiler can use
a cached value for many loops without changing the guarantees made by the
machine abstraction defined in the standards (*good* for performance). That
then renders volatile practically undefinable (since a thread may legally
*never* be scheduled) and when it is for hardware changing a flag, the
hardware doesn't necessarily change memory (a CPU register may be changed).
Volatile behaviour seems best implemented with a function call which uses
some guaranteed behaviour internally (in assembler or other language).

Has anyone hashed this out before and come to any conclusion on whether to
trust volatile, or whether the problem spreads to other languages? Because
it's doing my head in. (I ask this here because it concerns concurrent
programming and the people here probably have experience of this problem.)

--
Tristan Wibberley

 
In article ,
Kaz Kylheku  wrote:
>Under preemptive threading, the execution can be suspended *at any point* to
>invoke the scheduler; pthread_mutex_lock is not special in that regard. Yet
>compilers clearly do not treat each instruction with the same suspicion that
>pthread_mutex_lock deserves.

pthread_mutex_lock() is special. While threads may be pre-empted at
any point, a thread that does not hold the lock is not permitted to
access the shared data, so the order in which its operations are
performed is irrelevant. By calling pthread_mutex_lock() a thread gains
permission to access the shared data,
so at that point the thread needs to update any local copies of that
data. Similarly, by calling pthread_mutex_unlock() a thread
relinquishes this permission, so it must have updated the shared data
from any local copies. Between these two calls, it is the only thread
which is permitted to access the shared data, so it can safely cache
as it likes.
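
To make that concrete, a minimal sketch of the pattern being described:
shared data protected by a mutex, with no volatile anywhere (names are
illustrative).

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int shared_count;            /* note: not volatile */

    void bump(void)
    {
        pthread_mutex_lock(&lock);      /* gain the right to touch shared_count;
                                           any cached copies must be refreshed */
        shared_count++;                 /* safe to cache in a register here */
        pthread_mutex_unlock(&lock);    /* give up the right; the new value must
                                           be visible before anyone else locks */
    }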

>How about the following paragraph?
>
>    The values of all objects shall be made stable immediately
>    prior to the call to pthread_mutex_lock, pthread_mutex_unlock,
>    pthread_mutex_trylock and pthread_mutex_timedlock.  The
>    first abstract access to any object after a call to  one of the
>    locking functions shall be an actual access; any cached copy of
>    an object that is accessed shall be invalidated.

That is unnecessarily restrictive. Suppose we have a buffering scheme
which uses this code:
    for (;;) {
        lock
        while (!items_ready) wait(data)
        if (!--items_ready) signal(space)
        unlock
        use buffer[tail]
        tail = (tail + 1) % BUFLEN
    }

Analysis of the other uses of 'items_ready' may indicate that it can
be optimised into:
    local_ready = 0
    lastbatch = 0
    for (;;) {
        if (!local_ready) {
            lock
            items_ready -= lastbatch
            if (!items_ready) signal(space)
            while (!items_ready) wait(data)
            local_ready = lastbatch = items_ready
            unlock
        } else local_ready--
        use buffer[tail]
        tail = (tail + 1) % BUFLEN
    }

Of course the analysis required to determine that this is a valid
optimisation is not simple, and I would not expect to find it in
current compilers, but I don't think the standard should prohibit it.



Kaz Kylheku wrote:

> The standard is flawed because it doesn't mention that calls to
> pthread_mutex_lock and pthread_mutex_unlock must be treated specially.
> We all know how we want POSIX mutexes to work, and how they do work in
> practice, but it should also be codified in the standard, even though
> it may be painfully obvious.

The standard requires memory coherency between threads based on the POSIX
synchronization operations. It does NOT specifically dictate the compiler or system
behavior necessary to achieve that coherency, because it has no power over the C
language nor over the hardware. Besides, it really doesn't matter how the
requirements are achieved, nor by whom.

An implementation (thread library, compiler, linker, OS, hardware, etc.) that
doesn't make memory behave correctly with respect to POSIX synchronization
operations simply does not conform to POSIX. This means, in particular, (because
POSIX does not require use of volatile), that any system that doesn't work without
volatile is not POSIX. Can such a system be built? Certainly; but it's not POSIX.
(It's also not particularly usable, which may be even more important to some
people.)

OK, you want chapter and verse? Sure, here we go. POSIX 1003.1-1996, page 32:

2.3.8 memory synchronization: Applications shall ensure that access to any memory
location by more than one thread of control (threads or processes) is restricted
such that no thread of control can read or modify a memory location while another
thread of control may be modifying it. Such access is restricted using functions
that synchronize thread execution and also synchronize memory with respect to other
threads. The following functions synchronize memory with respect to other threads:

      fork()                 pthread_mutex_unlock()   sem_post()
      pthread_create()       pthread_cond_wait()      sem_trywait()
      pthread_join()         pthread_cond_timedwait() sem_wait()
      pthread_mutex_lock()   pthread_cond_signal()    wait()
      pthread_mutex_trylock() pthread_cond_broadcast() waitpid()

In other words, the application is responsible for relying only on explicit memory
synchronization based on the listed POSIX functions. The implementation is
responsible for ensuring that correct code will see synchronized memory. "Whatever
it takes."

Normally, the compiler doesn't need to do anything it wouldn't normally do for a
routine call to achieve this. A particularly aggressive global optimizer, or an
implementation that "inlines" mutex operations, might need additional compiler
support to meet the requirements, but that's all beyond the scope of the standard.
The requirements must be met, and if they are, application and library developers
who use threads just don't need to worry. Unless of course you choose to try to
create your own memory synchronization without using the POSIX functions, in which
case no current standard will help you and you're entirely on your own on each
platform.

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation   http://members.aol.com/drbutenhof |
| 110 Spit Brook Rd ZKO2-3/Q18, Nashua NH 03062-2698              |
\--------[ http://www.awl.com/cseng/titles/0-201-63392-2/ ]-------/
=================================TOP===============================
 Q309: passing messages, newbie?   

[email protected] wrote:

> Hi,
>
> Thank you all for the  info. i think i am
> on my way, things are working now.
>
> One more thing, what is the  correct way to
> make a thread sleep/delay.
> The "sleep()" call i think causes the whole
> process to sleep.

No, it cannot. At least, not in any legal implementation of POSIX threads.
(Nor, in my opinion, in any rational implementation of any usable thread
interface.)

In practice, this happens in some "cheap" (by which I do not mean
"inexpensive") pure user-mode threading libraries. These used to be common
and widely used. There are now real thread packages available "just about
everywhere", and you should run from any implementation with this sort of
"quirk".

In any implementation that uses "multiple kernel execution entities", the
buggy behavior would actually be difficult to achieve, and nearly
impossible to get by accident. Under Linux, for example, threads are really
independent Linux processes. Just try "accidentally" getting another
process to block when you sleep (or read from a file).

> I saw in a paper a call "pthread_delay_np()"
> call to delay a thread, but i couldnt find the
> call in man pages on my Linux 2.2.12 with glibc-2.1.12.

You won't find it on any "pure" implementation of POSIX threads, because
that function doesn't exist. It's from the ancient and obsolete DCE threads
package (which was a cheap user-mode implementation of a long since defunct
draft of the document that eventually became POSIX threads). Because we
couldn't count on having sleep() work, and instead of using somewhat less
portable means to supersede the real sleep() by something that would work,
we introduced pthread_delay_np(). (Which is modelled after nanosleep().)
The function is retained in our current implementation of POSIX threads (on
Tru64 UNIX and OpenVMS) as an extension, partly for "cultural
compatibility" to help people upgrading from DCE threads and partly
because, on OpenVMS, there are still compilation/link modes where we can't
count on sleep() working correctly.

> So we have to delay a thread using sleep() only.

This is correct. Or usleep(), nanosleep(), select(), or whatever is
appropriate.
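
A minimal sketch of delaying only the calling thread with nanosleep(); in a
proper POSIX threads implementation this never blocks the whole process
(error handling is abbreviated to the interrupted-sleep case).

    #include <errno.h>
    #include <time.h>

    void thread_delay(double seconds)
    {
        struct timespec req, rem;

        req.tv_sec  = (time_t)seconds;
        req.tv_nsec = (long)((seconds - (double)req.tv_sec) * 1e9);
        while (nanosleep(&req, &rem) == -1 && errno == EINTR)
            req = rem;      /* interrupted by a signal: retry with what's left */
    }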

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation   http://members.aol.com/drbutenhof |
| 110 Spit Brook Rd ZKO2-3/Q18, Nashua NH 03062-2698              |
\--------[ http://www.awl.com/cseng/titles/0-201-63392-2/ ]-------/

 
=================================TOP===============================
 Q310:  solaris mutexes?   

Roy Gordon wrote:

> Is the following true:  Solaris mutexes only exist in the user address
> space (including any shared memory space); they have no associated
> kernel data structure.

True. For non-shared-memory mutexes. Ditto CVs & unnamed semaphores.

>
> If true, this would be opposed to system V semaphores.

Yup.

>
>
> Also, if true, then a given mutex could be moved to a different
> address (suitably aligned) and as long as all threads (or processes, as
> the case may be) reference it at that address, then it would continue
> functioning as if it hadn't been moved.
>
> Is this correct too (if the initial assumption is correct, that is)?

You mean like a compacting garbage collector? Yeah. 'Matter of
fact, I believe that's what Java does on some platforms.

-Bil
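
A minimal sketch of the shared-memory case mentioned above: a mutex placed in
memory visible to more than one process and marked process-shared. This
assumes the implementation supports the _POSIX_THREAD_PROCESS_SHARED option;
MAP_ANON is a common extension (a shm_open() or file-backed mapping works
too).

    #include <pthread.h>
    #include <sys/mman.h>

    /* All of the mutex state lives in user memory, so any process that maps
       the same memory (at whatever address) can use it. */
    pthread_mutex_t *make_shared_mutex(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutex_t *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANON, -1, 0);
        if (m == MAP_FAILED)
            return NULL;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(m, &attr);
        pthread_mutexattr_destroy(&attr);
        return m;
    }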

 

=================================TOP===============================
 Q311: Spin locks?  

I think it worth noting that spin locks are an efficiency hack for
SMP machines which are useful in a small number of situations.
Moreover, there is nothing that prevents you from using spin locks
all the time (other than a slight loss of efficiency).

In particular, in some libraries ALL locks are actually spin locks. Solaris
2.6 (or is that 7?) and above for example. If you call pthread_mutex_lock()
on an MP machine & the lock is held by a thread currently running on
another CPU, you WILL spin for a little while.

It is very unlikely you would EVER want to build a spin lock yourself.
(I mean, it would be kinda FUN and interesting, but not practical.)
If you *really* want to, go ahead, just time your program carefully.
$10 says you'll find home-spun spin locks won't help.
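
For the curious only, roughly what a home-spun spin lock looks like. This
sketch assumes GCC's __sync atomic builtins, which postdate some of the
systems discussed here; time it against pthread_mutex_lock() and you will
usually lose the $10.

    static volatile int spin;           /* 0 = free, 1 = held */

    static void spin_lock(void)
    {
        while (__sync_lock_test_and_set(&spin, 1))
            while (spin)
                ;                       /* spin read-only until it looks free */
    }

    static void spin_unlock(void)
    {
        __sync_lock_release(&spin);     /* store 0 with release semantics */
    }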

>
> BTW, SMP is a bad design that scales poorly. I wish someone could come
> up with a better design with some local memory & some shared memory
> instead.

Like democracy. It sucks, but we haven't come up with anything better :-)

-Bil

 

> > BTW, SMP is a bad design that scales poorly. I wish someone could come
> > up with a better design with some local memory & some shared memory
> > instead.
>
> Like democracy. It sucks, but we haven't come up with anything better :-)

You would like SGI's high-end monster-machines... Ours consists of eight
"node boards," each of which has two CPU's and a local memory pool (512MB or
so). All the memory in the machine is visible to all processors, but IRIX
intelligently migrates individual pages towards the processors that are
hitting them most. It's like another level of cache... As long as each
thread/process stays within a modest, unshared working set, the system
scales very well.

Dan 
 

-- 
Eppur si muove 
 


=================================TOP===============================
 Q312:  AIX pthread pool problems?   
Kaz Kylheku wrote:

> On Thu, 31 Aug 2000 17:52:08 GMT, sr  wrote:
> >/* lck.c */
> >/* AIX: xlc_r7 lck.c -qalign=packed -o lck */
> >/* LINUX: cc lck.c -lpthread -fpack-struct -o lck */
>
> Doh, have you read the GNU info page for gcc? Here is what it
> says about -fpack-struct:
>
> `-fpack-struct'
>      Pack all structure members together without holes.  Usually you
>      would not want to use this option, since it makes the code
>      suboptimal, and the offsets of structure members won't agree with
>      system libraries.

Kaz is quite correct, but maybe not quite firm enough...

NEVER, ever, under any circumstances, "pack" any structure that you didn't define.
If a header wants its structures packed, it'll do it itself. If it doesn't ask,
don't presume to tell it what it should do.

You asked the compiler to break your mutexes, and it let you, because it didn't
know any better. Now you do know better, so stop asking it. ;-)
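
A sketch of the safe alternative: pack only the structures you define
yourself, and leave library types such as pthread_mutex_t with the alignment
their headers expect (GCC-style __attribute__((packed)) assumed; the field
names are made up).

    #include <pthread.h>

    struct wire_record {            /* your own on-the-wire layout: pack it */
        char     tag;
        int      value;
    } __attribute__((packed));

    struct shared_state {           /* contains a library type: do NOT pack */
        pthread_mutex_t lock;       /* must keep its natural alignment */
        struct wire_record rec;
    };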

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation   http://members.aol.com/drbutenhof |
| 110 Spit Brook Rd ZKO2-3/Q18, Nashua NH 03062-2698              |
\--------[ http://www.awl.com/cseng/titles/0-201-63392-2/ ]-------/
 


=================================TOP===============================
 Q313: iostream libray and multithreaded programs ?  

Bil,

I took a class that you taught at Xilinx. I have written some small
multithreaded programs. These programs work fine if I don't use iostream
library. If I use iostream, these program hanged with the following
message:
  libc internal error: _rmutex_unlock: rmutex not held.

I have attached two files with this message: problem.txt and good.txt.
These files show the compile options. The only difference is that the
broken version use a extra option "-library=iostream,no%Cstd", in both
compile and link lines. Unfortunately, this is the standard build option
at Xilinx. On one at Xilinx understands why this option will break my
program. Could you help me with this problem? Thank you very much.

Meiwei



[[ Attached file: good.txt ]]

1) make output:

  CC -O -c -I../ -DSOL -DDLLIMPORT="" -DTEMP -DDEBUG ../Port_ThrTest.c
  CC -O -c -I../ -DSOL -DDLLIMPORT="" -DTEMP -DDEBUG ../Port_ThrMutex.c
  CC -O -c -I../ -DSOL -DDLLIMPORT="" -DTEMP -DDEBUG ../Port_ThrCondition.c
  CC -O -c -I../ -DSOL -DDLLIMPORT="" -DTEMP -DDEBUG ../Port_ThrBarrier.c
  CC -O -c -I../ -DSOL -DDLLIMPORT="" -DTEMP -DDEBUG ../Port_ThrThread.c
  "../Port_ThrThread.c", line 49: Warning (Anachronism): Formal argument 
start_routine of type extern "C" void*(*)(void*) in call to pthread_create(unsigned*, 
const _pthread_attr*, extern "C" void*(*)(void*), void*) is being passed void*(*)(void*).
  1 Warning(s) detected.
  CC  -L. -o Port_ThrTest Port_ThrMutex.o Port_ThrCondition.o Port_ThrBarrier.o 
Port_ThrThread.o Port_ThrTest.o -lpthread -lposix4

2) ldd output:

    libpthread.so.1 =>       /usr/lib/libpthread.so.1
    libposix4.so.1 =>        /usr/lib/libposix4.so.1
    libCrun.so.1 =>  /usr/lib/libCrun.so.1
    libm.so.1 =>     /tools/sparcworks5.0/SUNWspro/lib/libm.so.1
    libw.so.1 =>     /usr/lib/libw.so.1
    libc.so.1 =>     /usr/lib/libc.so.1
    libaio.so.1 =>   /usr/lib/libaio.so.1
    libdl.so.1 =>    /usr/lib/libdl.so.1
    libthread.so.1 =>        /usr/lib/libthread.so.1

 


=================================TOP===============================
 Q314:  Design document for MT appli?   

> > If you describe what it is you need to achieve, someone in this forum
> > can advise against using threads, or, if they think threads will be good,
> > how to use them for best effect.

Here is something I recently posted to the Linux kernel list:

------

Let's go back to basics. Take a look inside your computer. What do you see?

1) one (or more) CPUs
2) some RAM
3) a PCI bus, containing:
4)   -- a SCSI/IDE controller
5)   -- a network card
6)   -- a graphics card

These are all the parts of your computer that are smart enough to accomplish
some amount of work on their own. The SCSI or IDE controller can read data
from disk without bothering any other components. The network card can send
and receive packets fairly autonomously. Each CPU in an SMP system operates
nearly independently. An ideal application could have all of these devices
doing useful work at the same time.

When people think of "multithreading," often they are just looking for a way
to extract more concurrency from their machine. You want all these
independent parts to be working on your task simultaneously. There are many
different mechanisms for achieveing this. Here we go...

A naively-written "server" program (eg a web server) might be coded like so:

* Read configuration file - all other work stops while data is fetched from
disk
* Parse configuration file - all other work stops while CPU/RAM work on
parsing the file
* Wait for a network connection - all other work stops while waiting for
incoming packets
* Read request from client - all other work stops while waiting for incoming
packets
* Process request - all other work stops while CPU/RAM figure out what to do
                  - all other work stops while disk fetches requested file
* Write reply to client - all other work stops until final buffer
transmitted

I've phrased the descriptions to emphasize that only one resource is being
used at once - the rest of the system sits twiddling its thumbs until the
one device in question finishes its task.


Can we do better? Yes, thanks to various programming techniques that allow
us to keep more of the system busy. The most important bottleneck is
probably the network - it makes no sense for our server to wait while a slow
client takes its time acknowledging our packets. By using standard UNIX
multiplexed I/O (select()/poll()), we can send buffers of data to the kernel
just when space becomes available in the outgoing queue; we can also accept
client requests piecemeal, as the individual packets flow in. And while
we're waiting for packets from one client, we can be processing another
client's request.

The improved program performs better since it keeps the CPU and network busy
at the same time. However, it will be more difficult to write, since we have
to maintain the connection state manually, rather than implicitly on the
call stack.
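
A skeleton of that multiplexed approach: one thread, many connections, state
kept per connection rather than on the call stack. The connection table and
the accept_new_client()/handle_readable() helpers are hypothetical
placeholders for your own protocol code.

    #include <sys/select.h>
    #include <unistd.h>

    extern void accept_new_client(int listen_fd);   /* hypothetical helpers */
    extern void handle_readable(int fd);

    void event_loop(int listen_fd, int *conns, int nconns)
    {
        for (;;) {
            fd_set rfds;
            int i, maxfd = listen_fd;

            FD_ZERO(&rfds);
            FD_SET(listen_fd, &rfds);
            for (i = 0; i < nconns; i++) {
                FD_SET(conns[i], &rfds);
                if (conns[i] > maxfd)
                    maxfd = conns[i];
            }

            if (select(maxfd + 1, &rfds, NULL, NULL, NULL) <= 0)
                continue;                       /* interrupted or error: retry */

            if (FD_ISSET(listen_fd, &rfds))
                accept_new_client(listen_fd);
            for (i = 0; i < nconns; i++)
                if (FD_ISSET(conns[i], &rfds))
                    handle_readable(conns[i]);  /* advance that client's state */
        }
    }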


So now the server handles many clients at once, and it gracefully handles
slow clients. Can we do even better? Yes, let's look at the next
bottleneck - disk I/O. If a client asks for a file that's not in memory, the
whole server will come to a halt while it read()s the data in. But the
SCSI/IDE controller is smart enough to handle this alone; why not let the
CPU and network take care of other clients while the disk does its work?

How do we go about doing this? Well, it's UNIX, right? We talk to disk files
the same way we talk to network sockets, so let's just select()/poll() on
the disk files too, and everything will be dandy... (Unfortunately we can't
do that - the designers of UNIX made a huge mistake and decided against
implementing non-blocking disk I/O as they had with network I/O. Big booboo.
For that reason, it was impossible to do concurrent disk I/O until the POSIX
Asynchronous I/O standard came along. So we go learn this whole bloated API,
in the process finding out that we can no longer use select()/poll(), and
must switch to POSIX RT signals - sigwaitinfo() - to control our server***).
After the dust has settled, we can now keep the CPU, network card, and the
disk busy all the time -- so our server is even faster.


Notice that our program has been made heavily concurrent, and I haven't even
used the word "thread" yet!


Let's take it one step further. Packets and buffers are now coming in and
out so quickly that the CPU is sweating just handling all the I/O. But say
we have one or three more CPU's sitting there idle - how can we get them
going, too? We need to run multiple request handlers at once.

Conventional multithreading is *one* possible way to accomplish this; it's
rather brute-force, since the threads share all their memory, sockets, etc.
(and full VM sharing doesn't scale optimally, since interrupts must be sent
to all the CPUs when the memory layout changes).

Lots of UNIX servers run multiple *processes* - the "sub-servers" might not
share anything, or they might share a file cache or request queue. If we were
brave, we'd think carefully about what resources really should be shared
between the sub-servers, and then implement it manually using Linux's awesome
clone() API. But we're not, so let's retreat to the brightly-lit
neighborhood that is pthreads.

We break out the POSIX pthread standard, and find it's quite a bit more
usable than AIO. We set up one server thread for each CPU; the threads now
share a common queue of requests****. We add locking primitives around the
shared data structures in our file cache. Now as soon as a new packet or
disk buffer arrives, any one of the CPUs can grab it and perform the
associated processing, while the other CPUs handle their own work. The
server gets even faster.
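
A sketch of that shared request queue: worker threads block on a condition
variable until a request arrives. "struct request" and its link field are
hypothetical placeholders; a real server would use a FIFO and a bounded
queue, but the locking pattern is the same.

    #include <pthread.h>

    struct request { struct request *next; /* ... request data ... */ };

    static struct request *head;
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

    void enqueue(struct request *r)
    {
        pthread_mutex_lock(&qlock);
        r->next = head;
        head = r;
        pthread_cond_signal(&qcond);    /* wake one waiting worker */
        pthread_mutex_unlock(&qlock);
    }

    struct request *dequeue(void)       /* called by each worker thread */
    {
        struct request *r;
        pthread_mutex_lock(&qlock);
        while (head == NULL)
            pthread_cond_wait(&qcond, &qlock);
        r = head;
        head = r->next;
        pthread_mutex_unlock(&qlock);
        return r;
    }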


That's basically the state-of-the-art in concurrent servers as it stands
today. All of the independent devices in the computer are being used
simultaneously; the server plows through its workload, never waiting for
network packets or disk I/O. There are still bottlenecks - for instance, RAM
and PCI bandwidth are limited resources. We can't just keep adding more CPUs
to make it faster, since they all contend for access to the same pool of RAM
and the same bus. If the server still isn't fast enough, we need a better
machine architecture that separates RAM and I/O busses into
concurrently-accessible pools (e.g. a high-end SGI server).

There are various other tricks that can be done to speed up network servers,
like passing files directly from the buffer cache to the network card. This
one is currently frowned upon by the Linux community, since the time spent
copying data around the system is small compared to the overhead imposed by
fiddling with virtual memory. Lots of work does go into reducing system call
and context switch overhead; that's one of the reasons TUX was developed.


Let's drop the "web server" example and talk about another application that
benefits from concurrency - number crunching. This is a much simpler case,
since the only resources you're worried about are the CPUs and RAM. To get
all the CPU's going at once, you'll need to run multiple threads or
processes. To get truly optimal throughput, you might choose to go the
process route, so that shared memory is kept to an absolute minimum. (Not
that pthreads is a terrible choice; it can work very well for this purpose)


In summary, when "multithreading" floats into your mind, think
"concurrency." Think very carefully about how you might simultaneously
exploit all of the independent resources in your computer. Due to the long
and complex history of OS development, a different API is usually required
to communicate with each device. (e.g. old-school UNIX has always handled
non-blocking network I/O with select(), but non-blocking disk I/O is rather
new and must be done with AIO or threads; and don't even ask about
asynchronous access to the graphics card =).

Don't let these differences obscure your goal: just figure out how to use
the machine to its fullest potential. That's the Linux way of doing things:
think, then act.


-- Dan


The ideas here mostly come from informative pages like Dan Kegel's "C10K"
http://www.kegel.com/c10k.html, and from reading various newsgroup postings
and UNIX books.


*** POSIX AIO is so ugly, in fact, that it's not unheard-of to simply spawn
a pool of threads that handle disk I/O. You can send requests and replies
via a pipe or socket, which fits right in with the old select()/poll() event
loop

**** If we're servicing many, many clients at once, then running a huge
select()/poll() in each thread will have outrageous overhead. In that case,
we'd have to use a shared POSIX I/O signal queue, which can be done with
clone(), but not pthreads()... See Zach Brown's phhttpd
http://www.zabbo.net/phhttpd/

 
=================================TOP===============================
 Q315:  SCHED_OTHER, and priorities?   

Dale Stanbrough wrote:

> Patrick TJ McPhee wrote:
>
> > % thinking about them. I simply wanted to know if, given two threads that
> > % are available to run with different priorities, will SCHED_OTHER
> > % -always- choose the higher priority thread. Also will SCHED_OTHER
> > % -never- preempt a higher priority thread simply to run one of lower
> > % priority?
> >
> > SCHED_OTHER does not specify any particular scheduling policy. The
> > behaviour will vary from system to system.
>
> No it doesn't have to. There could be a part of the POSIX reference that
> says something like
>
>    "Under no circumstances should any scheduling policy preempt a
>     higher priority thread to run a lower priority thread".

There IS, for the realtime priorities that are defined by POSIX. But the whole
point of SCHED_OTHER (and for many good reasons) is to provide a "standard
name" for a policy that doesn't necessarily follow any of the POSIX rules.

> However it seems from other people's posting that there is no such
> restriction made, or in some cases, possible to be made. I suppose
> the next logical question to ask is...
>
>    Are there any SCHED_OTHER or other named policies other than
>    SCHED_RR and SCHED_FIFO that -do- such preemption?

That depends on your definitions and point of view. In the simplest terms,
from an external user view, the answer is a resounding "yes".

That's because many SCHED_OTHER implementations are based on standard
UNIX timeshare scheduling, for very good reasons. (That is, it has a long
history, it behaves in reasonable ways, and, perhaps most importantly, it
behaves in generally and widely understood ways.)

Reduced to general and implementation-independent terms, it works roughly like
this: each entity (thread or process) has TWO separate priorities, a "base"
priority and a "current" priority. You set (and see) only the base priority,
but the scheduler operates entirely on the current priority. This priority may
be adjusted to ensure that, over time, all entities get a "fair" share of the
processor resources available. Either compute-bound entities' current priority
may be gradually reduced from the base (where higher priorities are "better",
which is the POSIX model but not the traditional UNIX model), and/or entities
that block may be be gradually increased from the base. The net result is that
entities that have used a lot of CPU won't be given as much in the future,
while entities that haven't gotten much will be given more. In the end, it all
more or less evens out.

From the scheduler's point of view, higher priority entities are always
preferred. From your point of view, though, your high priority threads may
behave as if they were low priority, or vice versa. Truth is sometimes not
absolute. ;-)

Of course, the POSIX standard doesn't even require that level of "truth". It's
perfectly reasonable for SCHED_OTHER to completely ignore the specified
priority. (There's no requirement that SCHED_OTHER even have a scheduling
parameter.) Would such an implementation be useful? Not for some people
certainly... but they probably ought to be using realtime policies, which are
fully defined by POSIX.
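
For those cases, a minimal sketch of asking for a realtime policy explicitly
at thread creation (assumes the POSIX realtime scheduling option is supported
and that the caller has the privilege SCHED_FIFO typically requires):

    #include <pthread.h>
    #include <sched.h>

    int create_fifo_thread(pthread_t *tid, void *(*fn)(void *), void *arg, int prio)
    {
        pthread_attr_t attr;
        struct sched_param param;
        int rc;

        pthread_attr_init(&attr);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
        param.sched_priority = prio;    /* must lie between
                                           sched_get_priority_min/max(SCHED_FIFO) */
        pthread_attr_setschedparam(&attr, &param);

        rc = pthread_create(tid, &attr, fn, arg);
        pthread_attr_destroy(&attr);
        return rc;
    }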

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation   http://members.aol.com/drbutenhof |
| 110 Spit Brook Rd ZKO2-3/Q18, Nashua NH 03062-2698              |
\--------[ http://www.awl.com/cseng/titles/0-201-63392-2/ ]-------/ 


=================================TOP===============================
 Q316:  problem with iostream on Solaris 2.6, Sparcworks 5.0?   

I found the cause of my problem. In my company, we have build tools that generate
makefiles with lists of options for compiling and linking. The link option list
ends with -Bstatic. -mt implicitly appends -lthread and other libraries to the
link command. This causes ld to look for libthread.a instead of libthread.so. The
following link error went away once I removed -Bstatic.

Meiwei

"Webster, Paul [CAR:5E24:EXCH]" wrote:

> Meiwei Wu wrote:
> >
> > My test program is linked with a shared library. This shared library was
> > compiled and linked with -mt option.
> > If I compiled and linked with -mt option, I would get the following link
> > error:
> >
> >   ld: fatal: library -lthread: not found
>
> It sounds like your compiler isn't installed properly.  The WS5.0
> documentation says that to be multithreaded, your files must be compiled with
> the -mt option (which defines _REENTRANT for you) and linked with the -mt
> option (which links the thread library and libC_mtstubs in the correct order
> for you).
>
> Also, 5.0 on the sun has broken MT capabilities, especially when it comes to
> iostreams.  There are 3 patches available which help to fix this (and a bunch
> of other things):
> 107357-09
> 107311-10
> 107390-10
>
> --
> Paul Webster 5E24 [email protected] - My opinions are my own :-) -
> Fifth Law of Applied Terror: If you are given an open-book exam, you will
>     forget your book.  Corollary: If you are given a take-home exam, you
>     will forget where you live. 


=================================TOP===============================
 Q317:  pthread_mutex_lock() bug ???   

[email protected] writes:

>Thanks for pointing it out. I had made a mistake. I was working at a
>Solaris 2.6 machine while looking at an older Solaris 2.5 Answerbook.
>To my surprise 2.6 man pages do not specify an EPERM return value
>though they make comments like "only the thread that locked a mutex can
>unlock it" in man pthread_mutex_unlock.
>My guess is that 2.5 and 2.6 had a bug but rather than fixing it in
>2.6, they just deleted the EPERM part from the RETURN part of the
>manual.


No, not a bug.  This is completely intentional.

Mutexes require some expensive bus operations; if you do error
checking on unlock you suddenly require more of those expensive operations, so
mutex_unlock becomes a *lot* slower.

>Not really POSIX ensures that implementation will let only the locked
>thread to unlock it. This is acknowledge by solaris as well in
>their "only the thread that locked a mutex can unlock it" phrase. I
>also read this in Kleiman et. al 's "Programming with threads".

You should really read this as "you're not allowed to do so
but if you do all bets are off"

>Andrew>If you want mutex locking to be error-checked, you need to
>Andrew>create the mutex with the PTHREAD_MUTEX_ERRORCHECK type
>attribute.
>PTHREAD_MUTEX_ERRORCHECK is a type of teh mutex. Though I m not sure I
>suspect this was not there in the initial standard ( Both my books on
>pthreads do not make any mention of it ). Solaris did not have a
>pthread_mutexattr_settype() interface till 2.7 ( or 2.8 ??. It
>definitely wasn t there till 2.6 ). Instead this was the default
>behavior as per the man pages.


All comes clear when you read the unlock page in S7:

     If the mutex type is  PTHREAD_MUTEX_NORMAL, deadlock  detec-
     tion  is not provided. Attempting to relock the mutex causes
     deadlock. If a thread attempts to unlock a mutex that it has
     not  locked or a mutex which is unlocked, undefined behavior
     results.

     If the mutex type is  PTHREAD_MUTEX_ERRORCHECK,  then  error
     checking is provided. If a thread attempts to relock a mutex
     that it has already locked, an error will be returned. If  a
     thread  attempts to unlock a mutex that it has not locked or
     a mutex which is unlocked, an error will be returned.

Casper  
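
A minimal sketch of requesting the error-checking type explicitly, so that
unlocking a mutex you don't own returns an error instead of producing
undefined behavior (pthread_mutexattr_settype is UNIX 98 / later POSIX; older
systems may not have it):

    #include <pthread.h>

    pthread_mutex_t m;

    void init_checked_mutex(void)
    {
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
        pthread_mutex_init(&m, &attr);
        pthread_mutexattr_destroy(&attr);
    }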


=================================TOP===============================
 Q318:  mix using thread library?   

"Christina,

Hello. Yes, you can use both pthreads & solaris threads at the same time.
As a matter of fact, you do almost all the time! Most of the Solaris libraries
are written using Solaris threads.

I doubt that your hang has anything to do with mixing the two libraries.
But... how do you build your program? What does your compile line look
like?

If it hangs at malloc_unlock, then I would be suspicious that somewhere
your code is corrupting the library. I would use purify (or the Sun debugger's
bounds checker) to be certain that I wasn't writing past the end of an
array or to an invalid pointer.

-Bil

> Bil:
>     How are you!
>     This is Christina Li at Lucent Technologies. I have a specific question for
> MT programming, hopefully not too bother you.
>     In the man page of thr_create, it shows like we can use together both
> pthread and Solaris thread library in an application( on Unix Solaris 2.5 ), and
> most of the books don't talk anything about mixed using the two libraries. But I
> heard from some people , that it is not safe to use both pthread and Solaris
> thread at the same time in an application.
>     I have a large application, over 150K line code. Sometimes it just
> mysteriously hang at some very low level, like malloc_unlock or pthread_unlock
> or other places.
> It seems like if I build the application with POSIX_PTHREAD_SEMANTICS and linked
> with pthread helps to bypass some hang.
>     But I am really not sure whether it is a true fix or not, I am quite
> confused of this, would you please help to share some of your ideas?
>
>     Thanks very much!
>
> Christina Li.
 


=================================TOP===============================
 Q319:  Re: My agony continues (thread safe gethostbyaddr() on FreeBSD4.0) ?  

Stephen Waits  writes:

>Well, hoping to avoid serial FQDN resolution, I trashed gethostbyaddr()
>and attempted to write my own "thread-safe" equivalent.

>After much research and plodding through nameser.h, this is what I ended
>up with.  The problem is that it STILL doesn't seem to be thread-safe. 
>If I mutex wrap the call to res_query() my proggie works great, just
>that DNS lookups remain serial :(   I'm assuming res_query() uses some
>static data somewhere along the line (going to read the source right
>after this post).

>ANY suggestions on where to go next (besides "use ADNS") much
>appreciated!


Run Solaris (I think from 7 onwards we have a multithreaded resolver library).

I believe future bind versions will have a threaded library.  Perhaps
bind 9 has it?


Of course, in Solaris the issue of concurrent lookups was somewhat
more pressing with one daemon doing all the lookups.

Casper
--
Expressed in this posting are my opinions.  They are in no way related
to opinions held by my employer, Sun Microsystems. 


=================================TOP===============================
 Q320:  OOP and Pthreads?   


"Mark M. Young"  wrote in message
news:[email protected]...
> [...]
> After having said this, are you still
> suggesting that I use a function or macro to compare the addresses
> and lock the objects accordingly?  Is this common industry practice
> or something?

I can only speak for myself, but I have used it.

> So, the burden of locking an extra mutex, having an extra mutex in
> existence, and hurting serialization outways the burden of the
> complicated comparisons of addresses that might arise (e.g. 4
> objects)?  Once you go beyond 3 objects, the code to perform the
> comparisons would be rediculous and I would like to have a clean
> technique used universally.

It is actually not that ridiculous if you can use the C++ STL
library. Here's an example:

  #include <iostream>
  #include <queue>
  #include <algorithm>
  #include <functional>
  using namespace std;


  class ADT {
  public:
    ADT& operator += (const ADT& b);

  private:
    friend class LockObjects;

    int lock() {
      cout << "Locked  : " << this << endl;
      return 0; // Required by my C++/STL implementation.
    };

    int unlock() {
      cout << "Unlocked: " << this << endl;
      return 0;
    };
  };


  class LockObjects : private priority_queue<ADT*> {
  public:
    LockObjects(ADT* pArg1, ADT* pArg2)  {
      push(pArg1);
      push(pArg2);

      for_each(c.begin(), c.end(), mem_fun(&ADT::lock));
    }

    ~LockObjects() {
      for_each(c.rbegin(), c.rend(), mem_fun(&ADT::unlock));
    }
  };


  ADT& ADT::operator += (const ADT& rhs) {
    LockObjects
      lock(this, const_cast<ADT*>(&rhs));

    // Add the two.

    return *this;
  }


  int main() {
    ADT a, b;

    a += b;

    cout << endl;

    b += a;

    return 0;
  }

Pushing into the priority_queue sorts the objects in decreasing
address order. In practice you probably want to separate the
sorting of the objects and the actual locking of them.

I have used this scheme on several occasions and it has worked
quite nicely. I don't know how expensive the use of priority_queue
is, as it in my case has not been of particular importance.
--
[email protected]



 

All the other responders (Kylheku, Wikman, Butenhof) had excellent
suggestions.  I prefer the key to the address for sorting the locks, for the
reason KK mentioned, but if you have operator< defined for your ADTs, you
could potentially use that for ordering, taking care in the case of
equality.  Wikman's priority queue (you could also use an STL set) is a nice
way to get the sorting cleanly.

I'm not sure whether I was clear in warning you off the class lock.  If you
have thousands of ADTs that are supposed to participate in binary operations
in multiple threads, by introducing the class lock you force the thousands
of operations to be carried out sequentially -- no parallelism is possible.
This will matter a lot on an SMP box, or if any of the operations involve
i/o or something else that blocks the processor.  Comparing a few addresses
(usually a pair of addresses) is trivial in comparison.
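
A plain C sketch of the same idea: always lock the two objects in a fixed
(address) order, so two threads doing "a += b" and "b += a" cannot deadlock.
"struct adt" and its embedded mutex are illustrative; unlock in the reverse
order.

    #include <pthread.h>
    #include <stdint.h>

    struct adt { pthread_mutex_t lock; /* ... data ... */ };

    void lock_pair(struct adt *a, struct adt *b)
    {
        if (a == b) {                           /* same object: lock once */
            pthread_mutex_lock(&a->lock);
            return;
        }
        if ((uintptr_t)a > (uintptr_t)b) {      /* impose a global order */
            struct adt *t = a; a = b; b = t;
        }
        pthread_mutex_lock(&a->lock);
        pthread_mutex_lock(&b->lock);
    }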

The suggestion to provide locking primitives and let the higher-level code
decide how to use them is very valuable.  Typically you provide, e.g.,
operator+ which grabs the locks on its operands, and then calls plusInternal
which does the addition.  Then operator+= grabs the locks and calls
plusInternal and assignInternal, thus avoiding the need for recursive
mutexes and a lock on the temporary.  If you are multiplying two matrices of
these ADTs, matrixMultiply grabs locks on all the elements of all the
matrices, then calls all the usual multiplyInternal and plusInternal members
avoiding a horrendous number of recursive lock/unlock calls.  If you had a
parallelMatrixMultiply operation, it would lock all the elements of both
matrices (and possibly the result matrix), hand off groups of rows from one
matrix and groups of columns from the other to the participating threads,
which would use the multiplyInternal,plusInternal and assignInternal
operations and proceed without taking any additional locks.

This is much more in the spirit of how the STL is used in a threadsafe
fashion (see the discussion in
http://www.sgi.com/Technology/STL/thread_safety.html).

By the way, you can implement a recursive mutex on top of the normal
pthread_mutex by wrapping pthread_mutex in a class which holds the owner
thread id and a reference counter.  When you are trying to lock, you see if
you are already the owner.  If so, you increment the reference count.  If
you are not, you call the pthread_mutex_lock function.  When you unlock, you
decrement the reference counter and only if it's zero do you call
pthread_mutex_unlock.  A full implementation can be found in the ACE
(Adaptive Communications Environment) library,
http://www.cs.wustl.edu/~schmidt/ACE.html

Jeff
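
A sketch of the recursive wrapper described above, in C: track the owner and
a nesting count on top of an ordinary pthread mutex. This is simplified for
illustration (a production version, such as ACE's, guards the owner/count
bookkeeping more carefully).

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_t       owner;
        int             count;      /* 0 means "not held" */
    } rmutex_t;

    void rmutex_lock(rmutex_t *m)
    {
        if (m->count > 0 && pthread_equal(m->owner, pthread_self())) {
            m->count++;             /* already ours: just bump the nesting */
            return;
        }
        pthread_mutex_lock(&m->lock);
        m->owner = pthread_self();
        m->count = 1;
    }

    void rmutex_unlock(rmutex_t *m)
    {
        if (--m->count == 0)
            pthread_mutex_unlock(&m->lock);
    }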
 


=================================TOP===============================
 Q321:  query on threading standards?   

Bill,

I don't know if you recall or not, but we talked at Usenix in San Diego
about threading under Linux.

I was curious about your opinion on a couple of things.

(1)  Have you been following the threading discussion on linux kernel
(summarized at
       http://kt.linuxcare.com/kernel-traffic/kt20000911_84.epl#1 for those
of us that have lives)?
       I wondered if you had any opinions on the Posix thread discussion
and Linus's evaluation
       of pthreads.
(2)  Is there a current POSIX standard for Pthreads?  Where might one find
or obtain this.
       I couldn't find a reference to where the standard is in
"Multithreaded Programming with
       pthreads".
(3)  There also has been a lot of discussion on the mailing list about some
changes
       that Linux has put into 2.4.0-test8 to support "thread groups".
This is a way to
       provide a container for Linux threads (the process provides this
container on most
       other operating systems).  Apparently this breaks the current Linux
implementation
of pthreads.  But other than that it is a good thing and should allow a
better implementation
       of pthreads under Linux, but no details are forthcoming at the
moment.....

Best Regards,

Ray Bryant
IBM Linux Technology Center
[email protected]
512-838-8538
http://oss.software.ibm.com/developerworks/opensource/linux

We are Linux. Resistance is an indication that you missed the point

"...the Right Thing is more important than the amount of flamage you need
to go through to get there"
 --Eric S. Raymond 

Ray,

Yeah, I remember. It was right after the BOF, right?

> Bill,
>
> I don't know if you recall or not, but we talked at Usenix in San Diego
> about threading under Linux.
>
> I was curious about your opinion on a couple of things.
>
> (1)  Have you been following the threading discussion on linux kernel
> (summarized at
>        http://kt.linuxcare.com/kernel-traffic/kt20000911_84.epl#1 for those
> of us that have lives)?
>        I wondered if you had any opinions on the Posix thread discussion
> and Linus's evaluation
>        of pthreads.

I just took a gander at it. Now I won't say I got it all figured out in 15
minutes
of scanning, but Linus is, as is often the case, full of himself. He talks
about "Right" (Linux) vs. "Engineered" (POSIX) as a moral battle. I probably
agree with him on many points, but at the end of the day I want something that
works & is used. PThreads fits the bill.

(I notice that Linus is not about using other engineered solutions...)


>
> (2)  Is there a current POSIX standard for Pthreads?  Where might one find
> or obtain this.
>        I couldn't find a reference to where the standard is in
> "Multithreaded Programming with
>        pthreads".

It's in there. For $130 (no free on-line access :-(  ) you can buy it from
IEEE.  NB: POSIX (1995) vs. UNIX98 (the follow on with a few slight
changes.)

>
> (3)  There also has been a lot of discussion on the mailing list about some
> changes
>        that Linux has put into 2.4.0-test8 to support "thread groups".
> This is a way to
>        provide a container for Linux threads (the process provides this
> container on most
>        other operating systems).  Apparently this breaks the current Linux
> implementation
>        of pthreads.  But other than that it is a good thing should allow a
> better implementation
>        of pthreads under Linux, but no details are forthcoming at the
> moment.....

No idea about that. It sounds very odd, considering Linus' railing about
LWPs being broken. I'd want to hear a *very* good reason for threads
groups, esp. as they are pretty much useless in Java.

-Bil 

Bil,

Yes, we talked right after the BOF at Usenix.

I think the thing that upsets me about the Linus/Linux discussion on
threading is the utter contempt
the Linux core team has for the rest of the world.  I mean I am willing to
accept that POSIX threads
were not designed to fit into the Linux threading model.  But it seems to
me that the people who
worked on the POSIX thread standard were trying to pull together a
consensus among a wide
variation of expecations, requirements, and that they likely did this in a
concientious, diligient,
and competent way.  To that work "shit or crap" is disrepectful to the
people who tried (quite
hard) to make it a good standard.  Oh well.

Yes, I eventually found in your book where to go to order the POSIX
specification.  But I figured not only would I not get my management to
cough up $143; I really didn't want to read a 784-page document.  I think
I am going to go with the Butenhof book instead.

The thread group changes that Linus has put in are an attempt to provide a
"container" for a program's threads.  Using this, one can send a signal to
the container and have one of the threads in the thread group that is
enabled for the signal be the one that gets the signal, as per POSIX
semantics.  At the moment, however, the changes break pthreads.  Oh well,
again.

Best Regards,

Ray Bryant
IBM Linux Technology Center
[email protected]
512-838-8538
http://oss.software.ibm.com/developerworks/opensource/linux

We are Linux. Resistance is an indication that you missed the point

"...the Right Thing is more important than the amount of flamage you need
to go through to get there"
 --Eric S. Raymond
 


=================================TOP===============================
 Q322:  multiprocesses vs multithreaded..??   

>
>Patrick TJ McPhee wrote:
>> 
>> In article <[email protected]>,
>> David Schwartz   wrote:
>> 
>> %       Threads are inherently faster than multiple proceses.
>> 
>> Bullshit.
>
>    Refute it then.

For Linux users, Question/Answer 6 of this interview:

http://slashdot.org/interviews/00/07/20/1440204.shtml

by kernel developer Ingo Molnar on the TUX webserver is
quite illuminating -- he notes the context switch time for
two threads and for two processes under Linux is identical,
around 2 microseconds on a 500 MHz PIII. Ingo makes the case that using
threads (instead of just fork()'ing off processes) under Linux should be
reserved for:

  "where there is massive and complex interaction between
   threads. 98% of the programming tasks are not such. 
   Additionally, on SMP systems threads are *fundamentally
   slower*, because there has to be (inevitable, hardware-
   mandated) synchronization between CPUs if shared VM is used."


Just passing along an interesting interview, not necessarily
my personal opinion -- actually, on the project I'm working on
these days, using non-blocking I/O of multiple streams in a single
process turns out to be the best way of doing things, so I 
vote for "none of the above" :-).

-------------------------------------------------------------------------
John Lazzaro -- Research Specialist -- CS Division -- EECS -- UC Berkeley
lazzaro [at] cs [dot] berkeley [dot] edu     www.cs.berkeley.edu/~lazzaro
-------------------------------------------------------------------------
--  


=================================TOP===============================
 Q:  Win32 pthreads implementation?

Check http://sources.redhat.com/pthreads-win32/


Regards,
Jani Kajala

"Mark M. Young"  wrote in message
news:[email protected]...
> I've read the FAQ, I've searched the net.  Could someone help me to a
> Win32 Pthreads implementation (I don't use Windows by choice)?
 


=================================TOP===============================
 Q323:  CGI & Threads?   

Do a web search for a standard called "Fast CGI".  It basically uses a LWP
with a thread pool to service CGI requests.  The nice thing about Fast CGI
is that any program written to be a Fast CGI executable is still compatible
as a standard CGI executable.

Regards,

Shelby Cain

"Terrance Teoh"  wrote in message
news:[email protected]...
> Hi,
>
> Has anyone seen any CGI done in either C / C++ using threads ?
> Basically I am thinking of reducing resources taken up when there are
> too many
> access being done at the same time ?
>
> Thoughts ? Pointers ? Comments ?
>
> Thanks !
> Terrance
 


=================================TOP===============================
 Q324:  Cancelling detached threads (posix threads)?   

Jason Nye wrote:

> I'm trying to find out whether the posix specification allows the
> cancellation of detached threads. I have a copy of Butenhof's book and in
> the posix mini-reference at the end of the book, he says that pthread_cancel
> should fail if the target thread is detached. This makes sense to me, but is
> this the correct behaviour? For example LinuxThreads does allow cancellation
> of a detached thread -- who is correct?

This is actually an ambiguity in the standard. An implementation that allows
cancellation of a detached thread doesn't violate the standard. HOWEVER, other
provisions of the standard make such an allowance of questionable value, at
best. For example, when a detached thread terminates the state of that thread
is immediately invalidated, and may be reused immediately for a new thread. At
that point (which you cannot determine) you would be cancelling some new thread
that you probably didn't create, with possibly disastrous consequences to your
application.

There's no excuse for ever cancelling a detached thread. If you do, you may be
breaking your application. If it works today, it might not work tomorrow, for
reasons you cannot easily determine.

In other words, regardless of what the standard says, this is an application
error that individual implementations may or may not detect. (And in the
general case, implementations that reuse pthread_t values cannot detect the
cases where it really matters, because when you cancel a reused pthread_t, the
value is valid at the time.)

So, if you believe my interpretation, you're less likely to get yourself into
trouble by taking advantage of dangerous (and in the final analysis, unusable)
loopholes provided by other implementors. ;-)
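
For reference, a sketch of the safe pattern: keep a thread you may want to
cancel joinable, cancel it, then join, so its pthread_t stays valid for the
whole exchange (worker() is a hypothetical start routine).

    #include <pthread.h>

    extern void *worker(void *arg);

    void cancel_worker_example(void)
    {
        pthread_t tid;
        void *result;

        pthread_create(&tid, NULL, worker, NULL);   /* default: joinable */
        /* ... later ... */
        pthread_cancel(tid);
        pthread_join(tid, &result);                 /* result == PTHREAD_CANCELED
                                                       if the cancel took effect */
    }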

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
 


=================================TOP===============================
 Q325: Solaris 8 recursive mutexes broken?   

"Bill Klein"  wrote in message
news:[email protected]...
> Hey,
>
> I came on to this newsgroup prepared to ask exactly this question - I'm
> trying to get recursive mutexes working under Solaris 7 but
> am having no luck at all.
>
> "Spam Me Not"  wrote...
> > Confirmed from sun's web page:
> > http://sunsolve.sun.com/pub-cgi/retrieve.pl?doc=fpatches%2F106980&zone_32=4288299
>
> Have you tried the patch? Does it actually solve all problems?
>
> Thanks!

Yes, I tried the patch.  Look a few messages back on this
message thread, where I posted a program to test this, and the
4 resulting cases.   Case #2 was fixed by the patch, which was
simple pthread_mutex_lock and pthread_mutex_unlock of a recursive
mutex by multiple pthreads.  The patch did not fix case #1, where
I used a mutex that happens to be recursive (but where the
recursion is never used AFAIK) in a pthread_cond_wait.  In this
case, the cond_wait never returns, so there's still a problem
with recursive mutexes even with this patch.  Note that sun's
documentation suggests not using recursive mutexes in cond_waits,
because a recursive mutex with count > 1 won't be fully released
by cond_wait, but I don't think that note applies to my program,
which doesn't take recursive locks of the mutex.

Anyone know how to submit a bug report to sun? :)  Do you have to
register on sunsolve first?

 
    [email protected] writes:
>The problem first showed up in Solaris 7 and there was a
>patch (106980-13) that fixed the problem.
>
>Now I've upgraded my system to Solaris 8 and the problem is
>back.  Obviously the Solaris 7 patch was not propagated
>into the Solaris 8 code and there does not appear to be a
>patch for the Solaris 8 system.  I've searched the SunSolve
>web site with no luck so far.

The Solaris 8 sparc equivalent patch for 106980-13 is 108827-05.

>Does anyone know about this problem and if so, is there a

I don't know what the problem is, because you haven't said.

>workaround for it?

A quick mod to the program to dump out the error gives:

.
.
.
lock: 251, thread 4
lock: 252, thread 4
lock: 253, thread 4
lock: 254, thread 4
lock failed: Resource temporarily unavailable, 255, thread 4
lock failed: Resource temporarily unavailable, 256, thread 4
lock failed: Resource temporarily unavailable, 257, thread 4
lock failed: Resource temporarily unavailable, 258, thread 4
.
.
.

"Resource temporarily unavailable" = EAGAIN

man pthread_mutex_lock...
ERRORS
     EAGAIN
           The mutex could not be acquired  because  the  maximum
           number   of    recursive  locks  for  mutex  has  been
           exceeded.

So what's the fault?  It seems to be behaving exactly as described.

-- 
Andrew Gabriel
Consultant Software Engineer
 


=================================TOP===============================
 Q326:  sem_wait bug in Linuxthreads (version included with glibc 2.1.3)?   

Jason Andrew Nye wrote:

> The problem is that POSIX-style cancellation is very dangerous in C++ code
> because objects allocated on the stack will never have their destructors
> called when a thread is cancelled (leads to memory leaks and other nasty
> problems).

This statement is not strictly true. Only an implementation of POSIX thread
cancellation that completely ignores C++, combined with an implementation of
C++ that completely ignores POSIX thread cancellation, results in a dangerous
environment for applications that use both in combination. Because POSIX
cancellation was designed to work with exceptions (it was in fact designed to
be implemented as an exception), the combination is obvious and natural, and
there's simply no good excuse for it to not work.

Personally, I think it's very near criminal to release an implemenation where
C++ and cancellation don't work together. Developers who do this may have the
convenient excuse that "nobody made them" do it right. The C++ standard doesn't
recognize threads, and POSIX has never dealt with creating a standard for the
behavior of POSIX interfaces under C++. (Technically, none of the POSIX
interfaces are required to work under C++, though you rarely see a UNIX where
C++ can't call write(), or even printf().) Excuses are convenient, but this is
still shallow and limited thinking. I don't understand why anyone would be
happy with releasing such a system.

I spent a lot of time and energy educating the committee that devised the ABI
specification for UNIX 98 on IA64 to ensure that the ABI didn't allow a broken
implementation. Part of this was simply in self defense because a broken ABI
would prohibit a correct implementation.  I'd also had some hope that the
reasonable requirements of the ABI would eventually percolate up to the source
standard. More realistically, though, I hoped that by forcing a couple of C++
and threads groups to get together and do the obviously right (and mandatory)
thing for IA64, they might do the same obviously right (though not mandatory)
thing on their other platforms. Maybe someday it'll even get to Linux.

Please don't settle for this being broken. And especially, don't believe that
it has to be that way. Anyone who can implement C++ with exceptions can create
a language-independent exception facility that can equally well be used by the
thread library -- and, with a few trivial source extensions, by C language code
(e.g., through the POSIX cleanup handler macros).

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
 
On Mon, 06 Nov 2000 09:16:42 -0500, Dave Butenhof 
wrote:
>environment for applications and use both in combination. Because POSIX
>cancellation was designed to work with exceptions (it was in fact designed to
>be implemented as an exception), the combination is obvious and natural, and
>there's simply no good excuse for it to not work.

What should the semantics be, in your opinion? POSIX cleanup handlers first,
then C++ unwinding? Or C++ unwinding first, then POSIX cleanup handlers? Or
should the proper nesting of cleanup handlers and C++ statement blocks be
observed?

I understand that in the Solaris implementation, the POSIX handlers are done
first and then the C++ cleanup.  How about Digital UNIX? My concern is what GNU
libc should do; where there isn't a standard, imitating what some other popular
implementations do would make sense.
 
My opinion is that they should be executed in the only possible correct or useful
order.

( ;-) -- but only for the phrasing, not the message.)

Each active "unwind scope" on the thread must be handled in order. (The opposite
order from that in which they were entered, of course.)

The obvious implementation of this is that both C++ destructors (and catch
clauses) and POSIX cleanup handlers, are implemented as stack frame scoped
exception handlers, and that each handler is executed, in order, as the frame is
unwound by a single common unwind handler.

Any other order will break one or the other, or both.

> I understand that in the Solaris implementation, the POSIX handlers are done
> first and then the C++ cleanup.  How about Digital UNIX? My concern is what GNU
> libc should do; where there isn't a standard, imitating what some other popular
> implementations do would make sense.

I don't know the details of the Solaris implementation, but what you describe is
clearly broken and useless except in trivial and contrived examples.

We, of course, do it "correctly", though it could be cleaner. For example, right
now C++ code can't catch a cancel or thread exit except with the overly general
"catch(...)", because C++ isn't allowed to use pthread_cleanup_push/pop, (and
shouldn't want to since C++ syntax is more powerful), and C++ doesn't have a name
for those "foreign" exceptions. (Of course destructors work fine.) We've worked
with the compiler group to add some builtin exception subclasses to deal with
that, but we never found the time to finish hooking up all the bits.

Our UNIX was architected from the beginning with a universal calling standard that
supports call-frame based exceptions. All conforming language processors must
provide unwind information (procedure descriptors) for all procedures, and a
common set of procedures (in libc and libexc) support finding and interpreting the
descriptors and unwinding the stack. Our C compiler provides extensions to
allow handling these native/common exceptions from C language code. Our
thread library uses these extensions to implement POSIX cleanup handlers. (For other
C compilers, we use a setjmp/longjmp package built on native exceptions "under the
covers", though with some loss of integration when interleaved call frames switch
between the two models. Support for our extensions, or something sufficiently
similar, would allow me to make gcc work properly.) Both cancel delivery and
pthread_exit are implemented as native exceptions. The native unwind mechanism
will unwind all call frames (of whatever origin) and call each frame's handler (if
any) in the proper order. (Another minor glitch is that our exception system has a
single "last chance" handler, on which both we and C++ rely. We set it once at
initialization, but C++ sets it at each "throw" statement, which will break
cancellation or thread exit of the initial thread since we can't put a frame
handler at or below main(). This is also fixed by our not-quite-done integration
effort with C++.)

This is all covered by the IA64 ABI. Of course it specifies API names, and data
structure sizes and contents. It's also somewhat more biased than our
implementation towards C++, since it was a generalization and cleanup of the C++
ABI section on exceptions rather than something designed independently. (The ABI,
and any C++ implementation, had to do this anyway. Making it general was only a
little more work than making it exclusive to C++, and of fairly obvious value.)

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
 


=================================TOP===============================
 Q327: pthread_atfork??   
Steve Watt wrote:

> > [ ... ]  Because the thread that calls fork() owns all of your locks in the
> >parent, and because that thread (and only that thread) exists in the child, the
> >child thread owns all of the locks when the CHILD handler is called, and it can
> >be sure that all of your data (protected by the locks) is clean and consistent.
>
> I mostly agree with this, except for the statement that the child thread
> owns all the locks.  Specifically, what about error checking mutexes
> a'la UNIX98?  Can the child process really unlock them?  Does an
> implementation have to keep some manner of PID information around so that
> new threads in the child would correctly EPERM?

You're of course welcome to agree or disagree, but be warned that when it comes to
matters of POSIX threads interpretations, suggesting disagreement with me can lead
to long and complicated replies containing detailed analyses and interpretations of
the relevant sections of the standard. You've been warned. ;-)

Yes, I deliberately and carefully said that the child owns the locks. That's what I
meant, which is why I said it.

That is, in the child process, the recorded owner (of any mutex for which owner is
recorded) of a mutex locked by the thread that called fork() in the parent IS the
single thread in the child process. POSIX does not specify that the thread ID of the
single thread in the child is identical to the ID of the forking thread in the
parent, but it does require that any necessary "transfer of ownership" be made
transparent to the application.

If you locked it in a PREPARE handler, you can unlock it in the CHILD handler, no
matter what type of mutex it was.
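
As a minimal illustration (added here, not part of the original post), the
PREPARE/PARENT/CHILD pattern looks roughly like this; app_lock stands in for
whatever locks protect your shared state:

#include <pthread.h>

static pthread_mutex_t app_lock = PTHREAD_MUTEX_INITIALIZER;

static void prepare(void) { pthread_mutex_lock(&app_lock); }   /* quiesce before fork */
static void parent(void)  { pthread_mutex_unlock(&app_lock); } /* resume in the parent */
static void child(void)   { pthread_mutex_unlock(&app_lock); } /* the child owns it too */

static void install_fork_handlers(void)
{
    pthread_atfork(prepare, parent, child);
}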

At least... this is the INTENT of the POSIX working group. Unfortunately, the text
of the standard is somewhat less clear than one might like. Little is said about
what pthread_atfork() does, or how or why you might use it, except in the RATIONALE
section, which is explicitly NOT a binding part of the standard. (It's commentary
and explanations, but can place no requirements on either implementation or
application.) The description of fork() also is not particularly useful because,
despite the clear implication (by having pthread_atfork()), the standard says that
applications may call only "async-signal safe" functions between the return from
fork() (in the child) and a call to one of the async-signal safe exec*() functions,
and mutex operations are not async-signal safe. (But then, technically, the atfork
CHILD handlers, which are called implicitly by the user-mode wrapper of the _fork()
syscall, are not actually called "by the application" after return from fork();
thereby adding yet another level of fuzzy haze to the dilemma.)

What this all means is that we (the working group) didn't spend nearly enough time
reviewing the vast body of the POSIX standard to find words and implications that
should have been changed. Originally, the thread standard was a completely separate
document, though it modified certain sections of 1003.1. That made a thorough review
awkward. Eventually, the 1003.1c amendment text was integrated with the standard. We
found many of the resulting inconsistencies and holes -- but not all of them.
Unfortunately, some areas, like this one, are not mere editorial changes; fixing the
standard to say what we meant could break some implementations that currently
conform to the letter (while violating the spirit).

What this really means is that use of pthread_atfork() may be broken (and unusable)
on some implementations; and those implementations may not be technically
"nonconforming". We were always aware this would occur in some cases, because we
knew we couldn't make the standard perfect. Many such issues that came up were
dismissed as simple matters of "quality of implementation". Nobody, obviously, would
buy a broken implementation. (The flip side, to which we didn't pay sufficient heed,
is that people DO buy broken implementations all the time, or are forced to use such
systems bought by others, learn to accept the limitations, and even expect them of
other systems.)

"Life's not fair."

> What about thread-specific data?  Should the thread in the child get the
> same thread-specific data as the thread that called fork()?  What if
> the result of pthread_self() is different, such that pthread_equal won't
> say they're equal?

The standard doesn't require that the thread ID will be the same, though we assumed
it usually would be. This wasn't an omission. While we said that thread IDs are
private to the process, there was some interest in "not precluding" an
implementation where thread IDs are global. If thread IDs are global, the thread in
the child must have a unique ID. This silly and irrelevant intent, however, has
certain implications, adding to the general "fuzz" around fork(), because it implies
that the ownership information of mutexes would need to be fixed up; but that's not
actually required anywhere. (In fact, this could be considered a technical violation
of the requirement that the child has a copy of the full address space, "including
synchronization objects".)

Nevertheless, in any implementation crafted by "well intentioned, fully informed,
and competent" developers, it must be possible to use pthread_atfork() (with
sufficient care) such that normal threaded operation may continue in the child. On
any such implementation, the thread ID of the child will be the same as in the
parent, all mutexes properly locked in PREPARE handlers will be unlockable in
CHILD handlers, all thread-specific data attached to the forking thread in the
parent will be accessible to the single thread in the child, and so forth.

This may not apply to Linuxthreads, if, (as I have always assumed, but never
verified), the "thread ID" is really just the pid. At least, pthread_self() in the
child would not be the same as in the parent. This is just one of the reasons that
building "threads" on top of processes is wrong; though the nonconformance of the
consequences here are, as I've detailed, somewhat less clear and absolute than in
other places. (Nevertheless, this substantially and clearly violates the INTENT of
the working group, and may render pthread_atfork(), an important feature of the
standard, essentially useless.) The thread library could and should be smart enough
to fix up any recorded mutex (and read-write lock) ownership information, at least;
and TSD should be carried over because it's "just memory" and there's no reason to
do anything else.

> >The CHILD handler may also just unlock and continue, though more commonly it
> >will do some cleanup or reinitialization. For example, it might save the current
> >process pid somewhere, or reset counters to 0.
>
> I generally think that about the only good thing to do in the child
> handler is re-initialize the IPCs.

"The only good thing" to do in CHILD handlers is whatever is necessary to clean up
and get ready for business. If you don't do that, there's no point to even
bothering... in which case you just can't expect to fork() a threaded process at
all.

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
 
> Does the std address a forkall() concept vs. fork1()?

There's some rationale (commentary that's not part of the standard) explaining that
forkall was proposed and rejected. It doesn't bother to explain the problems with
forkall.

> > the requirement that the child has a copy of the full address space
>
> Implying _all_ threads too - a "forkall()" concept. Doesn't POSIX
> replace fork() with fork1(), thus the above requirement is
> not violated since the "true" fork() is not called?

"Full address space" doesn't imply "all threads" at all, except perhaps to a "pure user
mode" thread library. Kernel threads don't live in the process address space.

POSIX doesn't "replace fork" with anything. POSIX **defines** fork. Rather, Solaris
"replaces fork" with their proprietary fork1 interface. (Though only, of course, in
nonstandard compilation environments.)

The concept of "forkall" is foolish. You can't arbitrarily replicate execution streams
unless you can tell them what happened, and there's simply no way to do that. (Solaris
allows that threads in blocking syscalls "might" return EINTR, but that's all, and it's
not nearly enough.) With a single execution context for each process, fork was just fine,
because the execution stream asked for the copy, and knows what happened on both sides of
the fork. When you have multiple independent execution contexts, you have to deal with
the fact that you don't know what any other context is doing, and it doesn't know what
you're doing.

A lot of new mechanism would have to be invented, and many complicated constraints added,
to make "forkall" a useful interface. Each cloned execution context would need to be
immediately notified, and it would need to be able to "clean up" in whatever way
necessary, including terminating itself. This might be done by delivering a signal, but
much of the cleanup likely to be necessary (and thread termination) cannot be done in a
signal handler.

Forkall was proposed. We discussed it a lot. We dismissed it as far too complicated, and
way beyond any rational interpretation of the working group's scope and charter. To some
people who don't look deeply, forkall seems "simpler" than pthread_atfork; but it is
actually vastly more complicated. Unless you don't care about correctness or usability.

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/

=================================TOP===============================
 Q328: Does GNU Pth library support process shared mutexes? 


CoreLinux++ WILL support process shared mutexes through a combination of
shared memory and semxxx. This will take a few weeks to implement and will
require that all applications needing this use the
libcorelinux++ libraries.

It is also C++.

Frank V. Castellucci
http://corelinux.sourceforge.net
=================================TOP===============================
 Q329: I am trying to make a thread in Solaris to get timer signals. 


I am trying to make a thread in Solaris to get timer signals
every second. I am using setitimer() and sigwait() to set up
and catch the signals respectively.

I am sorry to tell you that setitimer()/sigwait() does not work
with the threads/pthreads library on Solaris.  I won't go into
the details, but it is a sorry tale.

To make a thread do a periodic action, use cond_timedwait()
on a dummy cond_t/mutex_t pair that has no other function
than to be used in the call to cond_timedwait().

The thread will wakeup at the time you specify.
It can then do the periodic thing and reissue the
cond_timedwait() to wait another interval.

Roger Faulkner
[email protected]
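
The same idea expressed with POSIX calls, as a rough sketch added for
illustration (the condition variable exists only to be waited on; nobody
ever signals it):

#include <pthread.h>
#include <time.h>

static pthread_mutex_t dummy_mx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  dummy_cv = PTHREAD_COND_INITIALIZER;

static void *ticker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&dummy_mx);
    for (;;) {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);  /* default condvar clock */
        ts.tv_sec += 1;                      /* wake up one second from now */
        pthread_cond_timedwait(&dummy_cv, &dummy_mx, &ts);
        /* do the periodic work here */
    }
    return NULL;
}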

=================================TOP===============================
 Q330: How do I time individual threads? 


I am getting very puzzling behavior using 2 threads on a 2 processor Solaris
computer. I am new to Solaris, so I am probably just doing something stupid.

I am writing code to parallelize some numeric computations. It was
originally written on NT, and I am porting it to Solaris. I am using a dual
processor Dell NT and a dual processor Sun Solaris for development. The
threads are very simple, and can operate completely independently. The code
ports easily from the point of view of compiling, linking, and executing.
However, on NT, I get over 90% speedup in using two threads, but on Solaris
I get almost none (at most about 15%).

Simplified example code is shown below. 
double GetTime()
{
   return ((double)clock())/CLOCKS_PER_SEC;
}


Not stupid, just a misinterpretation of clock(3C).

This is from the clock(3C) manual page:

  DESCRIPTION
     The clock() function returns the  amount  of  CPU  time  (in
     microseconds)  used  since  the first call to clock() in the
     calling process.

What you get from clock() is the CPU time used by all threads
in the process since the last call to clock().
To do your timing, you want to get the elapsed time.

This is my modification to testth.cpp (times() returns the number
of ticks (HZ) since some time in the past):

#include <sys/types.h>
#include <sys/times.h>
#include <sys/param.h>
...
double GetTime()
{
    struct tms dummy;
    return (times(&dummy))/(double)(HZ);
}
 
Roger Faulkner
[email protected]
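
For completeness, a sketch (added, not from the post above) of measuring
elapsed wall-clock time with gettimeofday(), another common way to time
parallel speedup:

#include <sys/time.h>

/* Wall-clock (elapsed) time in seconds. */
double GetElapsedTime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}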

=================================TOP===============================
 Q331: I'm running out of IPC semaphores under Linux! 

 

>>>>> "Doug" == Doug Hodson  writes:

    Doug> Now that I have IPC semaphores working under Linux, I am
    Doug> running into another problem. I'm running out of them!!! Do
    Doug> I have to recompile the kernel to make more available?

echo 500 > /proc/sys/kernel/sem

Replace 500 with whatever you want.  (untested)
=================================TOP===============================
 Q332: Do I have to abandon the class structure when using threads in C++? 


> in C++ much easier. Currently to use threads in C++ you have to
> virtually abandon the class structure and the type checking and
> resort to low level hacking when using threads.

No, that's incorrect.  This problem only occurs if you do not
understand the appropriate patterns and idioms for effective
multi-threaded programming in C++.  We've been developing and
deploying high-performance and real-time OO applications in C++ for
the past decade, and there's now plenty of "collective wisdom" on how
to do this properly and abstractly using lots of nice high-level C++
features.  I recommend that you check out the following resources for
more information:

http://www.cs.wustl.edu/~schmidt/Concurrency.ps.gz
http://www.cs.wustl.edu/~schmidt/ACE-papers.html
http://www.cs.wustl.edu/~schmidt/patterns-ace.html
http://www.cs.wustl.edu/~schmidt/patterns/patterns.html

All of these resources are based on the threading abstractions
provided with ACE, which is a freely-available, open-source framework
that defines a rich source of components for concurrent and network
programming.  You can download it from:

http://www.cs.wustl.edu/~schmidt/ACE.html

BTW, I'm teaching a short-course at UCLA in a couple weeks that'll 
cover all this material in depth.  You can download the course notes
and learn more about the course itself at

http://www.cs.wustl.edu/~schmidt/UCLA.html

Take care,

        Doug 
=================================TOP===============================
 Q333: Questions about pthread_cond_timedwait in linux. 


>i've been programming threads for years and am just moving to linux
>(redhat) and have questions about pthread_cond_timedwait:
>
>pthread_cond_timedwait takes a struct timespec * and looking at the
>example in the doc, the fields are initialized from the fields
>in gettimeofday plus an offset.  that raises the following questions:
>  what happens if the offset puts the tv_sec over the maximum value
>  for that day?

The tv_sec field is the number of seconds since the epoch. If this overflows,
then the year must be 2038, and you aren't using the latest 256 bit hardware.

:) :) :)

>  What happens if the clock is changed? (like a dst adjustment)

The Linux implementation of pthread_cond_timedwait converts the absolute time
to a relative wait, which is then the subject of a nanosleep() call (with the
delivery of a signal cutting that sleep short when the condition is cancelled).
The nanosleep system call in Linux is based on the system clock tick and not on
calendar time. So the answer is that changing the system time will have no
effect on when the call wakes up. Namely, moving the date forward will not
cause an immediate wakeup.

However, if an unrelated signal (not due to the condition wakeup) interrupts
the pthread_cond_timewait, it will call gettimeofday() again and recompute
the relative wait. At that time it may wake up due to the date change.

=================================TOP===============================
 Q334: Questions about using pthread_cond_timedwait. 


I need your help to clarify something...

Consider the following code:

void foo()
{
struct timeval tv;
struct timespec ts;

if (gettimeofday(&tv, NULL) < 0)
    // error handling stuff here

// Convert and store to structure that pthread_cond_timedwait wants
ts.tv_sec = tv.tv_sec;
ts.tv_nsec = tv.tv_usec * 1000;

// Add 10 milli-sec (this is how long I want to wait)
ts.tv_nsec += 10 * 1000 * 1000;

while (MyPredicate == false)
    {
    status = pthread_cond_timedwait(&condvar, &mutex, &ts);
    // do stuff depending on status
    }

// Other stuff goes here
}

The problem is that I get lots of ETIMEDOUTs...

Here come the questions:

1) On a normal PC (single processor) running linux,
what is the minimum time I can wait???
I assume 10 milli-sec is ok...

2) On the other end of the scale, what is the max time
I can wait ???  e.g. can I put 300 milli-sec
(i.e. ts.tv_nsec += 300 * 1000 * 1000)???

I am asking because gettimeofday will return the time
in a timeval.  If I just increase the usec and not the
seconds, are there overflow problems ???

if tv.tv_sec is X and tv.tv_usec is 999.999,
if I increase by 100.000 is that going to keep the
seconds the same and go to 099.999, or is
it clever enough to either

increase X to X+1 OR make usec equal to 1.099.999 ???

What I am thinking is that the ETIMEDOUTs might be
because the new time ends up being EARLIER that
the current time.
 

pthreads conditional waits use an absolute time to specify the timeout
not a relative time. In general you will get the _time now_ and add some
delta to determine the absolute time corresponding to the relative
timeout delta that you wish.

That's the theory. In practice system operators can totally screw you up
by adjusting the clock which changes the machine's notion of the current
absolute time. There isn't an awful lot that you can do about this
problem except...

Certain versions of Unix provide clock_gettime; among those versions of
Unix some will support CLOCK_MONOTONIC, a type of clock that always
advances at the same rate regardless of changes to the machine's
absolute clock. A monotonic clock will be very useful to use in conjunction
with relative timeouts.

The trouble with this is that while the monotonicity of the clock used
for conditional waits is the default, it seems to be associated with the
condition variable attribute. How then, are you supposed to compute the
timeout value? My guess is clock_gettime with CLOCK_MONOTONIC + delta
should be used but I can't be sure. Also, what happens if the condition
variable attribute is initialized to specify a non-monotonic clock and
we use a monotonic clock to compute the timeout?

If anybody has up to date information on this I'd like to hear about it.
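
For what it's worth, later systems answer this with pthread_condattr_setclock();
a rough sketch added for illustration, assuming CLOCK_MONOTONIC support (the
helper names are invented):

#include <pthread.h>
#include <time.h>

pthread_cond_t  cv;
pthread_mutex_t mx = PTHREAD_MUTEX_INITIALIZER;

void init_monotonic_cv(void)
{
    pthread_condattr_t ca;
    pthread_condattr_init(&ca);
    pthread_condattr_setclock(&ca, CLOCK_MONOTONIC); /* if supported */
    pthread_cond_init(&cv, &ca);
    pthread_condattr_destroy(&ca);
}

int wait_ms(long ms)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);   /* same clock as the condvar */
    ts.tv_sec  += ms / 1000;
    ts.tv_nsec += (ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) { ts.tv_sec++; ts.tv_nsec -= 1000000000L; }
    return pthread_cond_timedwait(&cv, &mx, &ts);  /* caller holds mx */
}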

 
>> >// Add 10 milli-sec (this is how long I want to wait)
>> >ts.tv_nsec += 10 * 1000 * 1000
>>
>> The problem with this statement is that it may potentially increase
>> the value of tv_nsec beyond one billion less one, thus giving
>> rise to an invalid struct timespec.
>>
>
>Just to clarify, the tv_usec field (although a long) will only go up to
>999.999 (max value gettimeofday will return for usec).

Or,   999,999   for those of us whose locale calls for , as a digit separator
symbol. 

;)

>Since a long goes up to 2.xxx.xxx.xxx, if I go above 999.999
>this is considered an illegal value...

Yes.

>And also when I convert to a timespec to use with pthread_cond_timedwait,
>although again the tv_nsec field is a long, I am only allowed to
>go up to 999.999.000 (or 999.999.999 ???)

Yes, up to 999999999.

>If yes, how come the conditional variable returns with an
>ETIMEDOUT and NOT with a EINVAL ???

Because the behavior is simply undefined when you pass a bogus timespec;
undefined means that any response is possible, including ETIMEDOUT.

The Single UNIX Specification does not require the pthread_cond_timedwait
function to detect bad timespec structures.  If the programmer has taken
care that the structures have valid contents, checking them is just
a waste of cycles; and the programmer who gets them wrong will likely
also ignore the return value of pthread_cond_timedwait().

See

http://www.opengroup.org/onlinepubs/007908799/xsh/pthread_cond_timedwait.html
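
A small added sketch of building a valid absolute timeout, carrying any
tv_nsec overflow into tv_sec (the helper name is invented for illustration):

#include <sys/time.h>
#include <time.h>

/* Build an absolute timeout 'ms' milliseconds from now, normalizing
   tv_nsec so it stays below 1,000,000,000. */
void abs_timeout(struct timespec *ts, long ms)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    ts->tv_sec  = tv.tv_sec + ms / 1000;
    ts->tv_nsec = tv.tv_usec * 1000L + (ms % 1000) * 1000000L;
    if (ts->tv_nsec >= 1000000000L) {
        ts->tv_sec  += ts->tv_nsec / 1000000000L;
        ts->tv_nsec %= 1000000000L;
    }
}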
=================================TOP===============================
 Q335: What is the relationship between C++ and the POSIX cleanup handlers? 


>Ian Collins wrote:
>> 
>> Stefan Seefeld wrote:
>> >
>> > Ian Collins wrote:
>> >
>> > > Cleanup handlers are not a good place to destroy objects; use some
>> > > form of container object that points to the object and deletes it in its
>> > > destructor
>> > > (a smart pointer) to do this.
>> >
>> > that would indeed be the right thing, if....if thread cancelation
>> > would do proper stack unwinding.
>> >
>> > Stefan
>> 
>> I would hope it does - the Solaris one does.
>
>Only if you use Sun's own compiler and even then - not always.

From what I understand, in this compiler, when a thread is canceled, it acts as
if some special exception was thrown which is never caught and which does not
call unhandled()---in other words, just the unwinding is performed.

What is the relationship between this unwinding and the POSIX cleanup handlers?

Are these properly interleaved with the unwinding, or are they done first?

In other words, what I'm asking is: are the POSIX cleanup handlers somehow
hooked into the destructor mechanism so that if I have this
construct:

    {
    Object foo;

    pthread_cleanup_push(A, ...)

    {
        Object bar;

        pthread_cleanup_push(B, ...)

        pthread_exit(PTHREAD_CANCELED);

        pthread_cleanup_pop(1);

    }

    pthread_cleanup_pop(1);
    }

what happens? Ideally, handler B() would get called, then the destructor of
object bar, followed by the handler A, and then the destructor of object foo.

For that to work, the cleanup handlers would have to be hooked into the
unwinding mechanism of the C++ implementation, rather than on a separate stack.

E.g. pthread_cleanup_push(X, Y) would be a macro which expands to something
like:

    {
    __cleanup_obj __co(X, Y);

where __cleanup_obj is a class object with a destructor which calls the
handler---except that some compiler extensions are used to make this work in C
as well as C++.

I know there are ways to add these kinds of hooks into GCC.  I'm thinking that
it would be nice to add this support to LinuxThreads, and it would also be
nice if it was compatible with some existing good or popular scheme.

LinuxThreads currently keeps its own chain of cleanup nodes, but there is no
reason not to use some GCC extensions to get it to use the destructor mechanism
instead (other than breaking binary compatibility; i.e this would have
to wait until glibc 2.2.x, and support for 2.1.x and 2.0.x cleanup
handling would have to be retained.)

=================================TOP===============================
 Q336: Does select() work on calls recvfrom() and sendto()? 


>>> I hope that some one can help me see the light.
>>> 
>>> Assume:
>>> 1. A socket based server.
>>> 2. On a client connection server creates a child-server thread to
>>>    take care of this client.
>>> 3. Child-server implements a retransmission of packet on negative
>>>    ACK (uses alarm signal for time out)
>>
>>Why not use select() with a timeout to block each child-server thread 
>>instead of alarm?
>
>Does select() work on calls recvfrom() and sendto()? I am under the
>impression it only works on connection oriented sockets accept(),
>read(), write(), recv(), send() etc. Please give a simple sketch of
>the usage?

No, select works on datagram sockets as well. On UNIX-like systems, select also
works on other kinds of objects: regular files (not really useful there),
terminal devices, printer ports, etc.

It works pretty much the same way on datagram sockets as it does on stream
sockets. Read availability means there are one or more datagrams waiting. Write
availability means there is buffer space for datagrams.
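
A rough added sketch of using select() with a timeout in front of recvfrom()
on a datagram socket (the helper name is invented for illustration):

#include <sys/types.h>
#include <sys/select.h>
#include <sys/socket.h>

/* Wait up to 'sec' seconds for a datagram on 'fd', then receive it
   only if select() reports the descriptor readable. */
ssize_t recv_with_timeout(int fd, void *buf, size_t len, int sec,
                          struct sockaddr *from, socklen_t *fromlen)
{
    fd_set rset;
    struct timeval tv = { sec, 0 };

    FD_ZERO(&rset);
    FD_SET(fd, &rset);
    if (select(fd + 1, &rset, NULL, NULL, &tv) <= 0)
        return -1;                       /* timeout or error */
    return recvfrom(fd, buf, len, 0, from, fromlen);
}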
=================================TOP===============================
 Q337: libc internal error: _rmutex_unlock: rmutex not held. 

>Hi,
>
>I'm writing a distributed application using Iona's Orbix ORB and RogueWaves
>ToolsPro class libraries. Since the application is multi-threaded, I'm using
>RogueWave's Threads.h++ also. The application is build over POSIX threads.
>When I run the application I get the following error:
>
>libc internal error: _rmutex_unlock: rmutex not held.
>
>...and the application just hangs there. I have tried moving to Solaris
>threads
>instead, but of no good use. I tried some sample thread programs but
>they all
>worked fine.
>
>Is there something I'm missing? A quick reply or advice will be greatly
>appreciated as the deadlines are short and customer not in a good mood
>:-)

The message you are getting indicates an internal inconsistency
in libc.  The standard I/O implementation (in libc) uses mutex
locks to protect its internal data structures (the stuff behind
the FILE struct).  The message is saying that some thread that is
doing standard I/O attempted to unlock a lock that it does not own.
This, of course, "cannot happen".

It could be caused by the application (overwriting memory)
or it could be an inconsistency between libc and libthread
(caused by linking libc statically but libthread dynamically
[you can do static linking with libc but there is no static
libthread]) or the application could be defining its own
_thr_main() function that subverts the ones in libthread and libc.

To make any progress on the problem, I'd need a test case that
exhibits the problem.  I have to admit that I know nothing about
the other software you are using (Iona's Orbix ORB and RogueWave's
ToolsPro class libraries and RogueWave's Threads.h++) but they
might interfere with the proper working of libthread and libc
(I'm just speculating here, not accusing).  And, of course, there
could be a bug somewhere in libthread/libc.

One thing you could do to discover more about the problem would
be to apply a debugger (adb or dbx or gdb) to the hung process.
Also you can get stack traces of all the threads in the process
by applying the pstack command to the hung process:
    $ /usr/proc/bin/pstack <pid>

What release of Solaris are you running?
Would it be possible for you to send me your program?
Maybe we should just continue this privately via e-mail
rather than on the newsgroup.  Feel free to send me mail.

Roger Faulkner
[email protected]
===== 
From: Boris Goldberg  

It may happen due to incorrect order of linking with libc and libthread.

You must link with libthread before libc. That can be ensured by
specifying the
-mt flag on the link line.


do ldd on your program: if you see libthread after libc, that's your
problem

=================================TOP===============================
 Q338: So how can I check whether the mutex is already owned by the calling thread? 


On Mon, 27 Mar 2000 12:08:54 GMT, [email protected]  wrote:
>Thanks for all your qualified contributions.
>This is what I've learned:
>If I want to abide by the POSIX standard on UNIX platforms I'd better
>drop the habit of using recursively lockable mutexes. OK, so be it. But
>I'd really love to port a lot of existing C++ code and use it on Linux.

Obviously. Any sort of proscription against recursive mutexes must be weighed
against the pressing need to port a whole lot of code that needs them.

>So how can I implement my Mutex-Lock-class in a way that it checks
>whether the mutex is already owned by the calling thread?

Very easily. The class simply has to store a counter and the ID of the
owning thread. These can be protected by an additional internal mutex.

>It looks to me that if I just put in an additional boolean flag, no
>thread can safely check this flag because it may be changed
>simultanously by another thread.
>Given a mutex class "NThreads::Mutex" (that used to be recursivly
>lockable), the class NThreads::MutexLock has been implemented as you
>can see below (abbreviated). How can I change it to make it work with a
>non-recursive Mutex class?
>
>namespace NThreads
>{
>class Mutex
>{
>  friend class MutexLock;
>public:
>    Mutex();
>   ~Mutex();
>    bool lock( int timeout )
>    {
>      //return true if not timed out
>    }

These kinds of strategies aren't all that useful, except for debugging
assertions.  About all you can do in the case of such a timeout is to log an
error that a deadlock has probably occurred and then abort the application.
It is an internal programming error that is not much different from a
bad pointer dereference, or divide by zero, etc.

>    void unlock()
>    {
>      // unlock system mutex
>    }
>private:
>    void lock()
>    {
>      // lock with infinite timeout, no need for return value, but
>dangerous.
>    }
>};
>
>class MutexLock
>{
>  public:
>MutexLock( Mutex& mtx ) : rMtx_( mtx )
>{
>  rMtx_.lock();
>}

I see, this is just one of those safe lock classes whose destructor
cleans up.  

It is the Mutex class that should be made recursive, not the safe lock
wrapper, as in:

    #ifdef USE_POSIX_THREADS

    class Mutex {
        pthread_mutex_t actual_mutex_;
        pthread_mutex_t local_mutex_;
        pthread_t owner_;
        int recursion_count_;
    public:
        Mutex();
        ~Mutex();
        void Lock();
        void Unlock();
        void Wait(Condition &);
    };

    #endif

    #ifdef USE_OS2_THREADS

    // definition of Mutex class for OS/2

    #endif

The methods definitions for the POSIX variant would look something like this:

    Mutex::Mutex()
    : recursion_count_(0)
    {
    pthread_mutex_init(&actual_mutex_, NULL);
    pthread_mutex_init(&local_mutex_, NULL);
    // leave owner_ uninitialized
    }

    Mutex::~Mutex()
    {
    assert (recursion_count_ == 0);
    int result = pthread_mutex_destroy(&actual_mutex_);
    assert (result == 0);
    result = pthread_mutex_destroy(&local_mutex_);
    assert (result == 0);
    }

    void Mutex::Lock()
    {
    pthread_mutex_lock(&local_mutex_);

    if (recursion_count_ > 0 && pthread_equal(pthread_self(), owner_)) {
        assert (recursion_count_ < INT_MAX); // from <limits.h>
        recursion_count_++;
    } else {
        pthread_mutex_unlock(&local_mutex_);
        pthread_mutex_lock(&actual_mutex_);
        pthread_mutex_lock(&local_mutex_);
        assert (recursion_count_ == 0);
        recursion_count_ = 1;
        owner_ = pthread_self();
    }

    pthread_mutex_unlock(&local_mutex_);
    }

    void Mutex::Unlock()
    {
    pthread_mutex_lock(&local_mutex_);

    assert (pthread_equal(pthread_self(), owner_));
    assert (recursion_count_ > 0);

    if (--recursion_count_ == 0)
        pthread_mutex_unlock(&actual_mutex_);

    pthread_mutex_unlock(&local_mutex_);
    }


Or something along these lines. I haven't tested this code.   I did make sure
that wherever both locks are  held, they were acquired in the same order  to
prevent the possibility of deadlock. It's more or less obvious that you must
never try to acquire the actual mutex while holding the local one.

A condition wait requires special trickery:

    void Mutex::Wait(Condition &cond)
    {
    pthread_mutex_lock(&local_mutex_);

    assert (pthread_equal(pthread_self(), owner_));
    assert (recursion_count_ > 0);
    int saved_count = recursion_count_;
    recursion_count_ = 0;

    pthread_mutex_unlock(&local_mutex_);

    pthread_cond_wait(&cond.cond_, &actual_mutex_);

    pthread_mutex_lock(&local_mutex_);

    assert (recursion_count_ == 0);
    recursion_count_ = saved_count;
    owner_ = pthread_self();

    pthread_mutex_unlock(&local_mutex_);
    }

I hope you can massage this into something that works. If I messed up, flames
will ensue.

    -------------------------- 

As a followup to my own posting, I want to make a remark about this:

>A condition wait requires special trickery:
>
>    void Mutex::Wait(Condition &cond)
>    {
>    pthread_mutex_lock(&local_mutex_);
>
>    assert (pthread_equal(pthread_self(), owner_));
>    assert (recursion_count_ > 0);
>    int saved_count = recursion_count_;
>    recursion_count_ = 0;
>
>    pthread_mutex_unlock(&local_mutex_);
>
>    pthread_cond_wait(&cond.cond_, &actual_mutex_);
>
>    pthread_mutex_lock(&local_mutex_);
>
>    assert (recursion_count_ == 0);
>    recursion_count_ = saved_count;
>    owner_ = pthread_self();
>
>    pthread_mutex_unlock(&local_mutex_);
>    }

Firstly, there is no condition checking while loop around the pthread_cond_wait
because it is assumed that the caller of Mutex::Wait() will implement the
re-test. The intent here is only to wrap the call. Thanks to John Hickin
for raising this in an e-mail.

Secondly, because pthread_cond_wait is a cancellation point, it is necessary
to deal with the possibility that the waiting thread may be canceled. If that
happens, the actual_mutex_ will be locked by the canceled thread, but the state
of the owner_ and recursion_count_ will not be properly recovered.   Thus
the user of the class has no recovery means.

This requires a messy change, involving an extern "C" redirection function
which calls a method that does mutex reacquire wrapup.  There is a need
to communicate the saved recursion count to the cleanup handler, as well
as the identity of the mutex object, using a single void * parameter, so 
a context structure is introduced:

    struct MutexContext {
    Mutex *mtx_;
    int saved_count_;
    MutexContext(Mutex *m, int c) : mtx_(m), saved_count_(c) { }
    };

The cleanup handler is then written, which takes the context and
calls the object, passing it the saved count:

    extern "C" void Mutex_Cancel_Handler(void *arg)
    {
    MutexContext *ctx = (MutexContext *) arg;

    ctx->mtx_->CancelHandler(ctx->saved_count_);
    }

The code that is executed at the end of the old version of Mutex::Wait
is moved into a separate method. This assumes that actual_mutex_ is
locked on entry, which is the case if the pthread_cond_wait is canceled.

    void Mutex::CancelHandler(int saved_count)
    {
    // actual_mutex_ is locked at this point

    pthread_mutex_lock(&local_mutex_);

    assert (recursion_count_ == 0);
    recursion_count_ = saved_count;
    owner_ = pthread_self();

    pthread_mutex_unlock(&local_mutex_);
    }

Finally, Wait() is revised to look like this:

    void Mutex::Wait(Condition &cond)
    {
    pthread_mutex_lock(&local_mutex_);

    assert (pthread_equal(pthread_self(), owner_));
    assert (recursion_count_ > 0);

    MutexContext context(this, recursion_count_);
    recursion_count_ = 0;

    pthread_mutex_unlock(&local_mutex_);

    // Ensure cleanup takes place if pthread_cond_wait is canceled
    // as well as if it returns normally.

    pthread_cleanup_push(Mutex_Cancel_Handler, &context);

    pthread_cond_wait(&cond.cond_, &actual_mutex_);

    pthread_cleanup_pop(1);
    }
=================================TOP===============================
 Q339: I expected SIGPIPE to be a synchronous signal. 


> >Using Solaris threads under Solaris 5.7.
> >
> >I would have expected SIGPIPE to be a synchronous signal when it
> >occurs as a result of a failed write or send on a socket that has
> >been disconnected.  Looking through past articles in Deja seemed to
> >confirm this.
> >
> >However, I thought I would undertake the radical idea of actually
> >testing it.  In my tests it looks as if it's an asynchronous signal.
> 
> Yes, it is an asynchronous signal in Solaris.
> This is not a bug in Solaris; it is intentional.
> 
> The purpose of SIGPIPE is to kill a process that is writing
> to a pipe but that has made no provision for the pipe being
> closed at the other end.
> 

On HP-UX, SIGPIPE is a synchronous signal and one shouldn't even try
'sigwait'-ing for it. Sounds logical too. Any reason why it's different
on Solaris7? The above paragraph didn't seem like a very convincing
reason. 

Thanks,

-- Rajiv Shukla

> If you want to deal with a pipe or socket being closed, then either
> mask SIGPIPE or catch it with a do-nothing signal handler and test
> the errno that comes with a failed write() or a send() operation.
> If it is EPIPE, then that corresponds to SIGPIPE, and you have
> gotten the answer synchronously.


In Digital (Tru64) Unix, we made SIGPIPE a synchronous signal,
and I still believe that's the right disposition for it. Uncaught,
I will terminate the process. Caught, it allows corrective
action to occur in the thread that cares about the broken
connection. Useful? Barely. More accurate? Much.

That aside, the best thing to do with SIGPIPE is to
set it  to SIG_IGN and pick up the EPIPE error return
on the write() call. Masking/catching the signal
isn't the right thing to do if you don't care about
the signal, and you most likely don't.
It's cheapest to ignore it and move on.

Jeff
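
A minimal added sketch of the suggestion above (the names are invented for
illustration):

#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Ignore SIGPIPE once at startup... */
void setup(void)
{
    signal(SIGPIPE, SIG_IGN);
}

/* ...and handle the EPIPE return from write() where it happens. */
ssize_t write_checked(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);
    if (n == -1 && errno == EPIPE) {
        /* peer closed the connection; clean up this descriptor */
    }
    return n;
}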
=================================TOP===============================
 Q340: I have a problem between select() and pthread... 


>Hi! everyone..
>
>I have a problem with synchronization between select() and pthreads...
>
>That is as follows...
>
>the main thread is blocking in the select() func.
>and at the same time, another thread closes a socket descriptor that is in the
>fd_set..  this causes an EBADF error in select().
>so, I wrote in main thread:
>
>SELECT_LABEL:
>    if ((nready = select(nfds, readfds, writefds, exeptionfds)) == -1) {
>        if (errno == EBADF) goto SELECT_LABEL;
>        perror("select()");
>    }
>
>But that does not solve it...
>after the goto, I get the EBADF error from select() indefinitely.
>
>How do I solve this???
>after select(), should I close the socket descriptor??
>or should only *ONE* thread control socket descriptors??
>
>I use the POSIX thread on Solaris 7..

You have to figure out in the main thread which file descriptor
was closed by the other thread and delete its bit from the fdset's
before reissuing the select().  The select interface() itself
will not help you to determine this.

In Solaris, select() is implemented on top of poll(2).  If you
use the poll() interface directly, then a closed file descriptor
will show up in the array of pollfd's with revents containing the
POLLNVAL bit.  Thus the poll() interface will tell you which file
descriptor has been closed and you can stop polling on it.

Roger Faulkner
[email protected]
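
A small added sketch of pruning closed descriptors after poll(), per the
description above (the helper name is invented):

#include <poll.h>

/* After poll() returns, drop any descriptor that another thread has
   closed (poll reports it with POLLNVAL in revents). */
void drop_closed(struct pollfd *fds, nfds_t *nfds)
{
    nfds_t i = 0;
    while (i < *nfds) {
        if (fds[i].revents & POLLNVAL)
            fds[i] = fds[--*nfds];   /* overwrite with the last entry */
        else
            i++;
    }
}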
=================================TOP===============================
 Q341: Mac has Posix threading support. 


> I'm looking at a cross-platform strategy for our application.
> Threads was one issue which came up, and Posix threads seems like a good
> prospect.

> It is supported under Windows (http://sourceware.cygnus.com/pthreads-win32/)
> and Unix, but I don't think Mac has Posix threading support.

I'm maintaining a free (nonpreemptive) pthreads library, available at

ftp://sunsite.cnlab-switch.ch/platform/macos/src/mw_c/GUSI*

Matthias

-- 
Matthias Neeracher      http://www.iis.ee.ethz.ch/~neeri
   "I really don't want the SNMP agent controlling my toilet to tell
    someone when/where I'm using it." -- Sean Graham

=================================TOP===============================
 Q342: Just a few questions on Read/Write for linux. 


>Just a few questions on Read/Write
>lock stuff since man pages don't exist (yet)
>for linux.
>
>1) Where can I find documentation, sample code,
>or anything else that will help (eg. URLs etc.)

These locks are based on The Single Unix Specification.  
http://www.opengroup.org/onlinepubs/007908799/

>2) Can I treat the rwlock stuff same as a mutex
>in terms of init/destroy/lock/unlock/trylock ???
>I had a look at pthread.h and all the calls look
>the same... (Is it basically a mutex that allows
>multiple locks for readers?)

Something like that.

>3) What's the story with overhead if you start using
>r/w locks?

In Linux, there is somewhat more overhead compared to mutexes because the locks
are more complex.  The structures and the operations on them are larger.

Also, as of glibc-2.1.3, each thread maintains a linked list of nodes which
point to the read locks that it owns. These nodes are malloced the first time
they are needed and then kept in a thread-specific free list for faster
recycling. The nodes of these lists are destroyed when the thread terminates.

Each time a read lock is acquired, a linear search of this list is made to see
whether the thread already owns the read lock. In that case, a reference count
field in the linked list node is bumped up and the thread can proceed.

(The lists are actually stacks, so that a recently acquired lock is at the
front of the list.)

This algorithm is in place in order to implement writer-preference for locks
having the default attribute, while meeting the subtleties of the spec with
respect to recursive read locks.

The prior versions of the library purported to implement writer preference,
but due to a bug it was actually reader preference.

>4) If you have many readers could that mean that the
>writer will never get a chance to lock, or are the
>locks first-come-first-serve ???  I'm thinking

Writer preference, subject to the requirements of The Single UNIX Specification
which says that a thread may recursively acquire a read lock unconditionally,
even if writers are waiting.

In glibc-2.1.3, LinuxThreads supports the non-portable attribute

    PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP

which gives you more efficient writer preference locks, at the cost of
not supporting recursive read locks. These kinds of locks do not participate
in the aforementioned linked lists. If a writer is waiting on a lock,
and a thread which already has a read lock tries to acquire another one,
it simply deadlocks.

>(I know it's probably dim but...) if a reader can
>always lock, there might be a case where there is
>always at least one reader on the mutex.  What
>happens if a writer comes along and rwlocks ???

If you read the spec, you will note that this is implementation defined.  An
implementation may, but is not required to, support writer preference.
The Linux one does (now).
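
For reference, a minimal added sketch of basic read-write lock usage with the
default attributes (the names are invented for illustration):

#include <pthread.h>

static pthread_rwlock_t table_lock;
static int table_value;

void table_init(void)
{
    pthread_rwlock_init(&table_lock, NULL);
}

int table_read(void)
{
    int v;
    pthread_rwlock_rdlock(&table_lock);   /* many readers may hold this at once */
    v = table_value;
    pthread_rwlock_unlock(&table_lock);
    return v;
}

void table_write(int v)
{
    pthread_rwlock_wrlock(&table_lock);   /* exclusive access */
    table_value = v;
    pthread_rwlock_unlock(&table_lock);
}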
=================================TOP===============================
 Q343: The man pages for ioctl(), read(), etc. do not mention MT-safety. 


>
>But so far I do have an implementation in mind, and
>I have learned enough to check if any library
>functions I will call are MT-safe.  And so I 
>started checking man pages, and to my horror
>found that the man pages for such indespensable 
>familiars as ioctl(), read(), and write() do 
>not mention this issue.
>
>(messy complication: I'm looking at man pages on
>SunOS, but the project will be on Linux.  I don't
>have a Linux account yet.  Bother, said Pooh)

On Solaris, everything in section 2 of the manual pages
(that is, system calls, not general C library functions)
is thread-safe unless explicitly stated otherwise.
Sorry that the man pages are not more clear on this point.

I can't speak for Linux.

Roger Faulkner
[email protected]
=================================TOP===============================
 Q344: Status of TSD after fork()? 



>OK, here's an ugly scenario:
>
>Imagine that you're some thread running along, you've got some reasonable
>amount of stuff stashed away in pthread_{get,set}specific[1].
>
>Now you call fork().
>
>Those who have read the POSIX standard know that "If a multithreaded
>process calls fork(), the new process shall contain a replica of the
>calling thread and its address space...  Consequently ... the child
>process may only execute async-signal safe operations until ... one of
>the exec functions is called."
>
>So, the process is using pthread_*, but it hasn't called pthread_create(),
>so it doesn't really count as a multithreaded process, right?  (Well, I'm
>using that as an assumption at the instant.)

I can't speak for other implementations, but with Solaris pthreads,
the child of fork() is a fully-fledged multithreaded process that
contains only one thread, the one that performed the fork().
It can continue doing multithreaded stuff like create more threads.
Of course, there are the standard caveats that apply to fork(),
like the process must have dealt with its own locks by appropriate
use of pthread_atfork(3THR) or some other mechanism.

>Now for the hard part:  Does pthread_self() return the same value for the
>thread in the child process as it did in the parent for the thread that
>called fork()?  This has implications on thread-specific data, in that
>the definition of "the life of the calling thread" (POSIX 1003.1:1996
>section 17.1.1.2, lines 15-16) would be associated (in my mind) to the
>result of pthread_self().

On Solaris, in the child process, pthread_self() returns 1
(the thread-ID of the main thread) regardless of the value of
the thread-ID of the corresponding thread in the parent process.

>So what I'm looking for is opinions on:
>  A)  Should thread-specific data be replicated, or
>  B)  Should all pthread_getspecific keys now return NULL because it's a
>      new thread in a different process?
>Ugh.  Implementor opinions welcome, as well as users.

On Solaris, the thread-specific data of the forking thread
in the child process is replicated.
Should it?  I think so, but you must ask the standards bodies.

>[1]I like to think of pthread_{get,set}specific as (conceptually) indexing
>   a two-dimensional array that is addressed on column by the result of
>   pthread_self(), and the row by pthread_key_create()'s return.

You should stop thinking this way.  The thread-ID is an opaque object;
it is not to be interpreted as an index into anything.  You should
think of pthread_{get,set}specific as being indexed by the thread
(its register set if you wish), not by its thread-ID.

Roger Faulkner
[email protected]

 
=================================TOP===============================
 Q345: Static member function vs. extern "C" global functions? 


Do I have to? Oh well here goes....

This still uses a nasty cast.  It is also not a good idea to
start a thread in a constructor for the simple reason that
the thread may run _before_ the object is constructed - this
is even more likely if this is a base class - I know, I've been
there and done that.

Use an extern "C" friend as in the following compete example:

#include <iostream>
#include <pthread.h>

extern "C" void* startIt( void* );

class Fred
{
  pthread_t tid;

  friend void* startIt( void* );

  void* runMe() throw() { std::cout << "Done" << std::endl; return NULL; }

public:

  int start() throw() { return pthread_create( &tid, NULL, startIt, this ); }

  pthread_t id() const throw() { return tid; }
};

void* startIt( void* p )
{
  Fred* pF = static_cast<Fred*>(p);

  return pF->runMe();
}

int main()
{
  Fred f;
  int  s;

  if( (s = f.start()) )
    return s;

  std::cout << "Started" << std::endl;

  void* status;

  pthread_join( f.id(), &status );

  pthread_exit( 0 );
}

Warwick Molloy wrote:

> Hi,
>
> What's the difference between a static member function and extern "C" global
> functions?
>
>     name mangling
>
> All C++ code is linked with a regular C linker.  That's why you need name
> mangling to allow such things as overloading etc.
>
> If you want to get an extern "C" pointer to a static member function, do this
>
> extern "C" {
>     typedef void* (*extern_c_thrd_ptr)( void *);
> }
>
> class floppybunny {
>
>     void worker_func( void );
>
>     static void* foo_func( void *p)
>     {
>         floppybunny* ptr =(floppybunny*)p;
>
>         ptr -> worker_func();  // convert explicit this pointer to implied this
> pointer.
>     }
>
>     floppybunny( void )
>     {
>         pthread_create( &tid, NULL, (extern_c_thrd_ptr)foo_func, (void*)this);
>     }
> };
>
> QED
>
> That makes the thread function nicely associated with your class and best of
> all...
>
>                         IT WORKS.
>
> Regards
> Warwick.  (remove the spam to reply)
>
> Ian Collins wrote:
>
> > Timmy Whelan wrote:
> >
> > > You can also make the member function static:
> > >
> >
> > For the Nth time, static members are _NOT_ the same as extern "C"
> > functions.
> > Their linkage may be different.  Use a friend defined as extern "C" or make
> > the
> > real start member public.
> >
> >     Ian
> >
> > >
> > > class foo
> > > {
> > > public:
> > >         static void *startThread(void *param);
> > >
> > >         void *actualThreadFunc( );
> > > };
> > >
> > > void *
> > > foo::startThread( void *param )
> > > {
> > >         foo *f = (foo *)param;
> > >         return f->actualThreadFunc( );
> > > }
> > >
> > > If you need to pass in parameters, use member variables.
> > >
> > > "Mr. Oogie Boogie" wrote:
> > > >
> > > > Howdy,
> > > >
> > > > How does one make a C++ class member function as the starting function
> > > > for a thread?
> > > >
> > > > I keep getting the following warning and have been unable to find any
> > > > documentation/source to get rid of it.
> > > >
> > > > slm_th.cc: In method `slm_th::slm_th(char * = "/dev/tap0")':
> > > > slm_th.cc:98: warning: converting from `void * (slm_th::*)(void *)' to
> > > > `void * (
> > > > *)(void *)'
> > > >
> > > > This is the class:
> > > >
> > > > class slm_th {
> > > >   public:
> > > >     void *Read(void *arg);
> > > > }
> > > >
> > > > void *slm_th::Read(void *arg) {
> > > > ...
> > > > }
> > > >
> > > > Thanks,
> > > >
> > > > -Ralph
> > >

One minor point: Calling convention is, in the general case, a
compiler-specific thing and not an operating-system-specific thing.  Different
compilers for the same operating system can easily have calling conventions
for functions with "C" or "C++" linkages that are incompatible.  

(Some platforms/operating systems have an ABI standard that defines the C
language calling conventions for the platform and operating system.  This is
not universally the case, however.  It is especially not the case for x86
platforms running non-Unix operating systems.)

=================================TOP===============================
 Q346: Can i kill a thread from the main thread that created it? 

  
>can i kill a thread from the main thread that created it?
>under Windows, i only found the CWinThread::ExitInstance () method,

You can kill a thread with TerminateThread().

Using TerminateThread is really, really, really, really, not recommended.  If the
thread owns a critical section, the critical section is not released and it
will forever be inaccessible.  If other threads then try to enter it they
will hang forever.  Also, the stack allocated to the thread is not released
and various other bad things can happen.

If you think you need to use TerminateThread it's a good sign that your
threading design is broken.  You should be telling the thread to exit itself.

Figuring out how to call TerminateThread using MFC'isms such as CWinThread is
left as an exercise to the reader.

    -Mike

> Also, the stack allocated to the thread is not released
> and various other bad things can happen.

Yes, it's that bad... A while ago I started writing an app that used
TerminateThread() - it leaked about a megabyte per second under load =).

> If you think you need to use TerminateThread it's a
> good sign that your threading design is broken.
> You should be telling the thread to exit itself.

I don't agree 100%; I've encountered several situations where it would be
very handy to kill a thread (think about a long-running computation whose
results you aren't interested in anymore). Pthreads has a nice design - a
thread can explicitly say when it may be cancelled... (I often end up coding
a solution like yours - a message-passing mechanism to tell threads to die -
but that always seems to add more complexity than it's worth...)

Dan
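
A minimal pthreads sketch of the "tell the thread to exit itself" approach
(hypothetical names, error checking omitted): the worker periodically checks a
mutex-protected stop flag, and the creator sets the flag and joins instead of
killing the thread.

#include <pthread.h>

static pthread_mutex_t stop_lock = PTHREAD_MUTEX_INITIALIZER;
static int stop_requested = 0;          /* protected by stop_lock */

static int should_stop(void)
{
    int stop;
    pthread_mutex_lock(&stop_lock);
    stop = stop_requested;
    pthread_mutex_unlock(&stop_lock);
    return stop;
}

static void *worker(void *arg)
{
    while (!should_stop()) {
        /* do one bounded unit of work, then re-check the flag */
    }
    return NULL;                        /* thread exits itself; its locks are released normally */
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);

    /* ... later, instead of a TerminateThread()-style kill: */
    pthread_mutex_lock(&stop_lock);
    stop_requested = 1;
    pthread_mutex_unlock(&stop_lock);
    pthread_join(tid, NULL);            /* wait for a clean exit */
    return 0;
}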

=================================TOP===============================
 Q347: What does /proc expose vis-a-vis LWPs? 

 
>
>> Thanks for the answer! I would really like to know how to see which
>> thread is running on which processor to see if my multithreaded
>> app (which uses the pipeline model) is really using the 6 available CPUs on my
>> platform. Is there such a beast?
>
>/proc on Solaris doesn't expose this information, so I doubt that any
>non-Sun utility can show it. I don't know if Sun has something (bundled or
>unbundled). As for migrating LWPs from one processor to another - it's
>perfectly normal on Solaris.

You are wrong.  /proc does provide this information, in the lwpsinfo
struct contained in /proc/<pid>/lwp/<lwpid>/lwpsinfo for each lwp in
the process:

    processorid_t pr_onpro;         /* processor which last ran this lwp */

It is displayed with the prstat utility.  Use the command 'prstat -L'
to see each lwp in each process.

Roger Faulkner
[email protected]
=================================TOP===============================
 Q348: What mechanism can be used to take a record lock on a file? 


> What mechanism can be used to take a record lock on a file (using the
> fcntl() call) in a POSIX multithreaded application?  It seems to me that
> these locks are process based, and therefore multiple threads within the same
> process are treated as the same thing.
>
> Any pointer would be appreciated

This has been discussed several times before. Yes, fcntl() locks are
process-based, for a number of reasons historical and pragmatic. Some people
have successfully built a two-level file locking strategy that uses mutexes
between threads within a process and fcntl() between processes. Essentially,
you reference count the fcntl() lock(s) so that the process holds an fcntl()
lock whenever any thread within the process has an area locked; if more than
one thread within the process is interested in the same file area, they
synchronize among themselves using a mutex. I believe that sample code may
have been posted. Search the newsgroup archives, if you can find a good server.
(I don't know what the state of Deja is now; it was always a good one, and may
be again if the transfer of control has been straightened out.)

/------------------[ [email protected] ]------------------\
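
A sketch of the two-level strategy described above, assuming a single fixed
region and write locks only (region_lock_t, region_lock() and region_unlock()
are hypothetical names; real code would track multiple regions and check every
return value):

#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

typedef struct {
    pthread_mutex_t meta;   /* guards refs and the fcntl() state      */
    pthread_mutex_t area;   /* serializes threads that use the region */
    int             refs;   /* threads currently holding the region   */
    int             fd;
    off_t           start, len;
} region_lock_t;

static int region_fcntl(region_lock_t *r, short type)
{
    struct flock fl;
    fl.l_type = type;       /* F_WRLCK or F_UNLCK */
    fl.l_whence = SEEK_SET;
    fl.l_start = r->start;
    fl.l_len = r->len;
    return fcntl(r->fd, F_SETLKW, &fl);
}

void region_lock(region_lock_t *r)
{
    pthread_mutex_lock(&r->meta);
    if (r->refs++ == 0)                 /* first interested thread takes the */
        region_fcntl(r, F_WRLCK);       /* process-wide fcntl() lock         */
    pthread_mutex_unlock(&r->meta);

    pthread_mutex_lock(&r->area);       /* thread-vs-thread exclusion        */
}

void region_unlock(region_lock_t *r)
{
    pthread_mutex_unlock(&r->area);

    pthread_mutex_lock(&r->meta);
    if (--r->refs == 0)                 /* last interested thread releases   */
        region_fcntl(r, F_UNLCK);       /* the process-wide lock             */
    pthread_mutex_unlock(&r->meta);
}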
=================================TOP===============================
 Q349: Implementation of a Timed Mutex in C++ 


Thanks to everybody who spent time thinking about my program.
It works! (stable as a rock)
Here it is. If it is useful for somebody -> use it:

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/time.h>
#include <sched.h>
#include "pthread.h"

typedef struct
{
  pthread_mutex_t   mutex;
  pthread_cond_t    cond;
  pthread_t     owner;
  int           value;
}Mutex_t;

void*       main_mutex;     /* Pointer to my main_Mutex */  
int         mutexTestCnt = 0;   /* Counter */
pthread_t       thread;     
pthread_cond_t  startcond;      /* Cond to start threads */
pthread_mutex_t startmutex;

int MutexCreate(void* *id)
{
  Mutex_t *Mutexvar  = malloc(sizeof(Mutex_t));
  if(Mutexvar == NULL) {return -1;}
  pthread_mutex_init(&Mutexvar->mutex,NULL);
  pthread_cond_init(&Mutexvar->cond,NULL);
  Mutexvar->value=1;
  *id = (void*)Mutexvar;
  return 0;
}
int MutexDelete(void* id)
{
  Mutex_t *mutex =(Mutex_t *)id; 
  if (mutex->value!=1) {return -1; }
  free(mutex);
  return 0;
}  

int MutexObtain(void* id, int timeoutrel)
{
   Mutex_t *mutex =(Mutex_t *)id;  
  int status=0;
  struct timeval now;
  struct timespec timeout;
  if(mutex == NULL)  return -1;
  pthread_mutex_lock(&mutex->mutex);
  if ((mutex->value<0)||(mutex->value>1))
  {
    pthread_mutex_unlock(&mutex->mutex);
    return -2;
  }
  if (mutex->value==0)
  {
    gettimeofday(&now,NULL);
    timeout.tv_sec = now.tv_sec + timeoutrel;
    timeout.tv_nsec = now.tv_usec * 1000;
    do{
      status=pthread_cond_timedwait(&mutex->cond,&mutex->mutex,&timeout);
      if(status==ETIMEDOUT)
      {
        pthread_mutex_unlock(&mutex->mutex);
        return -3;
      }
    }while((status!=0)||(mutex->value!=1));
  }
  mutex->value=0;
  mutex->owner=pthread_self();
  pthread_mutex_unlock(&mutex->mutex);
  return 0;
}
int MutexRelease(void* id)
{
  Mutex_t *mutex =(Mutex_t *)id; 
  pthread_mutex_lock(&mutex->mutex);
  if ((mutex->value<0)||(mutex->value>1))
  {
    pthread_mutex_unlock(&mutex->mutex);
    return -1;
  }
  if (pthread_equal(mutex->owner,pthread_self())==0)
  {
    pthread_mutex_unlock(&mutex->mutex);
    return -2;
  }
  mutex->value=1;
  mutex->owner=0;
  pthread_cond_signal(&mutex->cond);
  pthread_mutex_unlock(&mutex->mutex);
  return 0;
}
void *testfunc(void * arg)
{
  int i;
  pthread_mutex_lock(&startmutex);    /* Start all threads at the same time */
  pthread_cond_wait(&startcond,&startmutex);
  pthread_mutex_unlock(&startmutex);
  printf("Thread %s started as %lu.\n",(char *)arg,(unsigned long)pthread_self());

  for(i=0;i<100000;)
  {
    if(MutexObtain(main_mutex, 1000) != 0)
    {
      printf("Thread %lu: MutexObtain() FAILED\n", (unsigned long)thread);
    }
    /* Modify protected variables */
    i = ++mutexTestCnt;
    thread=pthread_self();
    /* Release CPU */
    sched_yield();
    /* And check if somebody else could get into the critical section */
    if(i!=mutexTestCnt)
    {
      printf("Thread %lu: Mutex violated by %lu\n",
             (unsigned long)thread,(unsigned long)pthread_self());
    }
    /* Leave critical section */
    if(MutexRelease(main_mutex) != 0)
    {
      printf("Thread %lu: MutexRelease() FAILED\n", (unsigned long)thread);
    }

    /* Allow rescheduling (another thread can enter the critical section) */
    sched_yield();
  }
  printf("Thread %s ready\n",(char *)arg);
    return NULL;

}
int main(void)
{
    pthread_t t_a,t_b,t_c;
    int ret;
    char* a;
    pthread_cond_init(&startcond,NULL);
    pthread_mutex_init(&startmutex,NULL);
    if(MutexCreate(&main_mutex)!=0)  return -1;

    ret=pthread_create(&t_a,NULL,testfunc,(void *)"a");
    if(ret!=0) fprintf(stderr,"Can't create thread a\n");
    ret=pthread_create(&t_b,NULL,testfunc,(void *)"b");
    if(ret!=0) fprintf(stderr,"Can't create thread b\n");
    ret=pthread_create(&t_c,NULL,testfunc,(void *)"c");
    if(ret!=0) fprintf(stderr,"Can't create thread c\n");
    printf("Press key to start\n"); getc(stdin);
    pthread_mutex_lock(&startmutex);
    pthread_cond_broadcast(&startcond);
    pthread_mutex_unlock(&startmutex);
    ret=pthread_join(t_a,NULL);
    ret=pthread_join(t_b,NULL);
    ret=pthread_join(t_c,NULL);
    MutexDelete(main_mutex);
    printf("All done\n");
    return 0;
}

=================================TOP===============================
 Q350: Effects that gradual underflow traps have on scaling. 


Dave Butenhof  writes:
> Martin Shepherd wrote:
> > By the way, neither in your book, nor in the other POSIX threads books
> > that I have, is there any mention of the devastating effects that
> > gradual underflow traps can have on scaling. I'm not even sure why
> > this occurs, and would like to understand it better. My guess is
> > that if multiple threads are suffering underflows at the same time, as
> > was the case in my program, there is contention for a single underflow
> > handler in the kernel. Is this correct?
> 
> Perhaps, in the HP-UX kernel. I don't know. It would depend on whether the
> underflow is handled by hardware or software; and, if in software, precisely
> how and where. If you're reporting underflow traps to the APPLICATION, that's
> certainly a performance sink if you're underflowing much; signal delivery is
> expensive, and certainly doesn't help your application's scaling.

My experience on a number of systems is that gradual underflow is
usually performed in software, not in hardware, and this includes
expensive workstations and super-computers traditionally used for
number crunching. For example, Sun sparcs, HP's, Dec Alpha's etc..,
all do this. If this weren't bad enough, there is no standard way to
disable it. In Solaris one calls nonstandard_arithmetic(), on HP one
calls fpsetflushtozero(), and I don't know what one does on other
systems.

Whether gradual-underflow traps are delivered as signals all the way
to the application, or whether the kernel handles them I don't know,
but regardless, they can increase the run time of any program by large
factors, and seriously suppress scaling in parallel programs, so in
general it is really important to either avoid them or disable them.
In particular, the ability to reliably disable them process-wide, just
as a diagnostic aid, is indispensable, because vendors rarely provide
tools to monitor them.

> This is getting extremely machine-dependent, and therefore it's hard to say
> much about it in a general book. Furthermore, even on platforms where it's a
> problem, it's only going to affect the (relatively, and maybe absolutely)
> small number of FP-intensive applications that do a lot of
> underflowing.

While it is true that most FP-intensive applications shouldn't
underflow, and that good programmers will do their utmost to avoid
performing any calculations that might underflow, everybody makes
mistakes. In my case, once I worked out how to globally enable sudden
underflow across all of my threads, my program speeded up by a factor
of 4. This then led me to a bug in the test that was supposed to have
prevented the underflowing calculations in the first place, and the
end result was a factor of 15 speedup.  I agree that this is somewhat
specialized and very machine specific, but so are the discussions of
memory barriers, and memory caching models that one finds in good
books on parallel programming with threads...

Martin 
=================================TOP===============================
 Q351: LinuxThreads woes on SIGSEGV and no core dump. 

>
> is there something inherently wrong with my system or is this all
> "normal" behaviour? i'm using the pthreads shipped with glibc 2.1.2 -
> they might be a bit old, but i don't want to get into a big fight with
> my sysadmin.

I have experienced all sorts of strange errors similar to yours. The
workaround is to include this in your program:

  void sig_panic(int signo) {
    pthread_kill_other_threads_np();
    abort();
  }
  
  ..
  struct sigaction act;
  memset(&act,0,sizeof(act));
  act.sa_handler = sig_panic;
  sigaction(SIGSEGV, &act, NULL);
  sigaction(SIGBUS, &act, NULL);
  sigaction(SIGTRAP, &act, NULL);
  sigaction(SIGFPE, &act, NULL);

This produces reliable core dumps and you can do a post-mortem analysis.


Regards,
  Ivan
=================================TOP===============================
 Q352: On timer resolution in UNIX. 


Under most Unix flavors, user processes (usually) enjoy the
10ms resolution. This is the time unit the kernel dispatcher
is timer-interrupted to handle ``asynchronous'' events. When
timer-related events, such as, firing, handling, etc., are
bound to the dispatcher `tick', it is not possible to get
finer resolution than that.

But, there are several exceptions to the above, especially
on machines equipped with ``cycle counters''.

IRIX 6.5.x allows privileged processes to call nanosleep()
with sub-millisecond resolution. The actual resolution is
only restricted by the overhead to dispatch a kernel thread
to handle the event. I have seen reaction times in the range
of 300-400 micro-seconds on 200 MHz 2 CPU systems. The same
is true for timers (see timer_create()) based on the
CLOCK_SGI_FAST timer, which is IRIX specific, and thus, not
portable.

Solaris 8 finally managed to be able to disassociate the
handling of timer events from the scheduler tick. One can
utilize the high-resolution cycle counter by specifying
CLOCK_HIGHRES for clock-id in the timer_create(3RT) call. I
have seen sub-millisecond resolutions under Solaris
8. Unfortunately nanosleep(3RT) is still bound to the 10ms
dispatcher tick. For earlier Solaris releases one could change the
HZ (or something like that) variable to, say, 1000, in order
to obtain 1 millisecond dispatcher tick duration. Some
people claimed that this can be tuned to 10000, but then the
system could spend most of its time serving the timer
interrupts.

HP-UX 11.00 supports the 10ms resolution with nanosleep()
and timer_create(). One needs to get special real-time
version of the kernel in order to have access to higher
resolution timers.

From a casual perusal of BSD4.4 derivatives (and I think
also in Linux systems) the best one can get is the 10ms
resolution.

In POSIX systems the portable way to request
``high-resolution'' timers is via the CLOCK_REALTIME clockid
in timer_create(), whose resolution is guaranteed to be no coarser
than 10ms.  I have not seen any system giving finer resolution
than 10ms with timer_create() and CLOCK_REALTIME.

I don't have access to AIX or Tru64 UNIX.

poll(), select(), sigtimedwait() offer the usual 10ms
resolution.



Michael Thomadakis
Computer Science Department
Texas A&M University
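
For the portable CLOCK_REALTIME route mentioned above, the pattern looks
roughly like this (a sketch only; whether the 500-microsecond expiration is
honored depends on the tick and clock issues discussed above, and older
systems may need -lrt):

#include <signal.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static void on_alarm(int sig) { (void)sig; }    /* just interrupt pause() */

int main(void)
{
    struct timespec res;
    struct sigevent sev;
    struct itimerspec its;
    timer_t tid;

    clock_getres(CLOCK_REALTIME, &res);         /* advertised resolution */
    printf("CLOCK_REALTIME resolution: %ld s %ld ns\n",
           (long)res.tv_sec, res.tv_nsec);

    signal(SIGALRM, on_alarm);
    sev.sigev_notify = SIGEV_SIGNAL;            /* deliver SIGALRM on expiry */
    sev.sigev_signo = SIGALRM;
    sev.sigev_value.sival_ptr = &tid;
    timer_create(CLOCK_REALTIME, &sev, &tid);

    its.it_value.tv_sec = 0;                    /* one-shot, 500 us from now */
    its.it_value.tv_nsec = 500000;
    its.it_interval.tv_sec = 0;
    its.it_interval.tv_nsec = 0;
    timer_settime(tid, 0, &its, NULL);

    pause();
    printf("timer fired\n");
    return 0;
}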


Joe Seigh wrote:

> bill davidsen wrote:
>
> >   I believe that the resolution of select() is actually 100ms, even
> > though you can set it in us.
> >
> I think what you are seeing is probably an artifact of the scheduler.  It
> looks like what the system is doing is when the timer pops, the system just
> marks the thread ready and the thread has to wait until the next available
> time slice.  On Solaris, for programs in the time sharing class, this appears
> to be about 10 ms or so.  Try timing nanosleep with various settings to
> see this effect.
>
> You might try running the program at real time priority to put it in the
> real time scheduling class and playing with the scheduler's real time
> parameters.  However setting up the kernel for real time stuff probably
> increases the kernel overhead significantly, so if you are looking for
> overall system throughput, this is not the way to do it.
>
> For non timed waits, I've seen a lot less latency.  This is probably because
> the pthread implementation chose to pre-empt a running thread.  The implication
> of this is that they are rewarding the cooperative processing model though
> possibly at the expense of extra context switching unless you do something
> to alleviate that.
>
> Joe Seigh
=================================TOP===============================
 Q353: Starting a thread before main through dynamic initialization. 

> c) As my program starts I might start a thread before main because of
> some other file static object's dynamic initialization. This thread
> might acquire my lock xyzzy before that lock is dynamically initialized
> setting xyzzy.locked_ to 1.

    My coding policies do not permit this. I recommend
that you don't allow it either. Threads should not be
started by the initialization or creation of static
objects. This just makes too many problems.

    For many of my classes, since we know that there is
only one thread running before all initialization is
complete, we don't bother to mess with any locks, we just
bypass them. Fortunately, any thread created after a change
to a memory location is guaranteed to see that change.

    DS
=================================TOP===============================
 Q354: Using POSIX threads on mac X and solaris? 


Does anyone know of any advantages or disadvantages of using POSIX threads
(pthreads) on Mac OS X and Solaris compared to native implementations?

Do pthreads make calls to the native implementation in both these cases, and is
the mapping between pthreads and kernel objects 1:1?

Thanks
Sujeet
  
I don't know anything about the thread implementation on the mac. On Solaris,
pthreads are roughly equivalent to the so-called solaris threads
implementation. I believe that both APIs sit on top of lower-level calls.
The main advantage of using POSIX threads is portability. The other is
simplicity.

% Do pthread make call to native implementation in both these cases and is the
% maping between pthread and kernel object 1:1 .

The mapping between pthreads and the kernel scheduling entity in Solaris
depends on what you ask for. Note that you must be careful if you try to
use the m:n model, because the Solaris two-level thread scheduler is crap.
(this is not related to the API -- it's crap for both pthreads and UI threads).
--
 

On Mac OS X, POSIX threads is the lowest-level threading interface anyone
should be calling, at least outside the kernel. The POSIX interface uses Mach
threads, and there is an API to create Mach threads -- but it's not very
convenient. (You need to create the thread, load the registers with intimate
knowledge of the calling standard, including creating a stack and setting it to
"bootstrap" the thread.) Also, the Mach API has limited (and inefficient)
synchronization mechanisms -- IPC.

On Solaris, while libpthread depends on libthread, UI threads isn't really so
much a "native implementation"; they're more or less parallel, and happen to
share a common infrastructure, which happens (for mostly historical reasons) to
reside in libthread. You could consider the LWP layer to be "native threads",
but, somewhat like Mach threads, they're really not intended for general use.

The POSIX thread API is far more general, efficient, and portable than Mach,
UI, or LWP interfaces. Unfortunately, the POSIX thread implementation on Mac OS
X is incomplete, (it can't even build half of my book's example programs), and
I wouldn't want to count on it for much. (Though I have no evidence that what's
there doesn't work.) Still, you wouldn't be any better off working directly
with Mach threads.

Solaris, by the way, supports both "N to M" and "1 to 1" thread mappings.
Solaris 8 has a special library that's always 1 to 1. The normal libpthread
provides both N to M (Process Contention Scope, or PCS) and 1 to 1 (System
Contention Scope, or SCS); though the default is PCS and you can't change the
scope of the initial thread. Mac OS X supports only 1 to 1 scheduling.
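
As a concrete illustration of the contention-scope distinction, here is a
minimal sketch that asks for System Contention Scope (a 1:1, "bound" thread)
through the attributes object; on implementations that support only PCS the
setscope call may fail with ENOTSUP:

#include <pthread.h>
#include <stdio.h>

static void *work(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t tid;
    int rc;

    pthread_attr_init(&attr);
    rc = pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    if (rc != 0)
        fprintf(stderr, "system contention scope not supported: %d\n", rc);

    pthread_create(&tid, &attr, work, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}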

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
=================================TOP===============================
 Q355: Comments on ccNUMA on SGI, etc. 


> I have a big problem with my simulation. I am trying to implement a
> parallel neural network simulator using SNNS simulator, C language and
> POSIX threads. I am using an SGI machine with 8 processors and an IRIX
> Origin 2000 system. For more than 2 weeks I have been trying to make my code
> run faster on 8 processors than on 2 - but I still can't get any
> progress ! [...]

The Origin 2000 systems are ccNUMA; that means, unlike traditional SMP
multiprocessors, all systems do not have equal access to all memory. Any
memory you use will be on one "node" or another. Threads running on that
node (each Origin 2000 node has 2 processors, so you're potentially
running threads on 4 different nodes) have fast local access. Threads
running on other nodes have to go through the network interconnects
between nodes. Those interconnects are slower (typically much slower) and
also have limited bandwidth. That is, it's probably not possible for 3
nodes to simultaneously access memory in the 4th node without severe
performance degradation over the "normal" local access.

> I have read now on the IRIX documentation, that the cache memory may be
> a very important issue - and that each thread should access the same
> subset of data all the time - for good performances. This is not the
> case in my program. And also, the network (which has around 600 units)
> and the connections are created by the main thread and - probably - are
> stored on one processor ?! This means that all the others processors are
> communicating with this one to get the unit's information ? Is this so
> bad ? This can be the only reason for the low performances ?

Running with cache is always the best strategy for performance. Modern
processors are so much faster than memory, that memory access is the only
defining characteristic of program performance. We used to count
instructions, or CPU cycles; but all that's irrelevant now. You count
memory references, as a first indicator; for detailed information, you
need to analyze the cache footprint. Most processors have multiple levels
of cache, maybe up to 3, before you hit main memory. The first level
delays the instruction pipeline by a couple of cycles. The second may be
on the order of 10 cycles, the third 20 to 100 cycles. And, relative to
normal processor speeds, if you've got to hit main memory you might as
well break for lunch. And that's just LOCAL memory, not remote memory on
some other node.

> Also, the global list of spikes is updated by all threads - and now I am
> wondering where is stored, and how I should store it, in order to have
> an efficient use of it. In the same documentation it says that you
> should store the used data on the same processor but here the spikes are
> inserted by different threads and computed by any of the threads. This
> is because the entire simulation is driven by 'events' and time issues -
> so any available thread compute the next incoming event.

Writing closely packed shared data from multiple threads, even on an SMP,
is "bad mojo". When the data lives within the same  hardware cache line,
all other processors that have written or read the data "recently" need to
see that their cached copy is now invalid, and refetch from main memory.
That, obviously, is expensive. When all of your threads are writing to the
same cache line continuously, the data in that line "ping pongs" between
processor caches. This is the number 1 program characteristic that leads
to the old "my program runs faster without threads". (The second, and less
subtle, is overuse of application synchronization.) And remember, in a
ccNUMA system like yours, anything dealing with memory is far worse unless
the memory is local to your node. Obviously, memory shared by all your
threads cannot possibly be local to all of them unless you're using only a
fraction (2) of the available processors. That is very likely why you ran
into the magic number "2 threads (processors)". When you're using only 2,
the system can keep both of them, and their memory, on the same node.
Beyond 2, that's impossible.

I'm not sure how IRIX manages your memory in this case. Some systems might
automatically "stripe" memory that's not otherwise assigned across all the
nodes. (If there's enough data to do that.) That may tend to even out the
non-local memory references, and can often perform better than simply
putting all the memory into one node. On the other hand, memory that's not
explicitly assigned is often allocated on the first node to reference the
memory; and if your startup initializes your entire data array (or
allocates it all from malloc), then it's likely that the entire data set
IS on the node where you started. Which means that the other 3 nodes are
beating on its interconnect port continuously, and you're operating in the
worst case performance mode.

The best strategy (if you can) would be to explicitly target memory to
specific nodes along with two specific threads that will be doing all
(ideally) or most of the access to that memory. (In your case, this
probably isn't possible; but the closer you come, the better your
performance will be.) Even making sure that your global arrays are striped
might help. In fact, even making sure that they're allocated from two
nodes instead of just one might double your performance. I'm not familiar
with the IRIX APIs for assigning memory (or threads) to specific
ccNUMA nodes, but such things must exist, and you might consider looking
them up and giving it a try.

Otherwise, you might consider limiting your application to a single node.
Given that your application sounds pretty heavily CPU bound with
relatively little I/O, you're unlikely to gain any advantage in that case
from more than 2 threads. (The more blocking I/O you do, the more likely
it is that additional threads will improve throughput.) If you can split
the dataset more or less in half, you might consider doing that across 2
nodes, with 4 threads, and see how that works.

Just as optimizing threaded performance has started to go from pure black
magic to something that's almost engineering, along comes ccNUMA and
breaks all the rules and brings back that element of magic. Welcome to the
bleeding edge, and... good luck. ;-)

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
 

Origin 2000 is rather old at this point.  Origin 3000 is
the current system, and its memory system is even less
NUMA than the Origin 2000.

> [...good stuff snipped...]

>                                                          I'm not familiar
> with the IRIX APIs for assigning memory (or threads) to specific
> ccNUMA nodes, but such things must exist, and you might consider looking
> them up and giving it a try.

Indeed, one has complete control over where memory is placed.

> [...more snipped...]

> Just as optimizing threaded performance has started to go from pure black
> magic to something that's almost engineering, along comes ccNUMA and
> breaks all the rules and brings back that element of magic. Welcome to the
> bleeding edge, and... good luck. ;-)

I'd hardly call NUMA bleeding edge after all these years.

One thing Dave didn't bring up is "execution vehicle"-pthread
affinity.  In an M-on-N pthread implementation the kernel
schedules the "execution vehicles" and the library schedules
the pthreads onto them.  The kernel is cache+memory affinity
aware and tries to schedule execution vehicles to maximize
affinity, while trying to be fair, schedule real time threads,
etc.  The library has to avoid deadlock, observe priorities,
and schedule what could be far more threads than execution
vehicles.  What can happen is that IRIX may nicely place
say 5 execution vehicles on the 4 CPUs in one C-brick, and
1 CPU in another "nearby" C-brick, and leave them there,
maximizing affinity, but the library, for a variety of reasons,
may end up moving pthreads around on these execution vehicles in a
way that is not affinity friendly.

For CPU intensive applications this may be a performance issue,
so the library provides a nonportable scope: PTHREAD_SCOPE_BOUND_NP
to bind a pthread to an execution vehicle.  For realtime and
applications which typically run alone on a system the library
provides a nonportable call: pthread_setrunon_np() to force a
bound (or system scope) thread to run on a particular CPU.

I understand that Sun recently released an alternate version
of its pthread library which has an N-on-N implementation.  I'd
guess they did this because of the same affinity issue.  Does
anyone know different?
=================================TOP===============================
 Q356: Thread functions are NOT C++ functions! Use extern "C" 


Patrick TJ McPhee wrote:
> 
> In article ,
> Doug Farrell  wrote:
> 
> % And again you refer to 'the standard C++ thread function', what are you
> % talking about?
> 
> There isn't any, but do use it if you don't want to pass a C function.
>
> [Cry of frustration followed by general elucidation omitted]
>
> If it causes you emotional distress to create a C function,
> then use the standard C++ thread class (keeping in mind that there
> isn't one).

Just so's this doesn't go on and on and on: Patrick, is it fair to
assume that you are ladling on the irony here?

Doug, the essence of what has been said so far is this:

pthread_create's prototype in C++ is:

  extern "C" int pthread_create(pthread_t *, pthread_attr_t *,
                            void *(*start_routine)(void *),
                            void *);

See that `extern "C"'?  That covers _all_ function types in the
declaration; in particular the start_routine function pointer, whose
type is actually

  extern "C" void *(*)(void *);

that is, `pointer to C function taking void * and returning void *'.
By passing a function whose C++ prototype is:

  class SomeClass {
    // ...
    static void *threadfn(void *);
  };

or just:

  void *threadfn(void *);

  (therefore, &threadfn is `pointer to C++ function taking void * and
  returning void *'),

you are invoking undefined behaviour.  Your implementation is now
allowed to activate your modem and phone the speaking clock in Ulan
Batur, amongst other things.  You _know_ you _mustn't_ invoke
undefined behaviour, just as you _know_ that unless you feel obliged
to by current compilers' handling of implicit template
instantiation, you shouldn't put the implementation in the header
file...*

In short, using POSIX threads, you cannot put the argument to
pthread_create inside the class.  Period.  Put it in the
implementation file inside an anonymous namespace, or use a global
static, and pass it a pointer to the class as its argument, like this:

  class SomeClass { public: void *threadfn(); };

  extern "C" {
    static void *threadfn(void *args)
    {
      SomeClass *pSomeClass = static_cast<SomeClass *>(args);
      return pSomeClass->threadfn();
    }
  }

Guy (not saying anything further just in case it starts another
pointless "Standard C++ is broken with respect to threading--oh no it
isn't--oh yes it is etc. ad nauseam" thread).

*Ask yourself: given a header file containing the implementation, or a
library/object file containing the implementation, what must my users
do if I change the implementation?
=================================TOP===============================
 Q357: How many CPUs do I have? 


NoOfCpus = sysconf(_SC_NPROCESSORS_CONF); /* Linux */

SYSTEM_INFO SystemInfo;                                           /* NT */
GetSystemInfo(&SystemInfo);
NoOfCpus = SystemInfo.dwNumberOfProcessors;

My experience is that for busy CPU-bound threads the number of
threads should not greatly exceed the number of available processors.
That delivered the best performance.

[email protected]



Victor Khomenko  wrote in message
news:[email protected]...
> Hi,
>
> I want to make the number of working threads depend on the number of
> processors in the system. Is there a good way to find out this information
> (run time)? How many threads per processor is a good ratio (all threads are
> going to be pretty busy, but can sometimes wait on mutexes and conditions)?
>
> I need this information for Linux and Win32.
>
> Victor.
>
>
=================================TOP===============================
 Q358: Can malloc/free allocate from a specified memory range? 

 
> Using mmap to share data between processes leads to the requirement to
> dynamically allocate and free shared memory blocks.  I don't want to
> mmap each block separately, but prefer to allocate and free the memory
> from within the mapped region.  Is there a way to redirect malloc/free
> library functions to allocate from a specified memory range, instead of
> the heap?
> 
> I don't want to mmap each block separately or to use shmget because of
> the cost of so many mappings.
> 
> -K
The mmalloc package at:
http://sources.redhat.com/gdb/5/onlinedocs/mmalloc.html
might be a good starting point.

HTH,
--ag 
=================================TOP===============================
 Q359: Can GNU libpth utilize multiple CPUs on an SMP box? 


>> > Is there any existing patches that can make GNU libpth utilize
>> > multiple CPUs on an SMP box?
>> I recall that IBM is doing something much like it. I cannot remember
> Can you give me some clues to find it?  I've tried google, but it
> returned either too many or no results.

Here is the URL:
http://oss.software.ibm.com/developerworks/opensource/pthreads/


bye, Christof

=================================TOP===============================
 Q360: How does Linux pthreads identify the thread control structure? 


R Sharada wrote:

>     I have a query related to how the Linux pthreads implementation
> identifies the thread control structure or descriptor for the current thread,
> in the case when the stack is non-standard ( by way of having called
> setstackaddr / setstacksize ).

First off, don't ever use pthread_attr_setstackaddr(), because it's a
completely brain-damaged interface that's inherently broken and totally
nonportable. I've explained why it's broken (both in terms of engineering
features and political history), and I won't repeat it here. (You can
always search the archives.) Just don't use it.

The next version of POSIX and UNIX (2001) contains my corrected version,
which ended up being named pthread_attr_setstack(). At some point, this
will begin to appear on Linux and other systems.

> Currently the method ( in thread_self
> routine ) just parses through the whole list of threads until one
> matches the current sp and then obtains the descr from there. This could
> get quite slow in conditions where there are a lot  of threads ( close
> to max ). Isn't there a better way to do this?

No; not for Linux on X86.

>     Does anyone know how this is handled in other UNIXes - AIX, Solaris,
> etc.??

The best way to handle it is to define in the hardware processor context a
bit of information that's unique for each thread. SPARC and IA-64 define a
"thread register" that compilers don't use for anything else, but can be
read by assembly code or asm(). Alpha defines a "processor unique" value
that can be read by a special instruction. I believe that PowerPC has one
or the other of those techniques, as does MIPS.

LinuxThreads can and should use these mechanisms when built for the
appropriate hardware; but on X86 (which is still the most common Linux
platform), none of this is an option. Of course, "the system" could define
a universal calling standard that reserved from the compiler code
generators "a register" that could be used in this way. However, the X86
register set is pretty small already, and, in any case, trying to make that
change NOW would be a major mistake since you couldn't use any existing
binary code (or compilers).

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/

=================================TOP===============================
 Q361: Using gcc -kthread doesn't work?! 


> i have a multithreaded program on a dual-pentium machine running freebsd
> 4.3. compiling everything with
>
>    gcc -pthread ...
>
> works fine, but doesn't really make the second processor worth the money
> (i.e. everything runs on one thread). according to 'man gcc' compiling
> with
>
>    gcc -kthread
>
> should fix the problem. unfortunately, gcc tells me it doesn't recognise
> the option. in a message on mailing.freebsd.bugs i read that for freebsd
> 4.2 one had to recompile gcc with the appropriate arguments set. i did a
> make and make install in /usr/src/gnu/usr.bin/cc, but i couldn't add any
> options and the compiler turned out just the same as the last...
>
> anybody know what i should do here?

As you may have already seen, there's a FreeBSD bug report on this:

http://www.FreeBSD.org/cgi/query-pr.cgi?pr=24843

Here are the comments in the "Audit-Trail" section at the bottom of the
page:

"The -kthread link flag was purposely removed, since linuxthreads is not 
part of the base system.  There are explicit instructions that come with 
the linuxthreads port that explain how to link with linuxthreads."

> p.s. i don't want to start a flame-war on linuxthreads vs. whatever -
> the purpose of compiling under freebsd is to be able to tell for myself
> which os is best for my needs ;)

You may find "Kernel-Scheduled Entities for FreeBSD" interesting
reading:

http://people.freebsd.org/~jasone/refs/freebsd_kse/freebsd_kse.html

I once did a test with linuxthreads (available under /usr/ports/devel)
on a dual-CPU FreeBSD system and my test program successfully used
both processors.  However, my test program was trivial, so I'd want to
do a lot more testing before I'd put anything more complicated into
production.  As you may know from reading this newsgroup, there's some
criticism of the linuxthreads model.  But at least it lets threaded
programs use multiple CPUs on FreeBSD :-)

-- 
Michael Fuhr
=================================TOP===============================
 Q362: FAQ or tutorial for multithreading in 'C++'? 


For the Win32 API only!

MSDN Library (With samples and function documentation)
http://msdn.microsoft.com/library/devprods/vs6/visualc/vccore/_core_multithr
eaded_programs.3a_.overview.htm

Thread function documentation :
http://www.clipcode.com/content/win32_3.htm

If you speak French:
http://perso.libertysurf.fr/chez_moe/programmation/index.html


Tomasz Bech wrote in message:
[email protected]...
> Hi,
>     Does anybody know about good faq or tutorial for multithreading in
> 'C++'?
>   Thanks,
>         Tomasz
>
>

=================================TOP===============================
 Q363: WRLocks & starvation. 


> "Dave Butenhof"  schrieb im Newsbeitrag
> news:[email protected]...
>
> The UNIX 98 standard (and the forthcoming POSIX 1003.1-2001 standard)
> includes  POSIX read-write lock interfaces. You'll find these interfaces implemented
> (at least) on AIX 4.3.x, Solaris 8, Tru64 UNIX 5.0, and any moderately recent
> version of Linux. Earlier versions of Solaris and Tru64 UNIX also provided
> different nonstandard interfaces for read-write locks.
>
> I guess these implementations will take care of classic problems like
> starvation of the writer, don't they?

Sure. If they want to. In whatever manner thought best by the designers. (Or in
whatever way the code happened to fall out if they didn't bother to think about
it.)

Even the POSIX standard read-write lock doesn't require any particular
preference between readers and writers. Which (if any) is "right" depends
entirely on the application. Preference for readers often results in improved
throughput, and is frequently better when you have rarely updated data where
the USE of the data is substantially more important than the updates. (For
example, the TIS read-write locks on Tru64 UNIX were developed specifically to
replace a completely broken attempt to do it using a single mutex in the libc
exception support code. It used the construct to manage access to the code
range descriptor list for stack unwinding; or to update it with a newly loaded
or generated code range. Read preference was appropriate, and sufficient.)

Write preference can be better when you don't care principally about
"throughput", or where multiple readers are really relatively rare; and where
operating on stale data is worse than having the readers wait a bit. (Or where
you simply cannot tolerate the possibility of a starving writer wandering the
streets.)

A generally good compromise is a modified FIFO where adjacently queued readers
are batched into a single wakeup; but that still constrains reader concurrency
over read preference and increases data update latency over writer preference.
Like all compromises, the intention is more to keep both sides from being angry
enough to launch retaliatory nukes, rather than to make anyone "happy". It does
avoid total starvation, but at a cost that may well be unacceptable (and
unnecessary) to many applications.

It wouldn't make sense for the standard to mandate any of those strategies.
Partly because none of them is "best" for everyone (or even for ANYone). Partly
because there are probably even better ideas out there that haven't been
developed yet, and it makes no sense to constrain experimentation until and
unless a clear winner is "obvious". (For example, had the standard specified a
strategy, it would have been either reader or writer, not "modified FIFO",
because the latter wasn't in wide use at the time.)

We considered a read-write lock attribute to specify strategy. We decided that
this would be premature. While we've advanced a bit in the intervening time, I
think it would still be premature. Though of course individual implementations
are welcome (and even encouraged) to experiment with such an attribute. If some
set of strategies become relatively common practice, the next update of POSIX
and UNIX (probably 2006) could consider standardizing it.

> I enjoyed the discussion about how to implement condition variables in
> Win32. What would the windows implementation of these read-write lock
> interfaces look like?

Probably already done, somewhere. Go look! I don't even want to THINK about it.
(But then, I feel that way about anything Windows-ish. Everyone, except
possibly Bill Gates, would be better off without Windows.)

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
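
For reference, a minimal sketch of the standard read-write lock interfaces
discussed above; note that nothing here selects a reader/writer preference,
for exactly the reasons given:

#include <pthread.h>

static pthread_rwlock_t rwlock;
static int shared_value;                /* the data the lock protects */

void setup(void)
{
    pthread_rwlock_init(&rwlock, NULL); /* default attributes */
}

int read_value(void)
{
    int v;
    pthread_rwlock_rdlock(&rwlock);     /* many readers may hold this at once */
    v = shared_value;
    pthread_rwlock_unlock(&rwlock);
    return v;
}

void write_value(int v)
{
    pthread_rwlock_wrlock(&rwlock);     /* a writer gets exclusive access */
    shared_value = v;
    pthread_rwlock_unlock(&rwlock);
}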

=================================TOP===============================
 Q364: Reference for threading on OS/390. 


Gee M Wong wrote:

> I've got a new project starting up, and it has been over a decade since 
> I last wrote a C/C++ program on the mainframe.  Would someone please 
> suggest a current library and reference for threading on OS/390 
> (preferably Pthread).

http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/CBCPG030/4.3

"4.3 Chapter 23. Using Threads in an OS/390 UNIX Application..."

http://www.ibm.com/software/ad/c390/cmvsdocs.htm

"OS/390 C/C++ Library Start here to access the OS/390 C/C++ 
publications available on the Web..."
=================================TOP===============================
 Q365: Timeouts for POSIX queues (mq_timedreceive()) 

> > While a thread is waiting for a message to arrive in a message queue,
> > using mq_receive(), I'd like to have a way to unblock the thread when
> > after a certain timeout no message has arrived. In pSOS the timeout is
> > a parameter of the q_receive() call.
> > Is this also possible using POSIX queues?
> 
> Well, sort of. The POSIX 1003.1d-1999 amendment to POSIX 1003.1-1996
> includes an mq_timedreceive() function that allows you to specify a
> timeout. However, it's not widely implemented yet, and likely won't be
> available on your platform. (You haven't said what your platform is;
> "POSIX" doesn't help much since there's no such operating system!)
You're right. Actually, the software should work on multiple
platforms, with Linux and pSOS being the most important. mq_timedreceive()
is not implemented in pSOS.
> 
> > If not, is there a work around for this problem?
> 
> You could always create a thread that waited for the specified interval
> and then sends a special message to the queue, awakening a waiter. 
Yes, I tried that one. It works, but I wondered if there is a more
elegant way to do this. As you pointed out, this is mq_timedreceive()
(maybe implement my own mq_timedreceive for pSOS?)
> You could also interrupt it with a signal, causing an EINTR return from
> mq_receive(); though that imposes a number of complications, including
> deciding what signal number to use, what happens (what you want to
> happen) when the thread isn't actually waiting in mq_receive(), and so
> forth.
> 
> You can't use alarm(), because the signal it generates isn't directed at
> any particular thread but rather to the process as a whole. (Although you
> can get away with it if you control the main program, so that you can
> ensure SIGALRM is blocked in all threads except the one you want to
> interrupt.)
Thanx for that one. I have to check if alarm() is supported by pSOS.
> 
> If you can wait for the platforms you care about to implement
> mq_timedreceive(), that'd be the best solution. Otherwise... choose your
> hack.
> 
> /------------------[ [email protected] ]------------------\
> | Compaq Computer Corporation              POSIX Thread Architect |
> |     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
> \-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
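
Where mq_timedreceive() exists, note that it takes an absolute timeout, so you
build one from the current time; a sketch (hypothetical wrapper, error
handling trimmed, and the buffer should really be sized from mq_getattr()):

#include <errno.h>
#include <mqueue.h>
#include <sys/types.h>
#include <time.h>

/* Wait up to 'seconds' for a message; returns bytes received,
 * 0 on timeout, -1 on other errors. */
ssize_t receive_with_timeout(mqd_t mq, char *buf, size_t len, int seconds)
{
    struct timespec abstime;
    ssize_t n;

    clock_gettime(CLOCK_REALTIME, &abstime);    /* timeouts are absolute */
    abstime.tv_sec += seconds;

    n = mq_timedreceive(mq, buf, len, NULL, &abstime);
    if (n == -1 && errno == ETIMEDOUT)
        return 0;
    return n;
}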
=================================TOP===============================
 Q366: A subroutine that gives cpu time used for the calling thread? 


> I would like to write a subroutine that gives cpu time used for the calling
> thread. I used times (I'm under Tru64 V5.0 using pthread), and it returns a
> cumulative cpu time, not the cpu time for the given thread. Any suggestions ?

The 1003.1d-1999 amendment to POSIX added optional per-thread clock functions; but I
doubt they're implemented much of anywhere yet. (And definitely not on Tru64 UNIX.)
Where implemented, you'll find that <unistd.h> defines _POSIX_THREAD_CPUTIME, and
you could call clock_gettime() with the clock ID CLOCK_THREAD_CPUTIME_ID (for the
calling thread), or retrieve the clock ID for an arbitrary thread (for which you
have the pthread_t handle) by calling pthread_getcpuclockid().

(I'd like to support this, and a lot of other new stuff from 1003.1d-1999 and
1003.1j-2000, as well as the upcoming UNIX 2001. But then, there are a lot of other
things we'd like to do, too, and I couldn't even speculate on time frames.)

Whether there are any alternatives or "workarounds" depends a lot on what you're
trying to accomplish.

In any case, times() MUST return process time, not thread time. That's firmly
required by the standard. Otherwise, times() would be broken for any code that
wasn't written to know about threads; which is most of the body of UNIX software.

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
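
Where the _POSIX_THREAD_CPUTIME option is supported, the pattern looks roughly
like this (a sketch, not tested on any particular platform):

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Report CPU time consumed by the calling thread, if the option exists. */
void print_my_cpu_time(void)
{
#ifdef _POSIX_THREAD_CPUTIME
    struct timespec ts;
    clockid_t cid;

    /* Either use the calling thread's own clock directly ... */
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    printf("this thread: %ld.%09ld s of CPU\n", (long)ts.tv_sec, ts.tv_nsec);

    /* ... or fetch the clock ID for an arbitrary pthread_t handle. */
    if (pthread_getcpuclockid(pthread_self(), &cid) == 0) {
        clock_gettime(cid, &ts);
        printf("via clock id: %ld.%09ld s of CPU\n", (long)ts.tv_sec, ts.tv_nsec);
    }
#else
    printf("per-thread CPU-time clocks not supported here\n");
#endif
}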

=================================TOP=============================== 
 Q367: Documentation for threads on Linux 

> "Dan Nguyen"  wrote in message
> news:[email protected]...
> > Robert Schweikert  wrote:
> > > I am looking for some documentation for threads on Linux. What I am
> > > after is some idea what is implemented, what works, what doesn't. Where
> > > the code is, and what is planned for the future.
> >
> > Linux uses a 1-1 type threading model.  LinuxThreads, as it is known, is
> > a kernel-level threading implementation using the clone(2) system call (only
> > available in Linux; don't use it yourself).  It implements the pthread
> > library, so any pthread application should run correctly.
> >
> I am anything but an expert on this, but it seems pthread is not fully
> implemented on Linux.

This is correct. The essential problem is that clone() doesn't, currently,
support the creation of multiple THREADS within a single PROCESS. Instead, it
creates multiple PROCESSES that share a single ADDRESS SPACE (and other
resources). The basic distinction is that each clone()d process has its own
pid and signal actions, and that they lack a shared pending signal mask.

While these deficiencies can be critical for some code, the LinuxThreads
implementation does enough extra work "under the covers" that most threaded
applications won't notice. There are people working on solving the problems,
so you can expect them to be "short term".

> Have a look at: comp.os.linux.development.apps The thread from the 11th June
> 2001 called "sharing Pthread mutexes among processes".

POSIX provides an OPTION supporting "pshared" synchronization objects, that
can be used between processes. Implementations need not provide every option
to be "POSIX". If by "full POSIX" you choose to mean "an implementation
correctly and completely providing all mandatory and optional features and
behaviors", then I doubt any exist.
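
On implementations that do provide the pshared option, a process-shared mutex
is configured through the attributes object and placed in memory that both
processes can see, for example from mmap(); a sketch (assumes the option and
MAP_ANONYMOUS are available, error checks omitted):

#include <pthread.h>
#include <sys/mman.h>

pthread_mutex_t *make_shared_mutex(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t *m;

    /* Anonymous shared mapping: inherited across fork(). */
    m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return m;
}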

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
=================================TOP===============================
 Q368: Destroying a mutex that was statically initialized. 

Ross Smith wrote:

> David Schwartz wrote:
> >
> > > Thanks. Apparently even Mr Butenhof makes the occasional mistake :-)

Like R2D2, I have been known to make mistakes, from time to time. Still, this
particular example isn't one of them.

What I actually said was "You do not need to destroy a mutex that was statically
initialized using the PTHREAD_MUTEX_INITIALIZER macro." And you don't. You CAN, if
you want to; but you don't need to. Why should you? It's static, so it never goes
out of scope. You can't have a memory leak, because the little buggers can't
reproduce. If you want to destroy one, and even dynamically initialize a new mutex
at the same address, have at it.

> >         Let me point out one more thing: It really doesn't make sense to
> > attempt to statically initialize something that's dynamically created.
> > So you shouldn't be statically initializing a mutex that isn't global
> > anyway. And if it's global, you should never be done with it.
> >
> >         Can you post an example of a case where you are done with a statically
> > initialized mutex where it isn't obvious that dynamic initialization is
> > better?
>
> Any case where the mutex isn't global, I would have thought.
>
>   void foo() {
>     pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
>     // ...
>     pthread_mutex_destroy(&mutex);
>   }

You can't do that. It's illegal. You can ONLY use the POSIX static initializers for
STATICALLY ALLOCATED data. Nevermind that compilers will happily compile the broken
code: that doesn't mean it's not broken any more than being able to compile
"x=0;z=y/x;" means you should expect it to work. You're violating POSIX rules. The
resulting code MAY work (at least sometimes, or "appear to work" in some
situations) on SOME implementations, but it is not legal POSIX code and isn't
guaranteed to work anywhere.

Of course, this is legal:

     void foo() {
         static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
         // ...
     }

But that's not the same thing. ALL invocations of foo() share the same global
mutex. Private mutexes aren't much good, anyway. Your example is pointless unless
foo() is creating threads and passing the address of "mutex" to them for
synchronization; in which case it had better also be sure all threads are DONE with
the mutex before returning. It must also use pthread_mutex_init() to initialize
"mutex", and pthread_mutex_destroy() to destroy it before returning.

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
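
Spelled out, the dynamically initialized version of that local-mutex idea
looks like this (a sketch; the essential point is that every thread is joined
before the mutex is destroyed):

#include <pthread.h>

struct job { pthread_mutex_t lock; int counter; };

static void *worker(void *arg)
{
    struct job *j = arg;
    pthread_mutex_lock(&j->lock);
    j->counter++;
    pthread_mutex_unlock(&j->lock);
    return NULL;
}

void foo(void)
{
    struct job j;                       /* automatic storage is fine here */
    pthread_t t1, t2;

    pthread_mutex_init(&j.lock, NULL);  /* dynamic init, not the static macro */
    j.counter = 0;

    pthread_create(&t1, NULL, worker, &j);
    pthread_create(&t2, NULL, worker, &j);

    pthread_join(t1, NULL);             /* make sure every user is done ...  */
    pthread_join(t2, NULL);
    pthread_mutex_destroy(&j.lock);     /* ... before destroying the mutex   */
}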

> 
> I am looking for some documentation for threads on Linux. What I am
> after is some idea what is implemented, what works, what doesn't. Where
> the code is, and what is planned for the future.
> 
The best resource for LinuxThreads documentation is in the info pages
for libc ('info libc') -- under the headings ADD-ONS -> `POSIX Threads'.

HTH,
Artie Gold, Austin, TX   

I found cprof at http://opensource.corel.com/cprof.html very useful.

Regards,
  Erik.

On Wed, 13 Jun 2001, stchang wrote:

> We are developing multithreaded program code. However, it does not have
> good performance. The performance is about 1.5X compared with
> non-multithreaded code. Sometimes it is slower than non-multithreaded code. Can
> someone give me some ideas about how to profile multithreaded code or
> analyze threads?
>
> Thanks!

stchang wrote:

> We are developing multithreaded program code. However, it does not have
> good performance. The performance is about 1.5X compared with
> non-multithreaded code. Sometimes it is slower than non-multithreaded code. Can
> someone give me some ideas about how to profile multithreaded code or
> analyze threads?

The first, and often the best tool to apply is common sense.

You don't say on what hardware (or OS) you're running. Actually, for an
untuned application, if you're running on a 2-CPU system, 1.5X speedup
isn't at all bad.

However, a performance decrease isn't particularly surprising, either. It
means you're not letting the threads run in parallel. There are many
possible reasons, some of which are due to "quirks" of particular
implementations. (For example, on Solaris, you need to use special
function calls to convince the system you're on a multiprocessor.)

The most common reasons are that your application is designed to "wait in
parallel". Contention for a common resource is the most common problem.
For example, all threads do all (or nearly all) their work holding one
particular application mutex. No matter how many threads you have, they
can't do anything significant in parallel, and they waste time in
synchronization and context switching. Guaranteed to perform worse than
single-threaded code, trivial to write.

The contention may not even be in your code. If they all do I/O to a
common file (stream or descriptor), they will spend time waiting on the
same kernel/C/C++ synchronization. If that I/O drives the performance of
the application, you lose.

The problem might even be in how you're using your hardware. When
processors in an SMP or CC-NUMA system repeatedly write the same "block"
of memory, they need to synchronize their caches. If all your threads are
busily doing nothing else on all available processors, you can reduce
those processors to doing little but talking to each other about what
cache data they've changed.

Adding threads doesn't make an application "go faster". Careful design
makes it go faster. Careful and appropriate use of threads is one TOOL a
developer can use when designing a faster application. But it affects
every aspect of the design (not just implementation) of the code.

Sometimes you do need to analyze the behavior of running code, and it's
nice to have good tools. (If you can run on Tru64 UNIX or OpenVMS, by the
way, Visual Threads does an awesome job of helping you to understand the
synchronization behavior of your program.) Regardless of the tools,
though, good performance comes from careful design and thorough
understanding of what your application does, and the demands it places on
the OS and hardware; the sooner in the design cycle you accomplish this,
and the more completely you apply the knowledge, the better the results
will be.

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
=================================TOP===============================
 Q369: Tools for debugging overwritten data. 


> Oh btw, suppose I've got a local variable in a thread function, and I KNOW
> something overflows or overwrites it (when NOT supposed to happen), is there
> a way to find out who trashes it ?

There are several tools for this:
 - purify by Rational Software (www.rational.com)
   Very good tool, but expensive
 - Insure by Parasoft
   Very good, has a few quirks but nothing serious. Can catch illegal
   parameters to systems calls too.
If your budget can't handle the above tools, or if you can limit the
trashing to the heap, you can look into:
 - electric fence
   freeware, pretty good debug heap. I have encountered a few problems with
   fork()ing multithreaded programs under Solaris, though.
 - miscellaneous debug heaps
 - idh (www.platypus.adsl.dk/idh/index.html)
   (disclaimer: I wrote it)
=================================TOP===============================
 Q370: POSIX synchronization is limited compared to win32. 


On Fri, 20 Apr 2001 23:28:55 -0400, Timur Aydin  wrote:
>Hello everybody,
>
>After quite some time doing multithreaded programming under win32, I have
>now started to do development under Linux using LinuxThreads. However, I am
>noticing that the synchronization objects are quite limited compared to the
>ones under win32.

However, the nature and variety of the objects provided by Win32 leave much to
be desired. Events are simply not very suited for solving a wide variety of
synchronization problems.  It's a lot easier to solve synchronization problems
with condition variables because they have no programmer visible state. The
logic is based entirely on the state of your own data. Objects like events or
semaphores carry their own state; to solve a synchronization problem, the
programmer must bring about some meaningful association between the semaphore's
state and the state of the program. In my programming experience, such
associations are fragile and difficult to maintain.

>As far as I have learned, it is not possible to do a timed
>wait on a mutex or a semaphore.

Timed waits on mutexes are braindamaged for most kinds of work. They
are useful to people working in the real-time domain, so the 200X draft
of POSIX has added support for timed mutex waits---it was due to pressure
from some real time groups, apparently. In real time applications, the
duration of a critical region of code may be determined precisely,
so that a timed out mutex wait can serve as a watchdog.
You can find the implementation of pthread_mutex_timedlock in glibc 2.2. 
For reasons of efficiency, not every mutex type supports this operation, just
the default one.  Glibc 2.2 also adds barriers, and the POSIX timer functions:
timer_create and friends.

Also realize that the Linux pthread_mutex_t is a lot closer to the Windows
CRITICAL_SECTION than to the Windows mutex. Note that there is no timed
lock function for critical sections!

>Also, while under win32 the synchronization objects can have both
>interprocess and intraprocess scope, under linux the only object that can do
>this is the semaphore. 

The traditional UNIX semaphore, that is.

>So you can't have a mutex or a condition object that
>can be accessed by separate processes.

There is a provision in the POSIX interface for process shared mutexes and
conditions, but it's not implemented in Linux.

>And, lastly, it is not possible to
>wait on multiple objects simultaneously.

Again, this is a braindamaged concept to begin with, and severely limited
in Windows (only 64 handles can be waited on). Not to mention that
the WaitForMultipleObjects function is broken on Windows CE, so it
cannot be considered portable across all Win32 platforms.
Lastly, it has fairness issues: under the ``wait for any'' semantics, the
interface can report the identity of at most one ready object, regardless of
how many are actually ready. This can lead to one event being serviced
with priority over another one, depending on its position in the array.

With condition variables, your program is waiting for a *predicate* to become
true. The condition variable is just a place to put the thread to sleep.
If you want to wait for more than one predicate, just form their logical
conjunction or disjunction as needed, and ensure that signaling of the
condition variable is done in all the right circumstances, e.g.

    /* wait for any of three predicates */

    while (!predicate1() && !predicate2() && !predicate3())
    {
        pthread_cond_wait(&cond, &mutex);
    }

This is equivalent to waiting on three events. The thread is parked in some
wait function, and can wake up for any of three distinct reasons.
A better structure might be this:

    int p1 = 0, p2 = 0, p3 = 0;

    /* mutex assumed locked */

    for (;;) {
        p1 = predicate1();
        p2 = predicate2();
        p3 = predicate3();

        if (p1 || p2 || p3)
            break;

        pthread_cond_wait(&cond, &mutex);
    }

    if (p1) {
        /* action related to predicate1 */
    }

    if (p2) {
        /* action related to predicate2 */
    }

    if (p3) {
        /* action related to predicate3 */
    }

Multiple object waiting is primarily useful for I/O multiplexing; for this you
have the poll() or select() function.  Both of these functions provide feedback
about which file descriptors are ready, thereby avoiding problems of fairness,
and can handle much more than 64 descriptors.
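
As a rough illustration (not from the original post), a poll()-based wait over
two descriptors might look like the sketch below; fd1 and fd2 are assumed to be
already-open descriptors, and error handling is minimal.

    #include <poll.h>
    #include <stdio.h>

    /* Sketch: multiplex two descriptors and learn exactly which ones are
       ready, unlike the "report at most one" semantics discussed above. */
    void wait_for_input(int fd1, int fd2)
    {
        struct pollfd fds[2];
        int n;

        fds[0].fd = fd1;  fds[0].events = POLLIN;
        fds[1].fd = fd2;  fds[1].events = POLLIN;

        n = poll(fds, 2, -1);            /* block until something is ready */
        if (n < 0) {
            perror("poll");
            return;
        }
        if (fds[0].revents & POLLIN) {
            /* read from fd1 */
        }
        if (fds[1].revents & POLLIN) {
            /* read from fd2 */
        }
    }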

=================================TOP===============================
 Q371: Anyone recommend us a profiler for threaded programs? 

On 2 Jul 2001, Bala wrote:

> Hi, can anyone recommend us a profiler, possible free, that will profile
> multi-threaded programs based on pthread?
>
> Our development platform is Linux x86 and Solaris. We've looked at gprof, but
> accoding to the docs it says that it won't do MT apps.
>
Maybe (?) the Linux Trace Toolkit  can help you?

-- 
"I decry the current tendency to seek patents on algorithms. There are
 better ways to earn a living than to prevent other people from making
 use of one's contributions to computer science."  D.E. Knuth, TAoCP 3
=================================TOP===============================
 Q372: Coordinating thread timeouts with drifting clocks. 


> > > Hello all,
> > >
> > > We have a small problem in our application. It is that our computer
> > > (running Solaris 7 on UltraSparc) is synchronised with several other
> >  >
> >    I find this surprising. My experience has been that the SPARC systems
> > have extremely stable clocks, almost good enough for use as time
> > references. Even without NTP the worst drift I ever saw with any of our
> > SPARC systems was 2 seconds per month.
> >    Are you sure the NTP server is stable?
> >
> 
> The problem is not so much the stationary situation. Our system must be
> synchronised to an external system that may or may not be synchronised with UTC.
> As long as everything is running stationary, everything is fine. However, from
> time to time it is necessary to change the time reference in the external system
> and, hence, also for the Sparcs. This creates the problem. If we cannot find out
> when the clock changes or acter for it by using relative times, we will have to
> make a manual procedure whereby the spark software is reset so to speak. We
> would like to avoid this.
> 


You could look into some of the timer facilities.  See setitimer or clock_settime.
It may be that one of the clock types will take into account adjustments to the
system clock.  You should use sigwaitinfo or sigtimedwait to wait for the timer
signals rather than signal handlers so that you don't run into the problem that
practically nothing is async-safe with respect to threaded code.  sigtimedwait
appears to take a relative time interval, but I don't know what clock type it
uses then or whether that can be changed.
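
As a rough sketch (not from the original posts), a program could collect the
timer ticks synchronously like this; SIGRTMIN and the one-second interval are
arbitrary choices, and error checking is omitted.

    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    /* Sketch: deliver SIGRTMIN once a second via a POSIX timer and wait for
       it with sigwaitinfo() instead of a signal handler.  Where available,
       CLOCK_MONOTONIC is not affected by steps to the system clock. */
    int main(void)
    {
        sigset_t set;
        struct sigevent sev;
        struct itimerspec its;
        timer_t timerid;
        siginfo_t info;

        sigemptyset(&set);
        sigaddset(&set, SIGRTMIN);
        pthread_sigmask(SIG_BLOCK, &set, NULL);    /* block, don't handle */

        memset(&sev, 0, sizeof sev);
        sev.sigev_notify = SIGEV_SIGNAL;
        sev.sigev_signo = SIGRTMIN;
        timer_create(CLOCK_REALTIME, &sev, &timerid);

        its.it_value.tv_sec = 1;     its.it_value.tv_nsec = 0;    /* first tick */
        its.it_interval.tv_sec = 1;  its.it_interval.tv_nsec = 0; /* then every 1s */
        timer_settime(timerid, 0, &its, NULL);

        for (;;) {
            if (sigwaitinfo(&set, &info) == SIGRTMIN)
                printf("tick\n");
        }
    }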

Secondly, whatever is changing the system clock should be using adjtime() so you
don't have problems like this.  That's the whole point of adjtime.

Thirdly, I don't know why people think you can have synchronized clocks.  This
is meaningless and the laws of physics don't support that kind of thing.  All
you can do is determine approximate synchronicity of two clocks with some
amount of accuracy and certainty.  And that's making a lot of assumptions
and probably ignoring relativistic effects.  That's all NTP does.

And if you can deal with unsynchronized clocks, then having a clock appear to go
backwards once in a while is nothing.

Joe Seigh
=================================TOP===============================
 Q373: Which OS has the most conforming POSIX threads implementation? 


Mine, of course, on Tru64 UNIX V5.1A. (Oh yeah, but it hasn't actually 
released yet... ;-) )

Seriously, though, any "fully conforming POSIX threads implementation", 
right now, is broken and shouldn't be used. There are several serious bugs 
in POSIX 1003.1-1996 that should not be implemented. (These have been fixed 
for 1003.1-2001, but that's still in balloting and thus isn't yet really a 
standard.)

So what you really want is an implementation that's "sufficiently 
conforming" without being "fully conforming". So do you want one that does 
everything in the standard that SHOULD be done and nothing that SHOULDN'T 
be done? Is it a firm requirement that this implementation have no bugs, 
known or unknown? Yeah, that's a grin, but it's also serious, since an 
implementation with conformance bugs can't really be said to conform, at 
least by any abstract and objective definition of "conform". Pragmatically, 
the best objective use of the term would be to claim the UNIX 98 brand, 
proof of having passed the VSTH test suite; but that suite isn't perfect.

Once you loosen the bounds of "100% strict conformance", we get to the 
important issue... which is deciding what meets your actual needs. The 
current LinuxThreads implementation falls well short of full conformance; 
but while it fails to implement many features of the standard, it also so 
far as I know fails to implement any of the standard's bugs. For most 
applications, that implementation is going to be quite sufficient.

IBM is working on NGPT ("Next Generation" POSIX threads), which they 
claim will relieve most if not all of the conformance bugs in LinuxThreads. 
However, as far as I can tell (as it appears to require no substantial 
kernel changes) it will inevitably add a set of bugs that the developers 
apparently like (or at least accept), and will share many of the 
"weaknesses" (some of which many consider actual conformance bugs) of the 
Solaris and AIX two-level scheduler implementations. They appear to be 
doing this principally because current limitations of Java encourage a 
"thread per client" design pattern, and "one to one" kernel thread 
implementations such as LinuxThreads tend to perform poorly with 
unreasonably large numbers of threads. They will give up a lot to gain 
support for "thousands of threads" servers that violate many principles of 
good threaded design and probably won't work well anyway.

So... what particular conformance do you want? ;-)

Or, to put it another way... choose your own most personally useful 
definition of "conformance", and look for the implementation that most 
closely implements it.

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
=================================TOP===============================
 Q374: MT random number generator function. 

 

Chris M. Moore wrote:

> On Tue, 10 Jul 2001 17:30:36 -0700, TC Shen 
> wrote:
> 
>>Hello:
>>  Take this as an example, I need a random number generator function
>>like either random() or rand() or rand_r() to be used in a
>>multi-threaded application
> 
> Generally, functions ending in _r are re-entrant i.e. MT-safe.

It's a little more complicated than that. When there is an _r version of a 
function, the original version has an interface (involving static data) 
that cannot trivially be made thread-safe. In such cases, POSIX added the 
_r version (with no implicit static data) and specified that the original 
form NEED NOT be made thread-safe.

On many implementations, those original functions nevertheless ARE 
thread-safe. For example, rand() could be written to synchronize access to 
a shared static seed, or to maintain the seed in thread-specific data. 
Either would be thread-safe, though the behavior would be quite different. 

The rand_r() interface, though, provides more flexibility. If you provide 
your own external synchronization, you can share the explicit context used 
by rand_r() between threads. You can also use multiple random number 
sequences in a single thread. And as long as you DON'T share the context, 
you have no overhead for unnecessary synchronization or thread-specific 
data indirection.
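
A minimal sketch of the unshared-context case (the seeds, thread count, and
names are illustrative, and error checking is omitted):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Each thread owns its seed, so rand_r() needs no locking at all. */
    static void *worker(void *arg)
    {
        unsigned int seed = 0x9e3779b9u ^ (unsigned int)(long)arg;
        int i;

        for (i = 0; i < 5; i++)
            printf("thread %ld: %d\n", (long)arg, rand_r(&seed));
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        long i;

        for (i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }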

POSIX specifies that ALL ANSI C and POSIX functions are thread-safe, with 
the exception of a short list. (Mostly, though not exclusively, those 
replaced with _r variants.) The XSH6 specification (UNIX 98) requires that 
nearly all UNIX 98 interfaces (with a few more non-POSIX exceptions) must 
be thread-safe. This is important, as a lot of commonly used interfaces 
(for example, select) are not part of POSIX, but are in UNIX 98.

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
=================================TOP===============================
 Q375: Can the main thread sleep without causing all threads to sleep? 


> If a main() creates multiple threads that are off executing their
> specified routines, can the "main()" thread then call sleep(3) with_out_
> causing any of the threads to sleep?
> 
> Essentially:
> 1) create and start threads
> 2) sleep (yourself) while threads do the job
> 3) wake up and stop the threads, clean-up, and end execution
> 
> I have looked in /usr/include/unistd.h and at the Single Unix Spec but
> am not sure of the behavior

You're asking two separate questions. Reference to the Single UNIX 
Specification (version 2, which includes threads; aka UNIX 98) is really 
off topic if you're interested in Linux, because Linux bears only vague and 
incomplete similarities to UNIX 98. (Especially when you get to threads.) 
In terms of UNIX 98 conformance, LinuxThreads is full of large and serious 
bugs. (Though this is a pointless and somewhat unfair criticism because no 
aspect of Linux actually claims "conformance"... rather, the code strives 
to be compatible where that's practical. This is a good goal, and only 
those who depend on it can really judge whether what they've achieved is 
"good enough".)

POSIX (and UNIX 98) absolutely require that sleep() function as specified 
for the calling thread without any effect at all on other threads. However, 
on implementations that don't/can't claim full POSIX conformance, sleep() 
is one of the functions voted most likely to be broken because the 
traditional implementation relies on SIGALRM and won't work correctly with 
threads. When such implementations support nanosleep(), that's more likely 
to work correctly, though there are no guarantees. (Again, POSIX requires 
that BOTH work correctly, and once someone's broken one rule, it makes 
little sense to bet they've followed other related rules... at least, don't 
bet very much.)

However, sleeping for a period of time after creating threads, and then 
assuming that those threads have done anything at all (much less finished) 
is extremely bad practice. Instead, set up some explicit synchronization. 
If you really expect the threads to complete, use pthread_join() to wait 
for them to finish. If you want them to sit there for some period of time 
(recognizing they may in fact have done absolutely nothing at the end) and 
then terminate, you can use pthread_cancel() (for example).

If, on the other hand, you want to be sure that the threads have "done 
something", but still make them quit after some reasonable period of time, 
set up your own synchronization protocol. Have the threads report on their 
progress in some shared data that the main thread can monitor. When they've 
"done enough" and you've waited "long enough", set some flag that tells the 
threads to terminate at the next convenient time. Or cancel them.
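
A minimal sketch of that last suggestion (one worker, a mutex-protected stop
flag; the names and the 5-second wait are illustrative):

    #include <pthread.h>
    #include <unistd.h>

    /* Sketch: a shared flag tells the worker to stop; main() then joins it
       instead of guessing when it has finished. */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int stop = 0;

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            int done;

            pthread_mutex_lock(&lock);
            done = stop;
            pthread_mutex_unlock(&lock);
            if (done)
                break;
            /* ... do one unit of work ... */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;

        pthread_create(&t, NULL, worker, NULL);

        sleep(5);                  /* only the main thread sleeps */

        pthread_mutex_lock(&lock);
        stop = 1;                  /* ask the worker to finish up */
        pthread_mutex_unlock(&lock);

        pthread_join(t, NULL);     /* wait for it, don't guess */
        return 0;
    }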

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
=================================TOP===============================
 Q376: Is dynamic loading of the libpthread supported in Redhat? 


In article <[email protected]>, CKime wrote:
>I am wondering if the dynamic loading of the libpthread that
>ships with Linux RedHat 6.2 is supported.

No. libpthread is integrated into libc. When a program is linked against
libpthread, the behavior of libc changes because libpthread overrides a
few symbols in libc. This provides thread safety to some internal modules
within libc (example: malloc and stdio become thread safe), and adds
some necessary multithreaded semantics to certain functions (example:
fork() calls pthread_atfork handlers, sets up threading environment in
child process).

Not only can you not dynamically load the threading library, but in
general you cannot dynamically load a shared library which uses threads
into an executable that was not compiled and linked for multithreading.

If some program is to support multithreaded plugins, it should be
compiled as a multithreaded application.
=================================TOP===============================
 Q377: Are reads and writes atomic? 


> Suppose an integer variable is shared between threads.  Is it safe to
> assume that reads and writes are atomic (assuming reads and writes are
> single instructions)?

How big is an int?

Does the machine provide instructions to read and write data with that size 
atomically?

Does the compiler always generate the appropriate instructions to read and 
write 'int' data atomically? (E.g., "load 32-bit" rather than loading the
enclosing 64-bit cell and masking/shifting.)

Does the compiler/linker always ALIGN data of that size "naturally"?

(If not) Does the machine still read and write data of that size atomically 
when the alignment is not natural?

> I suspect the answer is 'no, no standard provide such a guarantee',
> but then I'd like to know on what, if any, kind of hardware I can
> expect it to fail.

You're trying to rely on a bunch of different guarantees from the hardware 
up through the compiler and linker. You won't find ALL of these guarantees 
in any single document. The C and C++ languages do NOT require that 
compilers must generate atomic access sequences even to datatypes to which 
the hardware may support atomic access, so the standards don't help; you 
need to find out from the particular compiler you're using, for the 
particular hardware you're using.

To sum it all up, while this sort of assumption will prove viable on many 
systems, it is 100% implementation-specific. How much portability do you 
require? To which implementations? What are the consequences of someone 
later porting to another implementation (without the guarantees you want), 
if the people doing the port fail to notice your dependency? Is it really 
worth the risk?
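
If that risk isn't acceptable, the boring but portable answer is to put the
shared int behind a mutex; a minimal sketch (names are illustrative):

    #include <pthread.h>

    /* Sketch: don't rely on "int access is atomic"; serialize it instead. */
    static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
    static int counter;

    void counter_set(int v)
    {
        pthread_mutex_lock(&counter_lock);
        counter = v;
        pthread_mutex_unlock(&counter_lock);
    }

    int counter_get(void)
    {
        int v;

        pthread_mutex_lock(&counter_lock);
        v = counter;
        pthread_mutex_unlock(&counter_lock);
        return v;
    }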

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
=================================TOP===============================
 Q378: More discussion on fork(). 


> Kaz Kylheku wrote:
> [...]
>> Unlocking is problematic for implementors, because the mutexes may
>> contain stuff that only makes sense in the parent process, such as a
>> queue of threads that are blocked on the mutex, threads which only exist
>> in the parent process! An unlock operation could try to wake one of these
>> threads. Making it work requires contorted hacks, and overhead.

But it needs to be done, so the only real question is where the hackery and 
overhead lives. Doing it once, in a central place (the thread library) 
reduces the risk of errors and minimizes the overhead.

>> Also mutexes and other objects may contain internal finer-grained
>> lock variables, which are not properly taken care of during the fork.
>> (and taking care of them would require the interaction between fork and
>> every instance of a mutex).  The fork could happen exactly as some thread
>> in the parent has acquired some spinlock variable within a mutex.  The
>> child's unlock operation will then deadlock on the spinlock.
>> 
>> It makes much more sense to require the user to reinitialize the locks,
>> which will obliterate any parent-specific cruft they may contain,
>> and give them a new lease on life, so to speak. The same goes for any
>> sync objects which may be in arbitrary use in the parent when the fork
>> takes place and whose use is expected in the child.

No, it really doesn't. For one thing, that means that pthread_mutex_init() 
can't check for EBUSY. (Or, if you require the child cleanup to first 
destroy mutexes, that pthread_mutex_destroy() couldn't check for locked 
state, or waiters, etc.) Or, alternatively, that they would need to be 
somehow aware of the "odd" environment and behave differently... overhead 
completely wasted in normal calls. All of this is essentially what a normal 
POSIX implementation must do internally inside of fork() for the child, and 
it's easier for the thread library to do it "automagically" than to require 
each "module" of the application to do it and get it right. (In any case, 
the thread library, and C runtime, etc., must all do this to their own 
objects.)
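
For reference, the application-level version of this machinery is
pthread_atfork(); a minimal sketch of a module protecting its own mutex across
fork() (module_lock and module_init are illustrative names):

    #include <pthread.h>

    /* Sketch: prepare locks before the fork; parent and child each unlock
       their copy afterwards, so the lock is consistent on both sides. */
    static pthread_mutex_t module_lock = PTHREAD_MUTEX_INITIALIZER;

    static void prepare(void) { pthread_mutex_lock(&module_lock); }
    static void parent(void)  { pthread_mutex_unlock(&module_lock); }
    static void child(void)   { pthread_mutex_unlock(&module_lock); }

    void module_init(void)
    {
        pthread_atfork(prepare, parent, child);
    }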

> wouldn't it make much more sense if the standard would
> define a set of "fork-points" (e.g. fork itself, _mutex_lock,
> _mutex_unlock, cond_wait, _once, etc) which would allow impls.
> make consistent *total* replica (including all threads)
> once *all* threads meet each other at fork points; wouldn't
> that approach make it possible to fork a multithreaded process
> without "manual" synchronization in prepare/parent handlers ?
> is it just "too expensive" or am i missing something else ?

What if some threads never reach such a point? How much does it cost the 
implementation to check in each of those points... whether or not it'll ever 
be necessary?

How do you tell threads they've been "cloned" behind their backs? This is 
critical if they're working with external resources such as files (you 
don't usually want two threads writing identical data to the same file), 
but may be necessary even if they're not. Solaris foolishly designed UI 
thread fork() to clone all threads. It even "allowed" that threads 
currently blocked in syscalls capable of returning EINTR might do so in the 
child... an ugly concession to implementation simplicity of no real value 
to application developers. The only solution would be to run some form of 
"atfork" handler in each active thread so that it can decide whether to 
reconfigure or shut down. This would be far more expensive and more 
complicated than the current fork-to-one-thread and "single stream" atfork 
handler mechanism. (Not that I'm arguing the POSIX model is "simple"... 
it's not. But the alternatives are even worse.)

We really should have just left the original POSIX alone. When you fork, 
the child can't do anything but exec(); possibly after other calls to 
async-signal safe functions. (But nothing else.)

The POSIX 1003.1d-1999 amendment, by the way, adds posix_spawn() to combine 
fork() and exec() in a single operation. (Not, of course, a "simple" 
operation, since in order to make this useful to shells, (which is 
essentially a minimum definition of "useful" in this context), 
posix_spawn() comes with a "veritable plethora" of ancillary operations and 
data types to specify changes to environment, files, and all the other 
"tweaks" a shell would commonly make between the child's return from fork() 
and the call to exec().)
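
A minimal sketch of posix_spawn() in use (the "/bin/ls -l" command and the
run_ls name are arbitrary; error handling is minimal):

    #include <spawn.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    extern char **environ;

    /* Sketch: run a command without an explicit fork()/exec() pair. */
    int run_ls(void)
    {
        pid_t pid;
        char *argv[] = { "ls", "-l", NULL };
        int rc = posix_spawn(&pid, "/bin/ls", NULL, NULL, argv, environ);

        if (rc != 0) {
            fprintf(stderr, "posix_spawn: error %d\n", rc);
            return -1;
        }
        return waitpid(pid, NULL, 0) == pid ? 0 : -1;
    }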

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
=================================TOP===============================
 Q379: Performance differences: POSIX threads vs. ADA threads? 


@res.raytheon.com wrote:

> thanks in advance to any replies to this post, they are much
> appreciated.  this is the situation:  i am trying to debug a real-time
> system running on a solaris.  it was implemented using ada threads.  the
> question i have and the issue i am not to clear on is the relationship
> between posix threads and realtime systems.  is there an advantage to
> using posix threads besides portability?  i know that solaris implements
> posix and that there are posix bindings (florist) for ada.  i read
> somewhere that posix threads have less overhead than ada threads.  i
> guess my ultimate question is...  in a realtime system running on
> solaris, written in ada, would there be any significant performance
> difference in implementing posix threads over ada threads, or vice
> versa?

Though I don't know the internals of the Ada runtime you're using, most 
likely "Ada tasks" and "POSIX threads" are no different at all below the 
surface layers. Ada likely uses either the POSIX or the older UI thread 
interfaces to create and manage threads. (It could also use the lwp 
syscalls, as the Solaris thread libraries do... however that would cause 
problems, would have no real advantages, and wouldn't really change the 
fact that the bulk of the code would be common.) The only way Ada tasks 
would be "really different" would be if the runtime "imagines" its own pure 
user-mode threads, multiplexing them itself within a single "kernel entity" 
(no parallelism on an SMP, and little concurrency). If that's the case, run 
(don't walk) to a new Ada, because it's probably not even truly thread-safe.

So, presuming a reasonable Ada implementation that uses POSIX or UI thread 
interfaces, is there any advantage to going "behind the runtime's back" to 
create and manage your own POSIX threads? Depends a lot on which Ada, and 
what you're trying to accomplish. The behavior of the threads won't be 
fundamentally different. Your own POSIX thread won't "run faster" than a 
POSIX thread (or a UI thread) created by the Ada runtime.

The main disadvantage with the original Ada spec was that rendezvous is a 
nice simple model for ordinary client/server type interactions, but a lousy 
form of general synchronization... and that was all it had. Modern Ada (Ada 
95) provides far more general and flexible forms of synchronization through 
the protected types mechanism, and could be used to write applications that 
would match native POSIX thread C code. (Often with less effort, presuming 
you're comfortable with Ada 95, and with perhaps even greater portability, 
as long as you have an Ada 95 everywhere... ;-) )

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
=================================TOP===============================
 Q380: Maximum number of threads with RedHat is 255? 


>> With a RedHat 7.1 install (Intel architecture), the stock kernel comes
>> with a limit of 65535.  When running a simple program which sits in a loop
>> creating threads & detaching them, the maximum number of threads I can
>> reach is 255 - the same limit as a stock kernel from RedHat 6.1.  On a modified
> 
> 
> Too bad no one has an answer for this... I'm going to *attempt* to
> figure this one out since one of the programs I'm writing is coming to
> this same blockade.
> 

I imagine you hit the max user processes limit; you can change it with 
ulimit -u 'num' in bash or limit maxproc 'num' under tcsh. You can also
change the default soft/hard limits in /etc/security/limits.conf.

 Kostas Gewrgiou
=================================TOP===============================
 Q381: Best MT debugger for Windows... 


ahem people,

Pardon this post, by me, a rather beginner to
multithreading, both under win32 and nix, but, IMO, it would
be IDIOTIC (!!) to use a win32 debugger when debugging win32
threads.  You want a debugger that can stop the state of the
system exactly where and when you want it, and do many many
other wonderful things, without relying on the win32
debugging API, then just use SoftIce!  I've used it
successfully for quite some time, enough to know that it is
the best debugger in the world :) It has versions for
win9x/NT/2000, and is much more reliable than any other
win32 debugger I've seen..

Regards,
E
=================================TOP=============================== 
 Q382: Thread library with source code ?  


Yang Ke  writes:
>    Is there any implementation of user-level thread library 
> with source code ? 

State Threads: http://oss.sgi.com/projects/state-threads/
Pth: http://www.gnu.org/software/pth/
Quickthreads: http://www.cs.washington.edu/research/compiler/papers.d/quickthreads.html
-- 
Michael J. Abbott        [email protected]        www.repbot.org/mike
=================================TOP===============================
 Q383: Async cancellation and cleanup handlers. 


In article <[email protected]>, [email protected] wrote:
>after a (likely naive) look at pthread_cleanup_push and _pop, i'm
>puzzled by the race condition of:
>
>pthread_cleanup_push(cleanup_mutex, &mylock);
>/* what if we are cancelled here (requires async cancellation,
> * deferred is fine) */
>pthread_mutex_lock(&mylock);

Code which requires cleanup of locks or resources should not be doing
async cancelation. Because the cleanup code has no way to even
safely execute, never mind determine what needs to be cleaned up!

>does posix require an implementation of pthreads to work around this
>condition?  or can one simply not safely use cleanup_* in the face of
>async cancellation?

POSIX does not require any functions to be async-cancel-safe, other
than the ones that manipulate the thread's cancelation state.

Firstly, Draft 7 defines ``Async-Cancel-Safe'' Function in 3.23 as:

    A function that may be safely invoked by an application while the
    asynchronous form of cancelation is enabled. No function is
    async-cancel-safe unless explicitly described as such.

Then, paragraph 2.9.5.4 (Async-Cancel Safety) says:

    The pthread_cancel(), pthread_setcancelstate(), and
    pthread_setcanceltype() are defined to be async-cancel safe.
    
    No other functions in this volume of IEEE Std 1003.1-200x are
    required to be async-cancel-safe.

So, while you have asynchronous cancelation enabled, these are the only
three library functions you may use.
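
For deferred cancellation the usual idiom avoids the race entirely, because
neither pthread_mutex_lock() nor pthread_cleanup_push() is a cancellation
point; a rough sketch (illustrative names):

    #include <pthread.h>

    /* Sketch: take the lock first, then register the handler, then wait.
       With deferred cancellation there is no window where the handler could
       run without the thread owning the lock. */
    static pthread_mutex_t mylock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  mycond = PTHREAD_COND_INITIALIZER;
    static int ready;

    static void cleanup_mutex(void *arg)
    {
        pthread_mutex_unlock((pthread_mutex_t *)arg);
    }

    void *waiter(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&mylock);
        pthread_cleanup_push(cleanup_mutex, &mylock);
        while (!ready)
            pthread_cond_wait(&mycond, &mylock);   /* cancellation point */
        pthread_cleanup_pop(1);                    /* run handler: unlock */
        return NULL;
    }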
=================================TOP===============================
 Q384: How easy is it to use pthreads on win32? 

> ok, my question is this, how easy is it to use pthreads on win32? I mean,
> does anyone here use them or even recommend them above win32 threads? just
> thought I would post some questions on things I wouldn't mind some help
> with

At my company we use the pthread implementation from
http://sources.redhat.com/pthreads-win32 extensively because we need a
portable thread library between various unix flavours and win32.

>
> 1.    are they just as fast as win32 threads?
>
> 2.    as they implemented as a wrapper for win32 threads underneath (hence
> on win32, they are a wrapper and         not a whole solution)

They are implemented as a wrapper for win32 threads, which is not unusual
for a portable thread lib. This causes some overhead, but it depends on your
application whether that is a problem or not. In one specific application
where we use the above mentioned pthread implementation, we had to resort to
win32 primitives for synchronisation because of performance aspects.

>
> 3.    Can I code on win32 using these pthreads and then with just a
> recompile on a linux platform with a properly         configured pthreads
> installed, my code will compile ok and run fine?

Usually no problem.


Wolf Wolfswinkel
 
> 1.    are they just as fast as win32 threads?

yep, as long as you're not creating/destroying many threads.

> 2.    as they implemented as a wrapper for win32 threads underneath (hence
> on win32, they are a wrapper and         not a whole solution)

Quite a thin wrapper actually, so not much overhead. I actually rebuilt the
library and compressed it also using UPX (just for laughs :-)) and it 
was 6K!

> 3.    Can I code on win32 using these pthreads and then with just a
> recompile on a linux platform with a properly         configured pthreads
> installed, my code will compile ok and run fine?

Yep I've been doing this for 2 years.

> 4.    does anyone have trouble using pthreads on win32?

Not me.

> well, I think that's about it, of course, you could understand the sorts of
> questions I'm trying to ask here and answer back something I haven't asked,
> but would help me to know, that would be great
> 
> thanks for you help guys !
> 
> kosh

Padraig.
=================================TOP===============================
 Q385: Does POSIX require two levels of contention scope? 


> I have a question concernign pthreads and POSIX. Does POSIX require that
> there are two levels of contention scope supported
> (PTHREAD_SCOPE_SYSTEM, PTHREAD_SCOPE_PROCESS) for the pthreads? I
> understand that many platforms support the 1-1 model, where one
> user-level thread (ULT) maps to one kernel-level schedulable "entity".

POSIX requires an implementation to support AT LEAST ONE of the two 
contention scopes. While some may support both, that's not required. Also, 
the only way to determine which scopes are supported is to try creating the 
threads and see whether one or the other fails.

And note that anyone who can do "1 to 1" threads could easily "cheat" and 
claim to support both. There's nothing a legal/portable program can do to 
detect whether the implementation has "lied" and given it a system 
contention scope (SCS) thread when it asked for PCS (process contention 
scope). The converse isn't true: an implementation that gives a PCS thread 
when asked for an SCS thread is broken, and an application that really 
needs SCS can easily tell the difference.

(I don't see any point to such "cheating", and I certainly wouldn't endorse 
it. It's just an illustration of the fact that PCS is a way to enable the 
implementation to make certain optimizations, and places no functional 
limitations on the system.)
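
"Try it and see" might look roughly like this (start_fn and arg are
placeholders; some implementations only report the failure at pthread_create()
time, so that return value is checked as well):

    #include <errno.h>
    #include <pthread.h>

    /* Sketch: ask for system contention scope, fall back to process scope
       if the implementation says it isn't supported. */
    int create_scs_thread(pthread_t *tid, void *(*start_fn)(void *), void *arg)
    {
        pthread_attr_t attr;
        int rc;

        pthread_attr_init(&attr);
        rc = pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
        if (rc == ENOTSUP)
            rc = pthread_attr_setscope(&attr, PTHREAD_SCOPE_PROCESS);
        if (rc == 0)
            rc = pthread_create(tid, &attr, start_fn, arg);
        pthread_attr_destroy(&attr);
        return rc;
    }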

> Is anyone familiar with kernels (except Linux) supporting threads, that
> schedule processes only (as opposed to scheduling kernel threads which
> map to pthreads)? I remember some work in the early 90's about
> continuations where the kernel would employ call back functions for user
> level work within a process. Is anyone familiar with any pthreads
> implementation based on this or other similar mechanisms?

Nathan Williams already pointed out the U. Washington "Scheduler 
Activations" paper. The term "continuations" comes from a CMU research 
project applying the scheduler activations theory to the Mach 3.0 kernel. 
Their model allowed mixing scheduler activations with traditional 
synchronous kernel blocking states to ease the transition.

Both Solaris and Tru64 UNIX claim "inspiration" from the Scheduler 
Activations paper, but I know of nobody who's fully implemented scheduler 
activations (or continuations) in any "real" kernel. The Tru64 
implementation, at least in terms of behavior if not detailed implementation, 
comes the closest to the ideal; we have much more complete, and tighter, 
integration and communication between the kernel and user mode schedulers 
than anyone else. Despite long conviction that this ought to be, as the 
original paper supposed, "the best of both worlds", that goal has proven to 
be elusive in the real world.

In theory, there are more or less 3 classes of application:

1) Compute-bound, such as highly parallel (HPTC) mathematical algorithms. 
The fact is that such applications don't care a bit whether their threads 
are SCS or PCS. They'll have one thread for each available processor, and 
those threads aren't supposed to block.

2) Synchronization-bound, threads that run a lot and compete for POSIX 
synchronization objects like mutexes and condition variables. These 
applications will generally run faster with 2-level scheduling than with "1 
to 1" scheduling, because they "never" need to cross the kernel protection 
boundary.

3) I/O-bound (communications), threads that do a lot of file or network 
operations, which always involve kernel boundary crossings. Since they 
usually block in the kernel, they'll usually run faster if all the 
scheduling activity occurs in the kernel. Web and mail servers tend to fall 
into this category. Many database engines do, too. In theory, the overhead 
imposed by scheduler activations (or some reasonable approximation, like 
ours) will be small enough that such applications won't be substantially 
impacted; and that overhead will be washed out by gains in whatever 
user-mode synchronization they also do. In practice, this is not 
necessarily true, because it's really tough to get the overhead down into 
the noise level. ("In theory, there's no difference between theory and 
practice; in practice, there's no similarity.")

We have found, in practice, that many of the important applications fall 
into either category 1 (HPTC) or category 3 (communications), where 2-level 
scheduling is, on average and in general, either of little benefit or 
a moderate penalty. We've been working on creative ways to decrease the 
"upcall penalty", but it takes a lot of work from both the user and kernel 
sides, and that can be difficult to schedule since everyone's always 
overworked. We haven't made nearly as much progress as we'd hoped.

People keep talking about "improving" Linux by adding 2-level scheduling, 
and right now I have to say that's nonsense. IBM wants it apparently just 
so they can support poorly designed Java applications that push the typical 
Linux configuration's limits on process creation. While there's something 
to that, the costs are absurd, and the confusion wrought by having two 
common thread packages will be severe. I'm sure there are easier ways to 
remove the configuration restrictions. I'd much prefer to see people 
focusing on fixing the kernel limitations that prevent Linuxthreads from 
fully conforming to POSIX and UNIX 98 so we can improve industry 
portability.

I'm certainly not ready to give up on 2-level scheduling, but it's not a 
simple or easily applied technology, and we're still working on 
understanding all of the details. We're pretty much convinced that "pure 
scheduler activations" isn't workable, but there are compromises and 
adaptations that we think would probably work. Unfortunately, changing an 
existing kernel to use scheduler activations is a really big job; maybe 
bigger than simply building a new kernel from scratch. The Mach 
continuations work is probably the best course, especially where Mach is 
already there (e.g., Tru64 and Darwin/Mac OS X); but raw Mach source is a 
long way from being "product quality", and it's not easy to patch 
continuations into a much-altered commercial quality "Mach based" kernel.

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
=================================TOP===============================
 Q386: Creating threadsafe containers under C++ 

> Does anyone have some tips/book/web-pages where I can find some ideas how
> to create threadsafe containers under C++ (like lists, trees, etc.)?
> I would like to have something like a threadsafe STL or so. The STL
> provides a very well structured set of useful containers and algorithms.
> But encapsulating them in a kind of threadsafe proxy is sometimes not good
> enough. I am familiar with common pattern for concurrent programming, but
> they again are too general.
> I would be interested in how to clearly provide a useful set of functions
> as programming interface.
>

 You can look at our STL thread-safe wrappers in GradSoft Threading.

http://www.gradsoft.com.ua/eng/Products/ToolBox/toolbox.html

> For example lookup in a list and insert if the element does not exist is a
> simple example; this should be part of the container class because it
> must be an atomic operation and thus cannot be solved by sequentially calling
> lookup() and insert().
> 
> Thanks in advance for any hint,
> 
> Alex.
=================================TOP===============================
 Q387: Cancelling pthread_join() DOESN'T detach target thread? 

Alexander Terekhov wrote:

> the standard says:
> 
> "Cancelation Points
> 
>  Cancelation points shall occur when a thread is
>  executing the following functions:
>  .
>  .
>  .
>  pthread_join()..."
> 
> "If the thread calling pthread_join() is canceled,
>  then the target thread shall not be detached."
>           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> Q) WHY ???

This is, as you guessed below, a requirement for the IMPLEMENTATION. What 
it means is perhaps a little complicated, and this is one of the reasons 
that the STANDARD isn't really intended as general user documentation.

Part of the confusion is historical, and would be more obvious to someone 
familiar with DCE threads, which was an implementation of an early draft of 
the standard document.

Originally, "detach" and "join" were separate operations. "Join" merely 
allowed a thread to wait until the target was complete. There was no reason 
that both threads 1 and 2 couldn't wait for thread 3 to complete. But in 
order to ensure that system resources could be reclaimed, ONE (and only 
one) thread would then have to "detach" thread 3.

In fact, nobody could ever construct a case where it made sense for both 
threads 1 and 2 to wait for thread 3 to complete. While there might be 
cases where thread 3 was doing something on which two threads depended, 
joining with it is never the right (or best) way to implement that 
dependency. (It really depends on some DATA or STATE, not on the completion 
of the thread, per se.) And it's silly to require two calls to "finalize" 
thread 3. Furthermore, one of the most common DCE thread programming errors 
was failure to detach after joining, resulting in a memory leak.

So it was decided that join should implicitly detach the target thread once 
it was complete. In fact, the detach operation, and all reference to it, 
was removed from the standard. (Over the objections of some of us who knew 
better.)

But there were two problems. There was no way to alter the initial thread 
of a process so that its resources would be automatically reclaimed on 
termination. And there was this nasty problem in join; where, if the join 
is cancelled, which implies it didn't complete, the target thread of the 
join would languish forever -- another memory leak about which the program 
could do nothing.

The solution, of course, was to restore the detach operation; even though 
it was now less useful (and more rarely needed) than before.

However, although the standard could now talk about detach, the description 
of join wasn't edited to make as clear as we might wish that, effectively, 
the join operation is "wait for termination and then detach". The standard 
says and implies in several places that one EITHER detaches OR joins, and 
only this discussion of the cancellation behavior of join suggests a 
connection or dependency.

> note that the standard also says:
> 
> "It has been suggested that a ''detach'' function is not
>  necessary; the detachstate thread creation attribute is
>  sufficient, since a thread need never be dynamically
>  detached. However, need arises in at least two cases:
> 
>  1. In a cancelation handler for a pthread_join () it is
>  nearly essential to have a pthread_detach() function in
>  order to detach the thread on which pthread_join() was
>  waiting. Without it, it would be necessary to have the
>  handler do another pthread_join()  to attempt to detach
>  the thread, which would both delay the cancelation
>  processing for an unbounded period and introduce a new
>  call to pthread_join(), which might itself need a
>  cancelation"
> 
> and
> 
> "The interaction between pthread_join() and cancelation
>  is well-defined for the following reasons:
> 
>  - The pthread_join() function, like all other
>    non-async-cancel-safe functions, can only be called
>    with deferred cancelability type.
> 
>  - Cancelation cannot occur in the disabled cancelability
>    state. Thus, only the default cancelability state need
>    be considered. As specified, either the pthread_join()
>    call is canceled, or it succeeds, but not both. The
>    difference is obvious to the application, since either
>    a cancelation handler is run or pthread_join () returns.
>    There are no race conditions since pthread_join() was
>    called in the deferred cancelability state."
> 
> so i am really puzzled by the restriction which does not
> allow me to detach the target thread. IMHO it should
> be declared that it is really safe (and even required)
> to _detach (or _join) the target thread "If the thread
> calling pthread_join() is canceled" !!
> 
> or am i missing something?

Yes, though you've guessed it below. The IMPLEMENTATION isn't allowed to 
detach the target thread when join is cancelled. But YOU can call 
pthread_detach() at any time, for any pthread_t value you hold. (You just 
can't join, or re-detach, a detached thread.)

> perhaps "shall not be detached" is meant with
> respect to implementations only; not with
> respect to applications which can request
> cancellation of joining thread(s)? (but then
> why implementation would ever want to "DETACH"
> when application have requested "JOIN" ?! i
> just do not see any connections between
> "detach" and "join" with respect to
> cancelation even from impl. point of view)

Again, the confusion is because "join" is really "wait for termination and 
detach". So the implementation MUST detach when the application requests 
"join"... except when the join operation is cancelled before the target 
thread terminates. Then YOU are responsible for detaching the target thread 
explicitly, in the cleanup handlers... or allow for the fact that it will 
continue to consume some process resources (including at least the thread 
stack, and possibly more).
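
A rough sketch of the cleanup-handler approach described in the rationale
(names are illustrative):

    #include <pthread.h>

    /* Sketch: if the join is cancelled, the cleanup handler detaches the
       target so its resources can still be reclaimed. */
    static void detach_target(void *arg)
    {
        pthread_detach(*(pthread_t *)arg);
    }

    void *join_with_cleanup(pthread_t target)
    {
        void *result = NULL;

        pthread_cleanup_push(detach_target, &target);
        pthread_join(target, &result);   /* cancellation point */
        pthread_cleanup_pop(0);          /* joined: target already detached */
        return result;
    }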

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
=================================TOP===============================
 Q388: Scheduling policies can have different ranges of priorities? 

Dale Stanbrough wrote:

> I read somewhere that different scheduling policies can have
> different ranges of priorities, and that symbolic values should
> be used to represent min/max priorities.
> 
> Is this correct, and if so where can i find the definition of
> these priorities?

Presuming that you're talking about POSIX threads, the only way to find the 
legal range of POSIX priority values for a given policy is by calling 
sched_get_priority_min() and sched_get_priority_max(). (Each takes a single 
int argument, the symbol for the policy, such as SCHED_FIFO, and returns an 
int priority value.) One minor complication is that these interfaces are 
under the _POSIX_PRIORITY_SCHEDULING option, and are not required for the 
_POSIX_THREADS option or the _POSIX_THREAD_PRIORITY_SCHEDULING options; 
thus, you could possibly find realtime threads scheduling support WITHOUT 
any supported/portable way to determine the priority ranges. (I actually 
tried to get that fixed, but nobody else seemed to care. In practice, 
they're probably right since you're unlikely to find "thread priority 
scheduling" without process "priority scheduling"; though it would have 
been nice to lose this inconsistency.)

On most systems, there are other ways to find out, either by symbols in the 
header files or documentation; but those methods aren't portable.
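
A minimal sketch of the portable query (SCHED_FIFO chosen arbitrarily):

    #include <sched.h>
    #include <stdio.h>

    /* Sketch: find the legal SCHED_FIFO priority range at run time rather
       than hard-coding numbers. */
    int main(void)
    {
        int lo = sched_get_priority_min(SCHED_FIFO);
        int hi = sched_get_priority_max(SCHED_FIFO);

        if (lo == -1 || hi == -1)
            perror("sched_get_priority_*");
        else
            printf("SCHED_FIFO priorities: %d .. %d\n", lo, hi);
        return 0;
    }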

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
=================================TOP===============================
 Q389: The entity life modeling approach to multi-threading. 


I certainly don't pretend to be anywhere close to Don Knuth, but my web site
www.coloradotech.edu/~bsanden has a lot of material on the entity life modeling
approach to multi-threading. It is what you're asking for since it is at the
design level and not concerned with the details of any particular kind of
threads. www.coloradotech.edu/~bsanden/elm.html is an overview. There are
examples of multi-threading at www.coloradotech.edu/~bsanden/rtexamples.html and
various papers at www.coloradotech.edu/~bsanden/pubs.html.

Bo Sanden
Colorado Tech. University


Michael Podolsky wrote:

> Hi
>
> Does anybody know really good books about multithreading?
> Not about using pthreads or win32 APIs, not for beginners.
> Smth. near the level of Knuth multivolume book or GOF Design Patterns.
> Something full of wisdom :-) that really teaches multithreading programming.
>
> Or maybe you know about good web sites about multithreading?
>
> Thanks,
> Michael
=================================TOP===============================
 Q390: Is there any (free) documentation? 
 
>> Buy Dr. Butenhof's book.
>Is there any ***FREE*** documentation?

a)
http://www.lambdacs.com/cpt/FAQ.htm
http://devresource.hp.com/devresource/Docs/TechPapers/PortThreads.html
http://sources.redhat.com/pthreads-win32/
http://www.humanfactor.com/pthreads/
http://www.cs.msstate.edu/~cr90/doc/sun/POSIXMultithreadProgrammingPrimer.pdf
http://twistedmatrix.com/users/jh.twistd/cpp/moin.cgi/ThreadingArticles

b) Do your own research in future.

c) Buy the damned book.

[Let me repeat, Buy the damned book, Buy the damned book -Bil]
=================================TOP===============================
 Q391: Grafting POSIX APIs on Linux is tough! 

>> Grafting POSIX APIs on top of a kernel that refuses any support for POSIX
>> behavior is tough.
> 
> This discussion attracted my attention. I'm in the process of
> implementing a two-level pthread system, and deferred cancellation has
> recently popped up on my to-do list. The technique of wrapping system
> calls with a check, while doable, is unpleasant for performance
> reasons. However, I'm at a loss as to what I can make the kernel do
> that will help.
> 
> One of my design goals has been to keep the kernel ignorant of the
> upper (application thread) layer, so the kernel can't just check a
> flag in a thread's control structure on entry to read(), for
> example. Also, since there's no fixed binding between application
> threads and kernel execution contexts, it's not reasonable to set a
> per-KE flag saying "hey, something here wants deferred cancellation" -
> in addition, the flag would have to be set or cleared at every thread
> context switch until the target was cancelled.

Sorry, but you want my opinion, here it is. Any attempt at 2-level 
scheduling where the user and kernel level schedulers do not have constant 
and high-bandwidth communication to share information is completely broken 
and absolutely useless.

The quest to keep the kernel ignorant of the user scheduler is going off 
completely in the wrong direction. You want more and tigher coordination, 
not less.

If you're not willing to take on that challenge, then stick with 
LinuxThread-style "1-1" kernel threads. It's simple, it works. There's no 
such thing as a "simple" 2-level scheduler, and anything that comes even 
close to working is not only extraordinarily complicated, but extends 
tendrils throughout the kernel and development environment (debugging, 
analysis tools, etc.). Solaris has backed off on 2-level scheduling with 
the alternate "bound thread" library in Solaris 8 because, in my experience 
and opinion, they lacked the will to increase the integration between 
kernel and user schedulers to the point where it could approach "correct".

> Digital Unix is the only system I'm aware of that has similar
> constraints. What does it do?

We have the best 2-level scheduler -- arguably the only REAL 2-level 
scheduler in production UNIX -- precisely because we have tightly coupled 
integration between user and kernel. We're adding more, because we still 
suffer from disconnects in scheduling behavior. (For example, we don't have 
any way to know whether a kernel thread is currently running, or the actual 
instantaneous load of a processor.) We have the best architecture for 
debugging and analyzing applications using 2-level threads, but we continue 
to find subtle places where it just doesn't work, and it's not uncommon for 
me to go off and spend a month or two hacking around to isolate, identify, 
and fix some subtle bug.

Like everyone else around here, I started out as a big fan of 2-level 
scheduling. Heck, it's a wonderful engineering challenge, and a lot of fun. 
But I have become greatly unconvinced that the quest for perfect 2-level 
scheduling can be worth the effort. I can write any number of programs that 
will "prove" it's better... but I can just as easily write an equal number 
of programs that will prove the opposite. Given that it doesn't benefit 
HPTC, and may hurt (but at least doesn't help) most I/O bound server type 
applications, it's hard to rationally justify the substantial investment. I 
think, in particular, that the notion of adding 2-level scheduling to Linux 
is counterproductive. (I'm trying to restrain myself and be polite, though 
I'm tempted to use stronger words.)

The basic motivation for some of this work appears to be the ability to run 
"wild" Java apps with bazillions of threads without having to increase the 
Linux process configuration limits. Cute, but this is much more a "fun 
engineering experiment" than a practical solution to the problem. For one 
thing, Java or not, apps with that many threads will spend most of their 
time tripping over their own heels. It's largely a symptom of the lack of 
select() or poll() in Java, and I've seen some indication that this has 
finally been addressed.

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
=================================TOP===============================
 Q392: Any companies  using pthread-win32? 

>Hello
>  I am trying to get a cross-platform thread package.
>
>  QThread released by TrollTech is a pretty stable implementation
>althought it does not provide enough interfaces to interact with the
>bottom layer pthread package or Win32 APIs but I am satisfied with its
>performance; besides, I can always pthread or WIN32 APIs to do what I
>want.  The only fly in the ointment is that my program has to be
>linked with the whole qt library even if I only use portion of it.
>
>  I am considering pthread-win32 port but I am not sure whether it is
>stable enought for commercial product; plus I am afraid a wrong decision
>will have group-wide effects.  Just wonder that is there any companies
>implementing their commercial product using pthread-win32?  How does
>pthread-win32 implement pthread_atfork()? 
>
>Thanks
>
>TCS

PocketTV (http://www.pockettv.com)

It is an MPEG Movie Player for Pocket PC devices (e.g.
iPaq, Cassiopeia, Jornada).

It is built on pthread-win32 (ported to Windows-CE).

Works flawlessly.

Thanks you all!

-t 
=================================TOP===============================
 Q393: Async-cancel safe function: guidelines? 

> 1.When a function claims to be async cancel safe, what are the
> guidelines by which it can be classified as async-cancel safe.

A function is async-cancel safe if asynchronous cancellation of such code
will not corrupt data or otherwise prevent continued execution of threads
within the process address space.

> Butenhof's book on pg 151 states that no function that acquires
> resources while aysnc cancel is enabled should be called.

I like to keep my sweeping statements relatively simple. Otherwise,
I'd've had to promise an infinite series of thread books delving into
minutia in awe-inspiring (and deadly boring) detail. (Come to think of
it, I've already written a lot of that "infinite series" in articles to
this newsgroup.)

All things are possible, including that some types of applications
(particularly in a monolithic embedded system environment) might be able
to be cleverly coded so that they could recover from asynchronous
cancellation of a thread that owned resources. YOU, however, cannot do
it; not on any general purpose UNIX system, anyway, because your
application depends on invariants and resources that you can't even see,
much less control. (For example, C stdio and malloc, C++ iostreams, POSIX
mutexes, etc.)

> But if I am implementing pthread_cancel which should be async cancel
> safe by POSIX, there are critical sections present in pthread_cancel
> implementation. But I should not use a mutex to lock, but some
> mechanism to disable a context switch. Is this correct??

Absolutely not. "Disabling context switch" is grotesque and almost always
improper.

There's absolutely no justification for making pthread_cancel()
async-cancel safe. It was stupid and pointless. An influential member or
two of the working group insisted, and some of us who knew better failed
to notice that it had been slipped in until it was too late to change.

One member of the group has since hypothesized (from vague memories) that
it was done so that the "master" of a compute-bound parallel loop team
(e.g., explicit OpenMP type parallelism) could decide to cancel the other
members of the group without needing to disable async cancelability.

Stupid idea. It's done, so it's going to need to disable async cancel
anyway. Furthermore, the only practical way to make pthread_cancel()
async-cancel safe (the only way to make any useful routine async-cancel
safe) is to DISABLE async cancellation on entry and restore the previous
setting on exit. That's what I did, and that's what I recommend you do.
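
A minimal sketch of that pattern (the function name is made up for illustration;
this is not code from the post):

    #include <pthread.h>

    /* Make a routine async-cancel safe the only practical way: force
     * deferred cancellation on entry and restore the caller's setting
     * on exit. */
    void my_async_cancel_safe_op(void)
    {
        int oldtype;

        pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, &oldtype);

        /* ... do the real work here, safe from asynchronous cancellation ... */

        pthread_setcanceltype(oldtype, NULL);
    }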

Actually, I would prefer to recommend that you ignore the entire issue.
Anyone who calls pthread_cancel() with async cancelability enabled is a
fool, and their code probably won't work anyway. In fact, async
cancellation is dead useless. It was an idea POSIX inherited from our CMA
architecture, which had gone through years of theoretical evolution but
no practical application. Async cancelability had seemed theoretically
useful, but in retrospect it is not of any particular use, and far more
trouble than it's worth. It shouldn't have been in the standard at all.
Even granting async cancellation, pthread_cancel() has no business being
async-cancel safe. Your implementation is quite unlikely to be 100%
POSIX conformant anyway, no matter what you do or test (even the UNIX 98
VSTH test suite, for which you'd have to pay The Open Group, isn't
perfect or complete), and this is such a trivial and stupid issue that
you should worry about it only if you're an "obsessive compulsive
completion freak" and you've got nothing else in your life to worry
about. ;-) ;-)

> 2.How can asynchronous cancellation be implemented? POSIX standard says
> that when cancelability type is PTHREAD_CANCEL_ASYNCHRONOUS, new or
> pending cancellation requests may be acted upon at "any" time. In
> letter not spirit of the standard, this can also be implemented as
> PTHREAD_CANCEL_DEFERRED, that is, only when it reaches a cancellation
> point. But in spirit, would it have to be checked at a timer
> interrupt/scheduler reschedule?

Yes, that's the idea of the "at any time" loophole. Some implementations
might choose to defer delivery until a timeslice interrupt or at least
the next clock tick. You could also do it using a special signal (though
it more or less has to be a special OS-reserved signal to avoid polluting
the standard application signal number space). (There's no precise
guarantee when a signal will be delivered, either.)

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/


>Hello,
>I am implementing a POSIX interface for a proprietary RTOS.
>The basic schedulable entity is a thread, and the OS is completely written
>in C++. I am extending this by adding the POSIX-relevant functionality and
>providing a C interface, by internally instantiating the thread class.
>In this context, I have a few clarifications.
>
>1.When I do a pthread_cancel, I have to invoke the cancellation handlers
>if they have not been explicitly called by a pthread_cleanup_pop(nonzero
>value), and at pthread_exit.

Yes. Normally, the pthread_cleanup_pop() would do whatever it takes
to remove the cleanup from consideration.

>I have a plan of handling this using exceptions. On an ASYNC cancel, an
>exception object is thrown, which would contain the pointers to the
>cleanup handler functions.

It's probably better to set up pthread_cleanup_push() and pthread_cleanup_pop()
as macros that open and close a statement block. In this statement block, they
can set up exception handling. E.g. if you are using purely C++ everywhere, and
do not plan to support C, then you could use C++ exception handling for these.

E.g. set up a class like this:

    class __pthread_cleaner {
        void (*__cleanup_func)(void *);  // handler registered by pthread_cleanup_push()
        void *__context;                 // argument to pass to the handler
        bool __do_execute;               // cleared by pthread_cleanup_pop(0)
    public:
        __pthread_cleaner(void (*__f)(void *), void *__c)
        : __cleanup_func(__f), __context(__c), __do_execute(true)
        {
        }
        ~__pthread_cleaner()
        {
            if (__do_execute)
                __cleanup_func(__context);
        }
        void __set_execute(bool state)
        {
            __do_execute = state;
        }
    };

Now:

    #define pthread_cleanup_push(F, C) {    \
        __pthread_cleaner __cleaner(F, C);

    #define pthread_cleanup_pop(E)      \
        __cleaner.__set_execute((E) != 0);  \
    }

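A short usage sketch for the macros above (the mutex and the worker function are
hypothetical, and it assumes the class and macros above are in scope): the cleanup
handler runs either when pthread_cleanup_pop() is reached with a nonzero argument,
or when a cancellation exception unwinds the block and destroys __cleaner.

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Plain C-style wrapper so the handler has the required signature. */
    static void unlock_mutex(void *m)
    {
        pthread_mutex_unlock((pthread_mutex_t *) m);
    }

    void worker(void)
    {
        pthread_mutex_lock(&lock);
        pthread_cleanup_push(unlock_mutex, &lock);

        /* ... cancellable work while holding the lock ... */

        pthread_cleanup_pop(1);   /* pops the handler and runs it, unlocking */
    }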

>At the exit point I have a catch clause for this exception object, which
>would then call the handler.
>Is this design proper, or are there any subtle issues which should be
>handled to preserve POSIX semantics?
>
>1.One issue I am aware of is that during the stack unwinding, only destructors
>of local objects would be invoked. Since the cancellation handlers are
>invoked at the end, if a handler makes a reference to some object
>which would have been destroyed, there would be a problem.

The right thing to do is to do everything in the proper nesting order.

>I am using a customized version of gcc.
>Is there any interface I have to add to gcc to handle this?
>
>2.The cleanup handler has to be invoked in user mode. As the exception is
>non-resumptive, when I am in the catch clause, which is placed in the
>kernel, the mode would have to be changed, else the handler would execute
>in kernel mode.

You need some operating system mechanism for passing the low level operating
system exception to the user space program. Then in the user space program you
need to handle this and invoke the user space mechanism.

E.g. over a POSIX-like kernel, operating system exceptions are represented as
signals. So a signal handler can catch such signals and turn them into, say, C++
exceptions in some platform-specific way. The notification delivery semantics
of POSIX cancellation are similar to signals, and are in fact designed to be
implementable over signals.

>On a general note(probably OT) are there any guidelines for handling
>stack unwinding, in a mixed C/C++ call stack scenario.

They are compiler specific. Using gcc, there are ways to insert exception
handling hooks into C compiled code, because gcc and g++ have a common back
end. For example, the Bounds Checking GCC (GCC patched to do pointer arithmetic
checking) inserts construction and destruction hooks into functions which call
into a special library for tracking the existence of objects and their
boundaries. Ironically, this patch is currently only for the C front end. :)
That patch does it by actually modifying the compiler to insert the code; I
don't know whether it's possible to hack up some macros in GNU C itself to hook
into exception handling, without mucking with the compiler. Research item!
=================================TOP===============================
 Q394: Some detailed discussion of implementations. 

> >> Nevertheless, in any implementation crafted by "well intentioned,
> >> fully informed, and competent" developers, it must be possible to
> >> use pthread_atfork() (with sufficient care) such that normal
> >> threaded operation may continue in the child. On any such
> >> implementation, the thread ID of the child will be the same as in
> >> the parent,
> >...
> >> This may not apply to Linuxthreads, if, (as I have always assumed,
> >> but never verified), the "thread ID" is really just the pid. At
> >> least, pthread_self() in the child would not be the same as in the
> >> parent.
> >
> >In LinuxThreads, the `thread ID' is not just the pid, at least in this
> >context.  pthread_self() actually yields a pointer to an opaque
>
> In LinuxThreads, pthread_t is typedef'ed as an unsigned long. But you are
> right; it is converted to a pointer internally. LinuxThreads is handle based;
> there is an array of small structs, referred to as handles, which contain
> pointers to the real thread descriptor objects. The thread ID is simply an
> index into this array.
>
> >> The thread library could and should be smart enough to fix
> >> up any recorded mutex (and read-write lock) ownership information,
> >
> >AFAICS, for the above reasons, such a fix-up is unnecessary in
> >LinuxThreads.
>
> Such a fix-up simply is necessary, but is unsupported in LinuxThreads. The
> child should reinitialize the locks rather than unlock them. This is stated in
> the glibc info documentation (but I wrote it, so I can't appeal to it as an
> additional authority beyond myself :).  That is to say, your child handler of
> pthread_atfork() should use pthread_mutex_init() to reset locks rather than
> pthread_mutex_unlock().  All thread-related objects should be treated this way:
> condition variables, semaphores, you name it.
>
> The child process could inherit objects such as mutexes in an arbitrary state.
> For example, a locked mutex could have a queue of waiting threads, and a
> condition variable can have a similar queue. An unlock operation would then try
> to pass the mutex to one of these threads, with strange results.
> That is not to mention that some objects have internal locks which are not
> taken care of across fork. E.g. some thread could be inserting itself on
> a condition variable queue while the fork() happens. The child will then
> inherit a condition variable with a locked internal lock, and a half-baked
> wait queue.
>
> POSIX basically says that multithreading in the child process is not useful,
> because only async safe functions may be used, so the way I see it,
> LinuxThreads is off the hook for this one. But it does provide *useful*
> multithreading behaviors across fork, above and beyond POSIX, provided you
> follow its recommendations. You *can* use the pthread functions in the
> child.

Yes, there is a "loophole" (a big one) in the standard. It retains the 1003.1-1990
wording insisting that one cannot call any function that is not async signal safe
in the child. This completely contradicts all of the intent and implications of
the pthread_atfork() mechanism, giving you a clear and undeniable excuse for
avoiding the issue of allowing threaded programs to "keep on threadin'" in the
child.

But then, you're not doing that, are you? You clearly intend to allow continued
threading, but in a manner that clearly and broadly violates several aspects of
the standard. The loophole doesn't apply here, Kaz! (You're not alone here,
though. Even someone in the working group made this mistake in writing the
pthread_atfork rationale [B.3.1.3] by suggesting that some libraries might use
only the CHILD handler to reinitialize resources. That would be OK only if one
could know that state of the resources in the parent allowed reinitialization,
implying there were no threads or the resources hadn't been used, in either of
which cases "reinitialization" might be legal, but would be useless and
redundant.)

POSIX has strict rules about mutex usage. One cannot reinitialize a mutex without
destroying it. One cannot destroy a mutex that's locked, or "referenced" (e.g., by
a condition wait, or by threads waiting to lock). One cannot unlock a mutex one
doesn't own. Even if one could, the fact that the mutex is locked implies that
application data is inconsistent; "reinitializing" the mutex without repairing all
of that data (even if it was legal and portable) would just mean that incorrect
data is now properly protected. Most applications cannot reasonably be expected to
analyze and repair their data under such (essentially asynchronous) interruption
of synchronized activities.

An implementor is faced with a choice here. One can go with the strict POSIX and
say that pthread_atfork() handlers are the "POSIX appendix" -- theoretically
promising, but in practice useless. One must provide the interface, but admit the
implementation does not allow you to make use of it. The loophole indeed allows
this, though it seems a bit cowardly to me. (On the other hand, "getting it
right", even inside the thread and C runtimes, is a bloody nightmare; and often
cowards live longer and happier lives than heroes. ;-) )

OR, one can implement the full (though poorly and incompletely stated) intent of
the working group. This means that one must be able to lock a mutex in the
PREPARE handler, and unlock it in both the PARENT and CHILD handler. That implies
that fork() must do whatever is necessary to make the mutex consistent and usable
in the child, including any steps required to "pass ownership" from the parent
thread to the child thread. (If the mutex was owned by any other thread in the
parent process, all bets are off.) It also means that any waiting threads, and any
recorded association with a condition variable wait, must be "cleaned up" (removed) by
the thread library. Similarly, any condition variable wait queues must be cleaned
up, because the condition variable cannot be reinitialized if there are waiters.
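
As a concrete sketch of that second choice (handler and mutex names are
illustrative, not from the post): the application locks in the PREPARE handler
and unlocks in both the PARENT and CHILD handlers, relying on the
implementation to do the ownership fix-up described above.

    #include <pthread.h>

    static pthread_mutex_t app_lock = PTHREAD_MUTEX_INITIALIZER;

    static void prepare(void) { pthread_mutex_lock(&app_lock); }
    static void parent(void)  { pthread_mutex_unlock(&app_lock); }
    static void child(void)   { pthread_mutex_unlock(&app_lock); }

    /* Call once, before any fork(). */
    void install_fork_handlers(void)
    {
        pthread_atfork(prepare, parent, child);
    }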

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/

=================================TOP===============================
 Q395: Cancelling a single thread in a signal handler? 


> Kaz Kylheku  wrote:
>
> : Again, like pthread_create, pthread_cancel is not an async-safe function that
> : can be called from a signal handler.
>
> Hmm, I've written code in a signal handler to cancel a single thread on
> shutdown. What's the possible side effects of doing this since a signal
> handler is not thread safe?

The consequences are undefined, unspecified, and unportable. It may work. It may
fail. It may cause a SIGSEGV. It may silently terminate the process with no core
file. It may crash the system. It may cause the programmer to break out in a rash.
It may cause the end of the universe, though that's perhaps slightly less likely.

You can't do it, so don't worry about what might happen if you do it anyway. The
actual answer will vary between platforms, between releases on the same platform,
and perhaps even from day to day due to subtle interactions you haven't
anticipated. Therefore, "I tried it and it worked" (which actually means "I tried
it and I didn't observe any overt evidence that it didn't work") isn't good
enough. Sometimes "pushing the limits" just doesn't make any sense. Dangerous
coding practices like this will often work fine during development, and even
during controlled testing, and will fail on the site of a critical customer on
their production system under heavy load, which they cannot allow you to take down
or closely analyze.

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/

=================================TOP===============================
 Q396: Trouble debugging under gdb on Linux. 


>I'm having a lot of trouble debugging a multi-threaded
>application (a CORBA app) under gdb on Linux (Mandrake 7.1). 
>When I set a breakpoint and a thread hits it, gdb shows me a
>different thread's stack, one that isn't anywhere near the
>breakpoint.

Try:
http://www.kernelnotes.org/lnxlists/linux-kernel/lk_0007_03/msg00881.html
and related messages.

Bye
Thomas 
=================================TOP===============================
 Q397: Global signal handler dispatching to threads. 


>    I am fairly new to threads. I have heard it mentioned that you can have
>a global signal handler which will dispatch signals to other threads in an
>app. I would like to know how this is achieved. I have tried adding a
>signal handler for each thread when it starts and they do not seem to
>receive the signals when I send them (via kill). I was thinking if you kept
>track of all the pids which got started you could then send them signals, but as I
>stated earlier the threads don't seem to receive the signals. If someone
>could explain this with a simple code snippet I would appreciate it.

First off, since you didn't say what specific pthreads implementation
you're using, I'll be answering for POSIX; there are signal weirdnesses
in most (all?  probably not) implementations because it's hard to get
right, and some believe that the POSIX spec itself did not get some
things right.

Anyhow:  Signals and threads are an ugly mix.  You've got the
asynchronicity of threads multiplied by the asynchronicity and complex
rules (though a lot of people don't seem to know them) of signal
handlers.  Most high-performance thread implementations start doing
ugly things if you call threading functions inside signal handlers.

In POSIX threads, the signal disposition (i.e. handler) table is a
process resource.  Thus it is not possible to have different actions
for the same signal in different threads without a lot of hard,
probably non-portable, work.

The signal *mask* on the other hand (blocked signals) is per-thread,
so that each thread may indicate what signals may be delivered in its
context.

The easiest way to deal with signals in multithreaded systems is to
avoid them if at all possible.  Barring that, have the initialization
code block all signals before creating any threads, and have one thread
dedicated to doing sigwait() for interesting signals.  This removes
the (obnoxious?) multiplicative asynchronicity.

Signal delivery in a POSIX environment is quite simple to describe,
but has lots of subtle consequences.  When a signal is generated 
asynchronously to a process (i.e. sent with kill(), sigqueue(), or
one of the POSIX.1b functions that can cause a signal to happen),
the following steps occur:

1.  If any thread is in a sigwait(), sigtimedwait() or sigwaitinfo(),
    that thread will synchronously "accept"[1] the signal.
2.  The system looks at the signal masks of all threads in the process,
    in unspecified (i.e. random) order, looking for a thread that has
    that signal unblocked.  The signal will be "delivered"[1] in the context
    of one of the threads that have the signal unblocked.
3.  If none of the threads are accepting the signal or have it unmasked,
    the signal will remain pending against the process, and the first
    thread that unmasks or accepts the signal will get it.

Now, for synchronous signals (i.e. those that are attributable to a
specific thread[2]) and signals sent via pthread_kill to a specific
thread, it's a slightly different story.  The signal is delivered
(or made pending) to that thread, and will only be deliverable (or
acceptable) from that thread.  Note also that blocking or SIG_IGNing
SIGFPE, SIGILL, SIGSEGV, or SIGBUS causes undefined behavior unless the
signal in question is generated by sigqueue(), kill(), or raise().

And finally, you asked about a global signal handler that dispatches to
other threads.  The way that one would implement that would be to have
a single thread sigwait()ing on everything, and when it gets one, does
some amount of magic to figure out what thread needs to be kicked, and
use some prearranged inter-thread mechanism (probably *not*
pthread_kill()!) to let the victim know.
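
A minimal sketch of that arrangement (the signal choices and the printf standing
in for the "prearranged mechanism" are illustrative only): block the interesting
signals before creating any threads, then let one thread sigwait() for them and
notify workers however the application prefers.

    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>

    static sigset_t interesting;

    static void *signal_dispatcher(void *arg)
    {
        (void) arg;
        for (;;) {
            int sig;
            if (sigwait(&interesting, &sig) != 0)
                continue;
            /* Set a flag, post a semaphore, or write to a pipe that the
             * worker threads are already watching -- not pthread_kill(). */
            printf("got signal %d, dispatching to workers\n", sig);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;

        sigemptyset(&interesting);
        sigaddset(&interesting, SIGINT);
        sigaddset(&interesting, SIGTERM);
        /* Block these in the initial thread BEFORE creating any threads;
         * the created threads inherit the mask. */
        pthread_sigmask(SIG_BLOCK, &interesting, NULL);

        pthread_create(&t, NULL, signal_dispatcher, NULL);
        pthread_join(t, NULL);
        return 0;
    }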


There, did that make your head hurt?



[1] These are POSIX-defined terms.  The phases of a signal's lifetime
    are reasonably called out in POSIX.

[2] Whether alarm() expiration is attributable to a specific thread is
    up for some debate.  POSIX gives illegal instructions and touching
    invalid memory as examples of signals attributable to a specific
    thread.
=================================TOP===============================
 Q398: Difference between the Posix and the Solaris Threads? 

> >Hi,
> >What is the difference between the Posix and the Solaris Threads?
>
> Solaris Threads is really called UI (Unix International) Threads.
> It predates POSIX threads. The API was developed by AT&T (maybe also
> Sun, I'm not sure). The header file (/usr/include/threads.h) could
> be found on many SVR4 UNIXes, although rather confusingly, the actual
> library (/usr/lib/libthread.so) was shipped with far fewer. I
> rather presumed from this that it was up to each vendor to actually
> implement their own libthread.so, and more significantly the kernel
> support underneath, but most SVR4 vendors didn't bother; Sun Solaris
> and Unixware [Novell at the time] were the only two that did AFAIK.

Sun put together a "real thread package" for the followon to their 4.x
series systems, which had contained a purely dreadful attempt at a
user-mode thread package (confusingly called "LWP", the same as the name
given to the "kernel entities" supported later). When they started working
with USL (the semi-autonomous arm of AT&T tasked with controlling System V
UNIX) and later UI (UNIX International, the even more semi-autonomous group
put together to try to separate yet more from AT&T control in answer to the
apparently perceived threat of OSF) on what became "Solaris 2" (the new
SVR4-based Solaris kernel, internally designated "5.0"), the new thread
package became part of System V, in an update designated SVR4.2MP. The
interface developed along with the early drafts of POSIX, which is why
there's so much similarity.

Whereas POSIX put all the "core" thread functions into <pthread.h>, UI
threads has several headers, including the (mis-)referenced <thread.h> and
<synch.h> for mutex and condition operations.

By the end of the process, the avowed goal of UI threads was to have a
short-term "standard" they could use while the long process of getting
industry-wide consensus on POSIX slogged onward. This was, in many ways, a
better strategy than OSF's, which was to implement an early draft (draft 4,
aka "DCE threads") of the POSIX spec and ship that. Having a different name
space avoided a lot of confusion, and made it easier to support both once
the standard came out. (That's a major reason why Solaris was the first
major UNIX OS to ship with POSIX support; most of the others had been using
OSF's "DCE threads", and had a lot more compatibility issues to resolve.)

> The two APIs are very similar - in some areas pretty much just different
> function call names. However, there are some underlying differences
> in semantics when you get into the details, such as the behaviour of
> fork() / fork1() and SIGALRM in the two models, so you can't simply
> convert one to the other with a few global exchanges.

UI threads insisted on the utterly silly, complicated, and ultimately
pointless "forkall" model, where your child process is transparently fired
up with a whole set of asynchronous cloned threads in unknown states, that
have no way of knowing they're running in a different address space. This
was intended to "relieve the programmer" of dealing with POSIX "atfork"
handlers, but (though atfork handlers are hardly a perfect solution, or
necessarily easy to use), "forkall" is far worse.

UI threads had "per-LWP" delivery of SIGALRM. Rather pointless, except
perhaps when a UI/POSIX thread is "bound" (POSIX system contention scope)
to the LWP. POSIX refused to accept this, as it makes for a more
complicated signal model at the program level. (We didn't like different
asynchronous signals behaving differently.) Solaris has been announcing for
quite a while now that the per-LWP timer signals are an obsolete feature
that will be going away (for UI threads as well as POSIX threads). As with
many other obsolete features, whether it actually ever WILL go away is
anyone's guess.

> Anyway, POSIX threads has now taken over; no one should be writing
> code using UI threads any more.

Well, if there is still some release, somewhere, that has UI threads but
not POSIX threads, and if you were writing code for that system that didn't
need to be portable, or needed to port only to other SVR4.2MP systems
(e.g., Solaris and/or UnixWare), you might have a convincing argument for
using UI threads. ;-)

Other than that, yeah; use POSIX threads.

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/

=================================TOP===============================
 Q399: Recursive mutexes are broken in Solaris? 


Do a google/deja search on this newsgroup for:
"solaris recursive mutex broken"
and you'll see a discussion of the details of this, I believe with
links to sun's patches for this problem.

Beware that even with this patch, you may not use recursive mutexes
in the pthread_cond_* calls at all in Solaris, even, I believe, in
Solaris 8, even if the lock count of the mutex never goes above 1 lock.


"Roger A. Faulkner"  wrote in message
news:[email protected]...

[snip]

> You are almost certainly a victim of this bug:
>
> Bugid: 4288299
> Synopsis: Recursive mutexes are not properly released
>
> This has been patched, at least in Solaris 8.
> Sorry, I don't know the patch-id.
>
> Roger Faulkner
> [email protected]
>
>

=================================TOP===============================
 Q400: pthreads and floating point attributes? 

> Does the pthreads standard have anything to say about whether threads
> inherit the floating point attributes of the main program?
>
> In particular, I am working on a system that implements the
> fesetflushtozero() function to change the underflow behavior of IEEE
> floating point from gradual to sudden. The man pages on this system
> say that this function has been adopted for inclusion in the C9X
> standard. What I need to know is whether the effect of calling this
> function in the main program should be inherited by threads that are
> subsequently created by the main program?  On the system that I am
> working on, this turned out not to be the case, and before reporting
> this as a bug, I need to know whether this constitutes a bug, from the
> standpoint of the pthreads standard, or whether this is just a quality
> of implementation issue.

POSIX 1003.1c-1995 (and UNIX 98) doesn't say anything about floating
point state, because that's not really a portable concept. (Some machines
might not even HAVE floating point. Hey, POSIX doesn't even insist that a
machine have a STACK.)

On an implementation for a machine that does have software-controllable
floating point state, it might make sense for the state to be inherited
by a created thread. On the other hand, from a quick reading of the C99
description of the floating-point environment functions, I can't really
infer what the designers of the C99 features might have wished when
they're extended into multiple threads. (By the way, I don't see any
mention in C99 of fesetflushtozero(), nor any way to control the behavior
of IEEE underflow except to enable or disable reporting of the exception.
I don't know whether this is a proprietary EXTENSION to ANSI C99 on your
platform, [which they might have hoped would be added to the standard],
or a preliminary feature in "C9X" that was removed before the final
version.)

I can't say whether the behavior you infer is "a bug", but it's
definitely NOT in violation of POSIX or UNIX 98. You don't say what
platform you're using, so it's not possible for anyone to say
authoritatively whether that platform is expected to behave the way you'd
like (which would make the behavior a bug), or whether it's supposed to
behave the way you observe (in which case it's a feature you don't happen
to find convenient).

We're in the process of finishing up the new versions of POSIX and UNIX.
They are now based on C99, but there's no specification of how (or if)
the C99 floating-point environment is inherited, by fork, exec, or
pthread_create. While both exec and fork have "catch all" phrases that
end up requiring the state to be inherited, there's no such statement in
pthread_create, which makes the behavior undefined. (And up to the whims,
or oversights, of the implementor.) As I said, I'm not really sure what
the designers of the C99 features might have wanted in each of these
cases. I think that what seems to make sense to me would be for fork and
pthread_create to inherit, but for exec to reset to the system default.
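
A small test sketch along those lines (it uses the standard C99 rounding-mode
calls as a stand-in for the platform-specific flush-to-zero control, and assumes
your platform defines FE_UPWARD): whether the created thread sees FE_UPWARD
tells you what your implementation chose to do about inheritance.

    #include <fenv.h>
    #include <pthread.h>
    #include <stdio.h>

    static void *report_rounding(void *arg)
    {
        (void) arg;
        /* If the floating-point environment is inherited, this prints the
         * same mode that main() set. */
        printf("child rounding mode: %d (FE_UPWARD is %d)\n",
               fegetround(), FE_UPWARD);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;

        fesetround(FE_UPWARD);    /* change the creating thread's FP environment */
        pthread_create(&t, NULL, report_rounding, NULL);
        pthread_join(t, NULL);
        return 0;
    }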

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/

=================================TOP===============================
 Q401: Must SIGSEGV be sent to the thread which generated the signal? 


> In comp.programming.threads Florian Weimer  wrote:
>
> : SIGSEGV is sent to the thread which generated the signal.  Does the
> : standard mandate this?
>
> The POSIX standard mandates that a thread that misbehaves and causes a
> SIGSEGV to be generated will have that signal delivered to the thread in
> question.

More generally, any signal caused by the direct and immediate action of a
thread must be delivered to the thread. These are "thread directed" signals,
and include SIGSEGV, SIGBUS, SIGFPE, anything sent by pthread_kill() or
raise(). Signals that are directed at the process, though, cannot be
restricted to a particular thread. This includes kill(), and the normal
"asynchronous" external signals such as SIGCHLD, SIGIO, SIGINT, etc.

> Most other signals will be delivered to a random thread that does not have
> the signal masked.

This is the crux of the problem with the Linux implementation. If the PROCESS
receives a signal, and it happens to come into a thread (Linux process) that
has the signal blocked, the PROCESS cannot receive the signal until the
THREAD unblocks it. This is bad.

POSIX requires that a PROCESS signal go to any thread that does not have the
signal blocked; and if all have it blocked, it pends against the PROCESS
until some thread unblocks it. (Linuxthreads can, to some extent, fake this
by never really blocking signals, and having the handlers echo them to the
manager thread for redirection to some child that doesn't have the signal
blocked... but that's complicated, error prone, [e.g., the signal will still
EINTR a blocking call, whereas a blocked signal wouldn't], and all that
echoing is relatively slow.)

Lots of people may disagree with the POSIX model. It was the hardest part of
the standard. We fought for years before Nawaf Bitar fearlessly led (and
cattle-prodded) the various camps into "the grand signal compromise" that
ended up forming the basis of the standard. Still, the major alternative to
what we have would have been a model based on "full per-thread signal state",
and that would have made a real mess out of job control because, to stop a
process, you'd need to nail each thread with a separate SIGSTOP...
problematic if it's dynamically creating threads while you work. (And, sure,
there are infinite shadings between the extremes; but that's where people
really started knocking each other over the head with chairs, and it just was
not a pretty scene.)

/------------------[ [email protected] ]------------------\
| Compaq Computer Corporation              POSIX Thread Architect |
|     My book: http://www.awl.com/cseng/titles/0-201-63392-2/     |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/

=================================TOP===============================
 Q402: Windows and C++: How? 


%     I am writing a program for windows 98 with visual c++.  I want to create
% a thread : what is the best solution : beginthread or createthread and what
% is the difference between them ?  When I want to use beginthread the
% compiler says it doesn't know this.

It's _beginthread. You should use _beginthread or _beginthreadex because
it's the C library interface to CreateThread. There's no guarantee that
C library functions will work correctly if you use CreateThread. It doesn't
really matter in what way they will fail to work correctly, you might as
well just use _beginthread.

% So I've used CreateThread.  I wanted to add my thread-routine to a class but
% this didn't work.  Has anyone a solution for this ?.

This is a quite commonly asked question, which to me suggests that
a lot of people ought to think more carefully about what they're doing.
The thread function has to have a particular prototype, and if you
try to use a function with a different prototype, chances are that it
won't work. C++ is particularly anal about ensuring that prototypes
match (sometimes to the point where they render the prototypes useless --
the other day, I had a C++ compiler reject a char * argument to a function
because the function expected const char *), so it won't compile if you
use a function with the wrong prototype, and class member functions
never qualify as having the right prototype.

% Because I couldn't add
% my thread routine to a class I declared it as a friend.  But I have to give

This is the right thing to do.

% an argument by reference (my class) to my thread-routine and I don't know

But you can't do this. You must pass a void pointer. You can make this a
pointer to an object and cast it on the call and in the function itself.
Oh, looking at your code, you're not doing a C++ pass by reference,
so what you're doing is OK.

% Cserver :: Cserver()
% {
%  HANDLE hThread;
%  DWORD dwThreadId, dwThrdParam = 1;
%  hThread = CreateThread(NULL , 0 , &initialisatie , this , 0 , &dwThreadId);

Strictly speaking, this is not legal. You can't start using the object
until the constructor returns.

% DWORD WINAPI initialisatie(void * LpParam )
% {
%  Cserver * temp = (Cserver *)(LpParam);
% }

And this isn't doing anything. You really ought to have a temp->something()
in here to actually use the class. If you post an example showing
the problem, it'll be possible to pick nits with it.


--

Patrick TJ McPhee


You should always use _beginthread or _beginthreadex.  (Notice the "_"
before the function name; that is why your compiler didn't 'know'
beginthread, it 'knows' _beginthread.)  Don't forget to include <process.h>.
They call CreateThread internally.  The difference is that _beginthread and
_beginthreadex perform cleanup of the standard C libraries on termination.
You cannot have the threaded function be a regular member of the class
because member functions are invisibly passed the this pointer as part of
the argument list.  You CAN make it a static member function and pass it the
this* as the argument to the thread.  If there is other data you need passed
to the thread, create a structure and pass a pointer to the structure (create
it using 'new' and 'delete' it in the thread after you extract the values
you need).  What I normally do to keep the code clean and easy to read is
create a static start function to be executed as the thread, then I call my
member function from there as the only instruction.
I use _beginthreadex because it's more like CreateThread and it allows for
more control under NT.
Like this:

class CThreadedClass;

typedef struct{
    int Number;
    int Value;
    CThreadedClass *pMe;
} MYDATA, *LPMYDATA;

class CThreadedClass
{
public:
    CThreadedClass() : m_nNum(0), m_hThreadHandle(0) {};
    ~CThreadedClass() { if(m_hThreadHandle) CloseHandle(m_hThreadHandle); }
    static unsigned long StartThread(void *pVoid);
    static unsigned long StartThread2(void *pVoid);
    unsigned long WorkerThread();
    unsigned long WorkerThread2(int x, int y);

    void Begin();

private:
    long m_nNum;
    HANDLE m_hThreadHandle;
    HANDLE m_hThreadHandle2;
};

void CThreadedClass::Begin()
{
    unsigned lnThreadID;
    m_hThreadHandle = (HANDLE) _beginthreadex(NULL, NULL, StartThread,
                                              (void *) this, NULL, &lnThreadID);

    // allocate memory for the struct dynamically:
    // if you just create it locally it will be destroyed when the function
    // exits and the thread has a pointer to invalid data
    LPMYDATA lpData = new MYDATA;

    lpData->pMe = this;
    lpData->Number = 20;
    lpData->Value = 6345;

    m_hThreadHandle2 = (HANDLE) _beginthreadex(NULL, NULL, StartThread2,
                                               (void *) lpData, NULL, &lnThreadID);
}

unsigned long CThreadedClass::StartThread(void *pVoid)
{
    CThreadedClass *lpMe = (CThreadedClass *) pVoid;

    return lpMe->WorkerThread();
}

unsigned long CThreadedClass::StartThread2(void *pVoid)
{
    LPMYDATA lpData = (LPMYDATA) pVoid;
    int a = lpData->Number;
    int b = lpData->Value;
    CThreadedClass *lpMe = lpData->pMe;

    //cleanup
    delete lpData;

    return lpMe->WorkerThread2(a, b);
}

unsigned long CThreadedClass::WorkerThread()
{
    for (int i = 0; i < 20000;  i++)
        m_nNum += i;
    return 0;
}
unsigned long CThreadedClass::WorkerThread2(int x, int y)
{
    for (int i = x; i < y;  i++)
        m_nNum += i;
    return 0;
}

One call to Begin() and your thread is off and running.  You can easily
access public and private members without referencing a pointer to your
object.  Much cleaner to debug and read (especially when going back to your
code in a few months).

Hope that helps,
Jim



Microsoft has bungled this in a bad way. If you use CreateThread, and also use
some functions within the C library, they will allocate thread-local storage
which is not destroyed when the thread terminates.  If you use _beginthreadex,
the C library will call CreateThread for you, and direct that thread to its own
thread startup function, which then calls your thread function. When your
thread function returns, it will pass control back to the internal startup
function, which will clean up the thread-local storage.  This is due to the
brain-damaged Win32 TLS design, which does not allow thread local storage keys
to be associated with destructor functions that are called when the thread
terminates.

What's even more braindamaged is that _beginthreadex returns the thread handle
cast to an integer. You can cast that back to a HANDLE.  Avoid _beginthread, it
closes the thread handle internally after making the thread, making it
impossible for you to wait on the handle to synchronize on the thread's
termination.

>So I've used CreateThread.  I wanted to add my thread-routine to a class but
>this didn't work.  Has anyone a solution for this ?

This question is asked about twice weekly. Search the newsgroup archives.  It's
probably in the FAQ by now.

=================================TOP===============================
 Q403: I have blocked all signals and don't get SEGV! 


>A thread starts executing code and encounters a bad pointer (non-null but
>rubbish anyway)
>
>NO SEGV or BUSERROR occurs, the thread simply stops executing.
>
>I have blocked all signals and work with a signal thread (with sigwait)
>
>Why doesn't the application crash when it accesses the bad pointer and
>rather stops executing the offending thread.

The application doesn't crash because you've blocked all signals.  I
suspect what's really happening (and this is an almost-bug in many
OSes) is that the trap gets taken, the kernel code sends a SIGSEGV
or SIGBUS to the intended target, the target has that signal blocked,
so it gets marked as pending, and then the target thread is resumed,
only to reexecute the instruction that causes the fault.  Lather,
rinse, repeat.  I suspect you'll see that thread getting a fair
amount of CPU time.

The reason your sigwait()ing thread is not seeing the signal is
that synchronous signals are *always* sent directly to the thread
that generates the fault.  Thus, if SIGSEGV was unblocked and
handled in the thread that caused the problem, the "normal" control
flow *of that thread* would be interrupted to execute the signal
handler.
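
One way to avoid that trap (a sketch, with the particular mask choices being my
own suggestion rather than anything from the post) is to block only the
asynchronous signals and leave the synchronous, thread-directed ones unblocked,
so a bad pointer still faults in the offending thread:

    #include <pthread.h>
    #include <signal.h>

    /* Call in the initial thread before creating any threads (children
     * inherit the mask); a dedicated sigwait() thread handles the rest. */
    static void block_async_signals_only(void)
    {
        sigset_t set;

        sigfillset(&set);
        sigdelset(&set, SIGSEGV);   /* keep synchronous signals deliverable */
        sigdelset(&set, SIGBUS);
        sigdelset(&set, SIGFPE);
        sigdelset(&set, SIGILL);
        pthread_sigmask(SIG_SETMASK, &set, NULL);
    }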

Some OSes have code that says "hey, wait, I just sent that very same
synchronous signal for that very same trap" and forcibly unmask
(and, usually SIG_DFL) the signal to get the offender out of the
process table.

It is possible to get caught on that code, but quite rare.
-- 
Steve Watt KD6GGD  PP-ASEL-IA          ICBM: 121W 56' 57.8" / 37N 20' 14.9"
=================================TOP===============================
 Q404: AsynchronousInterruptedException (AIE) and POSIX cancellation 


BTW. I noticed a discussion thread from a couple of months ago regarding
AsynchronousInterruptedException (AIE) and POSIX cancellation. FYI you can
not use pthread_cancel to implement the async interrupt. AIE is just an
exception (with some special propagation features) and can be caught. As
always with Java, interrupt implies cancellation of the work not the worker.
It *may* lead to the thread terminating but it doesn't have to. In contrast
POSIX cancellation is termination of the thread. You can still subvert a
POSIX thread by going off and doing arbitrary stuff in one of the
cancellation handlers, but it would be very difficult (I'd almost say
impossible) to map the exception based mechanism over that - you still have
to deal with finally blocks, lock releasing etc as the call stack unwinds.

Cheers,
David
=================================TOP===============================

What is RISC?

2001-07-29 08:00:00

This is an archive of a series of comp.arch USENET posts by John Mashey in the early to mid 90s, on the definition of reduced instruction set computer (RISC). Contrary to popular belief, RISC isn't about the number of instructions! This is archived here since, at least once a year, I see someone argue that RISC is obsolete or outdated because their understanding of RISC comes from the name, not from what RISC actually is. This is arguably a sign that RISC is very poorly named, but that's a separate topic.

PART I - ARCHITECTURE, IMPLEMENTATION, DIFFERENCES

WARNING: you may want to print this one to read it... (from preceding discussion):

Anyway, it is not a fair comparison. Not by a long stretch. Let's see how the Nth generation SPARC, MIPS, and 88K's do (assuming they last) compared to some new design from scratch.

Well, there is baggage and there is BAGGAGE. One must be careful to distinguish between ARCHITECTURE and IMPLEMENTATION:

a) Architectures persist longer than implementations, especially user-level Instruction-Set Architecture.
b) The first member of an architecture family is usually designed with the current implementation constraints in mind, and if you're lucky, software people had some input.
c) If you're really lucky, you anticipate 5-10 years of technology trends, and that modifies your idea of the ISA you commit to.
d) It's pretty hard to delete anything from an ISA, except where:
     1) You can find that NO ONE uses a feature (the 68020->68030 deletions mentioned by someone else).
     2) You believe that you can trap and emulate the feature "fast enough", e.g., microVAX support for decimal ops, 68040 support for transcendentals.

Now, one might claim that the i486 and 68040 are RISC implementations of CISC architectures ... and I think there is some truth to this, but I also think that it can confuse things badly:

Anyone who has studied the history of computer design knows that high-performance designs have used many of the same techniques for years, for all of the natural reasons, that is:

a) They use as much pipelining as they can, in some cases, if this means a high gate-count, then so be it.
b) They use caches (separate I & D if convenient).
c) They use hardware, not micro-code for the simpler operations.

(For instance, look at the evolution of the S/360 products. Recall that the 360/85 used caches, back around 1969, and within a few years, so did any mainframe or supermini.)

So, what difference is there among machines if similar implementation ideas are used?

A: there is a very specific set of characteristics shared by most machines labeled RISCs, most of which are not shared by most CISCs.

The RISC characteristics:

a) Are aimed at more performance from current compiler technology (e.g., enough registers).
OR
b) Are aimed at fast pipelining in a virtual-memory environment with the ability to still survive exceptions without inextricably increasing the number of gate delays (notice that I say gate delays, NOT just how many gates).

Even though various RISCs have made various decisions, most of them have been very careful to omit those things that CPU designers have found difficult and/or expensive to implement, and especially, things that are painful, for relatively little gain.

I would claim, that even as RISCs evolve, they may have certain baggage that they'd wish weren't there ... but not very much. In particular, there are a bunch of objective characteristics shared by RISC ARCHITECTURES that clearly distinguish them from CISC architectures.

I'll give a few examples, followed by the detailed analysis:

MOST RISCs:

3a) Have 1 size of instruction in an instruction stream
3b) And that size is 4 bytes
3c) Have a handful (1-4) of addressing modes (it is VERY hard to count these things; will discuss later).
3d) Have NO indirect addressing in any form (i.e., where you need one memory access to get the address of another operand in memory)
4a) Have NO operations that combine load/store with arithmetic, i.e., like add from memory, or add to memory. (note: this means especially avoiding operations that use the value of a load as input to an ALU operation, especially when that operation can cause an exception. Loads/stores with address modification can often be OK as they don't have some of the bad effects)
4b) Have no more than 1 memory-addressed operand per instruction
5a) Do NOT support arbitrary alignment of data for loads/stores
5b) Use an MMU for a data address no more than once per instruction
6a) Have >=5 bits per integer register specifier
6b) Have >= 4 bits per FP register specifier

These rules provide a rather distinct dividing line among architectures, and I think there are rather strong technical reasons for this, such that there is one more interesting attribute: almost every architecture whose first instance appeared on the market from 1986 onward obeys the rules above ... Note that I didn't say anything about counting the number of instructions...

So, here's a table:

C: number of years since first implementation sold in this family (or the first thing with which this is binary compatible). Note: this table was first done in 1991, so year = 1991-(age in table).
3a: # instruction sizes
3b: maximum instruction size in bytes
3c: number of distinct addressing modes for accessing data (not jumps). I didn't count register or literal, but only ones that referenced memory, and I counted different formats with different offset sizes separately. This was hard work...Also, even when a machine had different modes for register-relative and PC_relative addressing, I counted them only once.
3d: indirect addressing: 0: no, 1: yes
4a: load/store combined with arithmetic: 0: no, 1:yes
4b: maximum number of memory operands
5a: unaligned addressing of memory references allowed in load/store, without specific instructions
0: no never (MIPS, SPARC, etc)
1: sometimes (as in RS/6000)
2: just about any time
5b: maximum number of MMU uses for data operands in an instruction
6a: number of bits for integer register specifier
6b: number of bits for 64-bit or more FP register specifier, distinct from integer registers

Note that all of these are ARCHITECTURE issues, and it is usually quite difficult to either delete a feature (3a-5b) or increase the number of real registers (6a-6b) given an initial instruction set design. (yes, register renaming can help, but...)

Now: items 3a, 3b, and 3c are an indication of the decode complexity; items 3d-5b hint at the ease or difficulty of pipelining, especially in the presence of virtual-memory requirements and the need to go fast while still taking exceptions sanely; and items 6a and 6b are more related to the ability to take good advantage of current compilers.

There are some other attributes that can be useful, but I couldn't imagine how to create metrics for them without being very subjective; for example "degree of sequential decode", "number of writebacks that you might want to do in the middle of an instruction, but can't, because you have to wait to make sure you see all of the instruction before committing any state, because the last part might cause a page fault," or "irregularity/asymmetry of register use", or "irregularity/complexity of instruction formats". I'd love to use those, but just don't know how to measure them. Also, I'd be happy to hear corrections for some of these.

So, here's a table of 12 implementations of various architectures, one per architecture, with the attributes above. Just for fun, I'm going to leave the architectures coded at first, although I'll identify them later. I'm going to draw a line between H1 and L4 (obviously, the RISC-CISC Line), and also, at the head of each column, I'm going to put a rule, which, in that column, most of the RISCs obey. Any RISC that does not obey it is marked with a +; any CISC that DOES obey it is marked with a *. So...

1991
CPU        Age        3a 3b 3c 3d      4a 4b 5a 5b        6a 6b     # ODD
RULE        <6        =1 =4 <5 =0      =0 =1 <2 =1        >4 >3
-------------------------------------------------------------------------
A1        4         1  4  1  0         0  1  0  1         8  3+      1
B1        5         1  4  1  0         0  1  0  1         5  4       -
C1        2         1  4  2  0         0  1  0  1         5  4       -
D1        2         1  4  3  0         0  1  0  1         5  0+      1
E1        5         1  4 10+ 0         0  1  0  1         5  4       1
F1        5         2+ 4  1  0         0  1  0  1         4+ 3+      3
G1        1         1  4  4  0         0  1  1  1         5  5       -
H1        2         1  4  4  0         0  1  0  1         5  4       -   RISC
---------------------------------------------------------------
L4        26         4  8  2* 0*       1  2  2  4         4  2       2   CISC
M2        12        12 12 15  0*       1  2  2  4         3  3       1
N1        10        21 21 23  1        1  2  2  4         3  3       -
O3        11        11 22 44  1        1  2  2  8         4  3       -
P3        13        56 56 22  1        1  6  2 24         4  0       -

An interesting exercise is to analyze the ODD cases.

First, observe that of 12 architectures, in only 2 cases does an architecture have an attribute that puts it on the wrong side of the line. Of the RISCs:

A1 is slightly unusual in having more integer registers, and less FP than usual. [Actually, slightly out of date, 29050 is different, using integer register bank instead, I hear.]
D1 is unusual in sharing integer and FP registers (that's what the D1:6b == 0).
E1 seems odd in having a large number of address modes. I think most of this is an artifact of the way that I counted, as this architecture really only has a fundamentally small number of ways to create addresses, but has several different-sized offsets and combinations, but all within 1 4-byte instruction; I believe that its addressing mechanisms are fundamentally MUCH simpler than, for example, M2, or especially N1, O3, or P3, but the specific number doesn't capture it very well.
F1 .... is not sold any more.
H1 one might argue that this processor has 2 sizes of instructions, but I'd observe that at any point in the instruction stream, the instructions are either 4-bytes long, or 8-bytes long, with the setting done by a mode bit, i.e., not dynamically encoded in every instruction.

Of the processors called CISCs:

L4 happens to be one in which you can tell the length of the instruction from the first few bits, has a fairly regular instruction decode, has relatively few addressing modes, no indirect addressing. In fact, a big subset of its instructions are actually fairly RISC-like, although another subset is very CISCy.
M2 has a myriad of instruction formats, but fortunately avoided indirect addressing, and actually, MOST of instructions only have 1 address, except for a small set of string operations with 2. I.e., in this case, the decode complexity may be high, but most instructions cannot turn into multiple-memory-address-with-side-effects things.
N1,O3, and P3 are actually fairly clean, orthogonal architectures, in which most operations can consistently have operands in either memory or registers, and there are relatively few weirdnesses of special-cased uses of registers. Unfortunately, they also have indirect addressing, instruction formats whose very orthogonality almost guarantees sequential decoding, where it's hard to even know how long an instruction is until you parse each piece, and that may have side-effects where you'd like to do a register write-back early, but either: must wait until you see all of the instruction until you commit state or must have "undo" shadow-registers or must use instruction-continuation with fairly tricky exception handling to restore the state of the machine

It is also interesting to note that the original member of the family to which O3 belongs was rather simpler in some of the critical areas, with only 5 instruction sizes, of maximum size 10 bytes, and no indirect addressing, and requiring alignment (i.e., it was a much more RISC-like design, and it would be a fascinating speculation to know if that extra complexity was useful in practice).

Now, here's the table again, with the labels:

1991
CPU        Age        3a 3b 3c 3d     4a 4b 5a 5b        6a 6b  # ODD
RULE        <6        =1 =4 <5 =0     =0 =1 <2 =1        >4 >3
-------------------------------------------------------------------------
A1        4         1  4  1  0         0  1  0  1         8  3+  1        AMD 29K
B1        5         1  4  1  0         0  1  0  1         5  4   -        R2000
C1        2         1  4  2  0         0  1  0  1         5  4   -        SPARC
D1        2         1  4  3  0         0  1  0  1         5  0+  1        MC88000
E1        5         1  4 10+ 0         0  1  0  1         5  4   1        HP PA
F1        5         2+ 4  1  0         0  1  0  1         4+ 3+  3        IBM RT/PC
G1        1         1  4  4  0         0  1  1  1         5  5   -        IBM RS/6000
H1        2         1  4  4  0         0  1  0  1         5  4   -        Intel i860
---------------------------------------------------------------
L4        26         4  8  2* 0*       1  2  2  4         4  2   2        IBM 3090
M2        12        12 12 15  0*       1  2  2  4         3  3   1        Intel i486
N1        10        21 21 23  1        1  2  2  4         3  3   -        NSC 32016
O3        11        11 22 44  1        1  2  2  8         4  3   -        MC 68040
P3        13        56 56 22  1        1  6  2 24         4  0   -        VAX

General comment: this may sound weird, but in the long term, it might be easier to deal with a really complicated bunch of instruction formats, than with a complex set of addressing modes, because at least the former is more amenable to pre-decoding into a cache of decoded instructions that can be pipelined reasonably, whereas the pipeline on the latter can get very tricky (examples to follow). This can lead to the funny effect that a relatively "clean", orthogonal architecture may actually be harder to make run fast than one that is less clean. Obviously, every weirdness has its penalties.... But consider the fundamental difficulty of pipelining something like (on a VAX):

ADDL        @(R1)+,@(R1)+,@(R2)+

(something that, might theoretically arise from:

register int **r1, **r2;
**r2++ = **r1++ + **r1++;

Now, consider what the VAX has to do:

1) Decode the opcode (ADD)
2) Fetch first operand specifier from I-stream and work on it.
  a) Compute the memory address from (r1)
    If aligned
      run through MMU
        if MMU miss, fixup
      access cache
        if cache miss, do write-back/refill
    Elseif unaligned
      run through MMU for first part of data
        if MMU miss, fixup
      access cache for that part of data
        if cache miss, do write-back/refill
      run through MMU for second part of data
        if MMU miss, fixup
      access cache for second part of data
        if cache miss, do write-back/refill
    In either case, we now have a longword that holds the address of the actual data.
  b) Increment r1 [well, this is where you'd LIKE to do it, or in parallel with step 2a).] However, see later why not...
  c) Now, fetch the actual data from memory, using the address just obtained, doing everything in step 2a) again, yielding the actual data, which we need to stick in a temporary buffer, since it doesn't actually go in a register.
3) Now, decode the second operand specifier, which goes thru everything that we did in step 2, only again, and leaves the results in a second temporary buffer. Note that we'd like to be starting this before we get done with all of 2 (and I THINK the VAX9000 probably does that??) but you have to be careful to bypass/interlock on potential side-effects to registers .... actually, you may well have to keep shadow copies of every register that might get written in the instruction, since every operand can use auto-increment/decrement. You'd probably want badly to try to compute the address of the second argument and do the MMU access interleaved with the memory access of the first, although the ability of any operand to need 2-4 MMU accesses probably makes this tricky. [Recall that any MMU access may well cause a page fault....]
4) Now, do the add. [could cause exception]
5) Now, do the third specifier .... only, it might be a little different, depending on the nature of the cache; that is, you cannot modify cache or memory unless you know the instruction will complete. (Why? Well, suppose that the location you are storing into overlaps with one of the indirect-addressing words pointed to by r1 or 4(r1), and suppose that the store was unaligned, and suppose that the last byte of the store crossed a page boundary and caused a page fault, and that you'd already written the first 3 bytes. If you did this straightforwardly, and then tried to restart the instruction, it wouldn't do the same thing the second time.)
6) When you're sure all is well, and the store is on its way, then you can safely update the two registers, but you'd better wait until the end, or else, keep copies of any modified registers until you're sure it's safe. (I think both have been done ??)
7) You may say that this code is unlikely, but it is legal, so the CPU must do it. This style has the following effects:
  a) You have to worry about unlikely cases.
  b) You'd like to do the work, with predictable uses of functional units, but instead, they can make unpredictable demands.
  c) You'd like to minimize the amount of buffering and state, but it costs you in both to go fast.
  d) Simple pipelining is very, very tough: for example, it is pretty hard to do much about the next instruction following the ADDL (except some early decode, perhaps), without a lot of gates for special-casing. (I've always been amazed that CVAX chips are as fast as they are, and VAX 9000s are REALLY impressive...)
  e) EVERY memory operand can potentially cause 4 MMU uses, and hence 4 MMU faults that might actually be page faults...
  f) AND there are even worse cases, like the addp6 instruction, that can require 40 pages to be resident to complete...
8) Consider how "lazy" RISC designers can be:
  a) Every load/store uses exactly 1 MMU access.
  b) The compilers are often free to re-arrange the order, even across what would have been the next instruction on a CISC. This gets rid of some stalls that the CISC may be stuck with (especially memory accesses).
  c) The alignment requirement avoids, in particular, the problem of sending the first part of a store on its way before you're SURE that the second part of it is safe to do.
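
To make the contrast concrete, here is a rough sketch (the function name and temporaries are illustrative, and C is standing in for the instruction sequence a compiler for a load/store machine might emit) of the same ADDL example from above, decomposed RISC-style:

#include <stdint.h>

/* **r2++ = **r1++ + **r1++, broken into simple steps: each statement
   touches memory at most once, so each uses the MMU at most once, can
   take at most one fault, and can be restarted on its own; the register
   updates are separate instructions with no hidden side effects. */
void addl_load_store_style(int32_t ***r1_reg, int32_t ***r2_reg)
{
    int32_t **r1 = *r1_reg, **r2 = *r2_reg;

    int32_t *src1 = *r1;     /* load:  address of first operand      */
    r1 = r1 + 1;             /* add:   registers only                */
    int32_t a     = *src1;   /* load:  first operand                 */
    int32_t *src2 = *r1;     /* load:  address of second operand     */
    r1 = r1 + 1;             /* add:   registers only                */
    int32_t b     = *src2;   /* load:  second operand                */
    int32_t sum   = a + b;   /* add:   registers only                */
    int32_t *dst  = *r2;     /* load:  destination address           */
    r2 = r2 + 1;             /* add:   registers only                */
    *dst = sum;              /* store: the only write to memory      */

    *r1_reg = r1;            /* commit the updated "registers"       */
    *r2_reg = r2;
}

A compiler is free to reorder these simple operations around neighboring instructions, and none of them needs shadow registers or instruction continuation to survive a page fault.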

Finally, to be fair, let me add the two cases that I knew of that were more on the borderline: i960 and Clipper:

CPU      Age       3a 3b 3c 3d     4a 4b 5a 5b     6a 6b    # ODD
RULE     <6        =1 =4 <5 =0     =0 =1 <2 =1     >4 >3
-------------------------------------------------------------------------
J1        5         4+ 8+ 9+ 0      0  1  0  2      4+ 3+    5    Clipper
K1        3         2+ 8+ 9+ 0      0  1  2+ -      5  3+    5    Intel 960KB

(I think an ARM would be in this area as well; I think somebody once sent me an ARM-entry, but I can't find it again; sorry.)

Note: slight modification (I'll integrate this sometime):

From [email protected]  Mon Nov 29 12:59:55 1993
Subject: Re: Why are Motorola's slower than Intel's ? [really what's a RISC]
Newsgroups: comp.arch
Organization: Massachusetts Institute of Technology

Since you made your table IBM has released a couple chips that support unaligned accesses in hardware even across cache line boundaries and may store part of an unaligned object before taking a page fault on the second half, if the object crosses a page boundary.

These are the RSC (single chip POWER) and PPC 601 (based on RSC core).
John Carr ([email protected])

(Back to me; jfc's comments are right; if I had time, I'd add another line to do PPC ... which, in some sense, replays the S/360 -> S/370 history of relaxing alignment restrictions somewhat. I conjecture that at least some of this was done to help Apple s/w migration.)

SUMMARY:

1) RISCs share certain architectural characteristics, although there are differences, and some of those differences matter a lot.
2) However, the RISCs, as a group, are much more alike than the CISCs as a group.
3) At least some of these architectural characteristics have fairly serious consequences on the pipelinability of the ISA, especially in a virtual-memory, cached environment.
4) Counting instructions turns out to be fairly irrelevant:
  a) It's HARD to actually count instructions in a meaningful way... (if you disagree, I'll claim that the VAX is RISCier than any RISC, at least for part of its instruction set :-)
  Why: VAX has a MOV opcode, whereas RISCs usually have a whole set of opcodes for {LOAD/STORE} {BYTE, HALF, WORD}
  b) More instructions aren't what REALLY hurts you, anywhere near as much as features that are hard to pipeline:
  c) RISCs can perfectly well have string-support, or decimal arithmetic support, or graphics transforms ... or lots of strange register-register transforms, and it won't cause problems ..... but compare that with the consequence of adding a single instruction that has 2-3 memory operands, each of which can go indirect, with auto-increments, and unaligned data...

PART II - ADDRESSING MODES

I promised to repost this with fixes, and people have been asking for it, so here it is again: if you saw it before, all that's really different is some fixes in the table, and a few clarified explanations:

THE GIANT ADDRESSING MODE TABLE (Corrections happily accepted)

This table goes with the higher-level table of general architecture characteristics.

Address mode summary
r        register
r+        autoincrement (post)        [by size of data object]
-r        autodecrement (pre)        [by size,...and this was the one I meant]
>r        modify base register        [generally, effective address -> base]

NOTE: sometimes this subsumes r+, -r, etc, and is more general, so I categorize it as a separate case.

d        displacement                d1 & d2 if 2 different displacements
x        index register
s        scaled index
a        absolute        [as a separate mode, as opposed to displacement+(0)]
I        Indirect

Shown below are 22 distinct addressing modes [you can argue whether these are right categories]. In the table are the number of different encodings/variations [and this is a little fuzzy; you can especially argue about the 4 in the HP PA column, I'm not even sure that's right]. For example, I counted as different variants on a mode the case where the structure was the same, but there were different-sized displacements that had to be decoded. Note that meaningfully counting addressing modes is at least as bad as meaningfully counting opcodes; I did the best I could, and I spent a lot of hours looking at manuals for the chips I hadn't programmed much, and in some cases, even after hours, it was hard for me to figure out meaningful numbers... Most of these architectures are used in general-purpose systems and most have at least one version that uses caches: those are important because many of the issues in thinking about addressing modes come from their interactions with MMUs and caches...

        1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20  21  22
                                                                     r   r
                                                           r  r  r   +d1 +d1
                    r  r  r |              |   r  r |   r  r+ +d +d1 I   +s
           r  r  r  +d +x +s|         s+ s+|s+ +d +d|r+ +d I  I  I   +s  I  
        r  +d +x +s >r >r >r|r+ -r a  a  r+|-r +x +s|I  I  +s +s +d2 +d2 +d2
        -- -- -- -- -- -- --|-- -- -- -- --|-- -- --|-- -- -- -- --- --- ---
AMD 29K  1                  |              |        |   
Rxxx        1               |              |        |   
SPARC       1  1            |              |        |   
88K         1  1  1         |              |        |   
HP PA       2  1  1  4  1  1|              |        |   
ROMP     1  2               |              |        |    
POWER       1  1     1  1   |              |        |    
i860        1  1     1  1   |              |        |    
Swrdfish 1  1  1            |       1      |        |    
ARM      2  2     2  1     1| 1  1
Clipper  1  3  1            | 1  1  2      |        |    
i960KB   1  1  1  1         |       2  2   |    1   |    

S/360       1               |                   1   |    
i486     1  3  1  1         | 1  1  2      |    2  3|   
NSC32K      3               | 1  1  3  3   |       3|             9       
MC68000  1  1               | 1  1  2      |    2   | 
MC68020  1  1               | 1  1  2      |    2  4|                 16  16
VAX      1  3     1         | 1  1  1  1  1| 1     3| 1  3  1  3

COLUMN NOTES

1) Columns 1-7 are addressing modes used by many machines, but very few, if any, clearly-RISC architectures use anything else. They are all characterized by what they don't have: 2 adds needed before generating the address, indirect addressing, and variable-sized decoding.
2) Columns 13-15 include fairly simple-looking addressing modes, which however may require 2 back-to-back adds before the address is available. ["may" because some of them use index-register=0 or something to avoid indexing, and usually in such machines, you'll see variable timing figures, depending on use of indexing.]
3) Columns 16-22 use indirect addressing.

ROW NOTES

1) Clipper & i960, of current chips, are more on the RISC-CISC border, or are sort of "modern CISCs". ARM is also characterized (by ARM people, Hot Chips IV: "ARM is not a "pure RISC".
2) ROMP has a number of characteristics different from the rest of the RISCs, you might call it "early RISC", and it is of course no longer made.
3) You might consider HP PA a little odd, as it appears to have more addressing modes, in the same way that CISCs do, but I don't think this is the case: it's an issue of whether you call something several modes or one mode with a modifier, just as there is trouble counting opcodes (with & without modifiers). From my view, neither PA nor POWER has truly "CISCy" addressing modes.
4) Notice difference between 68000 and 68020 (and later 68Ks): a bunch of incredibly-general & complex modes got added...
5) Note that the addressing on the S/360 is actually pretty simple, mostly base+displacement, although RX-addressing does take 2 regs+offset.
6) A dimension not shown on this particular chart, but also highly relevant, is that this chart shows the different types of modes, not how many addresses can be found in each instruction. That may be worth noting also:

AMD - i960        1        one address per instruction
S/360 - MC68020   2        up to 2 addresses
VAX               6        up to 6

By looking at alignment, indirect addressing, and looking only at those chips that have MMUs, consider the number of times an MMU might be used per instruction for data address translations:

AMD - Clipper     2        [Swordfish & i960KB: no TLB]
S/360 - NSC32K    4
MC68Ks (all)      8
VAX              24

When RS/6000 does unaligned, it must be in the same cache line (and thus also in same MMU page), and traps to software otherwise, thus avoiding numerous ugly cases.

Note: in some sense, S/360s & VAXen can use an arbitrary number of translations per instruction, with MOVE CHARACTER LONG, or similar operations & I don't count them as more, because they're defined to be interruptable/restartable, saving state in general-purpose registers, rather than hidden internal state.

SUMMARY

1) Computer design styles mostly changed from machines with:
    a. 2-6 addresses per instruction, with variable-sized encoding; address specifiers were usually "orthogonal", so that any could go anywhere in an instruction
    b. sometimes indirect addressing
    c. sometimes need 2 adds before effective address is available
    d. sometimes with many potential MMU accesses (and possible exceptions) per instruction, often buried in the middle of the instruction, and often after you'd normally want to commit state because of auto-increment or other side effects.
to machines with:
    a. 1 address per instruction
    b. address specifiers encoded in small # of bits in 32-bit instruction
    c. no indirect addressing
    d. never need 2 adds before address available
    e. use MMU once per data access

and we usually call the latter group RISCs. I say "changed" because if you put this table together with the earlier one, which has the age in years, the older ones were one way, and the newer ones are different.

2) Now, ignoring any other features, but looking at this single attribute (architectural addressing features and implementation effects thereof), it ought to be clear that the machines in the first part of the table are doing something technically different from those in the second part of the table. Thus, people may sometimes call something RISC that isn't, for marketing reasons, but the people calling the first batch RISC really did have some serious technical issues at heart.

3) One more time: this is not to say that RISC is better than CISC, or that the few in the middle are bad, or anything like that ... but that there are clear technical characteristics...

PART III - MORE ON TERMINOLOGY; WOULD YOU CALL THE CDC 6600 A RISC?

Article: 39495 of comp.arch
Newsgroups: comp.arch
From: [email protected] (John R. Mashey)
Subject: Re: Why CISC is bad (was P6 and Beyond)
Organization: Silicon Graphics, Inc.
Date: Wed, 6 Apr 94 18:35:01 PDT

In article <[email protected]>, [email protected] (Andrea Chen) writes:

You may be correct on the creation of the term, but RISC does refer to a school of computer design that dates back to the early seventies.

This is all getting fairly fuzzy and subjective, but it seems very confusing to label RISC as a school of thought that dates back to the early 1970s.

1) One can say that RISC is a school of thought that got popular in the early-to-mid 80's, and got widespread commercial use then.
2) One can say that there were a few people (like John Cocke & co at IBM) who were doing RISC-style research projects in the mid-70s.
3) But if you want to go back, as has been discussed in this newsgroup often, a lot of people go back to the CDC 6600, whose design started in 1960, and was delivered in 4Q 1964. Now, while this wouldn't exactly fit the parameters of current RISCs, a great deal of the RISC-style approach was there in the central processor ISA:
    a) Load/store architecture.
    b) 3-address register-register instructions
    c) Simply-decoded instruction set
    d) Early use of instruction scheduling by the compiler, with the expectation that you'd usually program in a high-level language and not often resort to assembler, as you'd expect the compiler to do well.
    e) More registers than common at the time
    f) ISA designed to make decode/issue easy

Note that the 360/91 (1967) offered a good example of building a CISC-architecture into a high-performance machine, and was an interesting comparison to the 6600.

4) Maybe there is some way to claim that RISC goes back to the 1950s, but in general, most machines of the 1950s and 1960s don't feel very RISCy (to me). Consider Burroughs B5000s; IBM 709x, 707x, 1401s; Univac 110x; GE 6xx, etc, and of course, S/360s. Simple load/store architectures were hard to find; there were often exciting instruction decodings required; indirect addressing was popular; machines often had very few accumulators.

5) If you want to try sticking this in the matrix I've published before, as best as I recall, the 6600 ISA generally looked like:

CPU      3a 3b 3c 3d    4a 4b 5a 5b    6a 6b    # ODD
RULE     =1 =4 <5 =0    =0 =1 <2 =1    >4 >3
------------------------------------------------------
CDC 6600  2  *  1  0     0  1  0  1     3  3    4 (but  ~1 if fair)

That is:

2: it has 2 instruction sizes (not 1), 15 & 30 bits (however, they were packed into 60-bit words, so if you had 15, 30, 30, the second 30-bitter would not cross word boundaries, but would start in the second word.)
*: 15-and-30 bit instructions, not 32-bit.
1: 1 addressing mode [Note: Tim McCaffrey emailed me that one might consider there to be more, i.e., you could set an address register to combinations of the others to give autoincrement/decrement/index+offset, etc.] In any case, you compute an address as a simple combination of 1-2 registers, and then use the address, without further side-effects.
0: no indirect addressing
1: have one memory operand per instruction
0: do NOT support arbitrary alignment of operands in memory (well, it was a word-addressed machine :-)
1: use an MMU for data translation no more than once per instruction (MMU used loosely here)
3,3: had 3-bit fields for addressing registers, both index and FP

Now, of the 10 ISA attributes I'd proposed for identifying typical RISCs, the CDC 6600 obeys 6. It varies in having 2 instruction formats, and in having only 3 bits for register fields, but it had simple packing of the instructions into fixed-size words, and registers/accumulators were pretty expensive in those days (some popular machines only had one accumulator and a few index registers, so 8 of each was a lot). Put another way: it had about as many registers as you'd conveniently build in a high-speed machine, and while they packed 2-4 operations into a 60-bit word, the decode was pretty straightforward. Anyway, given the caveats, I'd claim that the 6600 would fit much better in the RISC part of the original table...

PART IV - RISC, VLIW, STACKS

Article: 43173 of comp.arch
Newsgroups: comp.sys.amiga.advocacy,comp.arch
From: [email protected] (John R. Mashey)
Subject: Re: PG: RISC vs. CISC was: Re: MARC N. BARR
Date: Thu, 15 Sep 94 18:33:14 PDT

In article <[email protected]>, [email protected] writes:

Really? The Venerable John Mashey's table appears to contain as many exceptions to the rule about number of GP registers as most others. I'm sure if one were to look at the various less conventional processors, there would be some clearly RISC processors that didn't have a load-store architecture - stack and VLIW processors spring to mind.

I'm not sure I understand the point. One can believe any of several things:
  a) One can believe RISC is some marketing term without technical meaning whatsoever. OR
  b) One can believe that RISC is some collection of implementation ideas. This is the most common confusion.
  c) One can believe that RISC has some ISA meaning (such as RISC == small number of opcodes) ... but have a different idea of RISC than do most chip architects who build them. If you want to pay words extra money every Friday to mean something different than what they mean to practitioners ... then you are free to do so, but you will have difficulty communicating with practitioners if you do so.
  EX: I'm not sure how stack architectures are "clearly RISC" (?) Maybe CRISP, sort of. Burroughs B5000 or Tandem's original ISA: if those are defined as RISC, the term has been rendered meaningless.
  EX: VLIWs: I don't know any reason why I'd call VLIWs, in general, either clearly RISC or clearly not. VLIW is a technique for issuing instructions to more functional units than you have the die space/cycle time to decode more dynamically. There gets to be a fuzzy line between:
    i. A VLIW, especially if it compresses instructions in memory, then expands them out when brought into the cache.
    ii. A superscalar RISC, which does some predecoding on the way from memory->cache, adding "hint" bits or rearranging what it keeps there, speeding up cache->decode->issue.
  At least some VLIWs are load/store architectures, and the operations they do usually look like typical RISC operations. OR, you can believe that:

  d) RISC is a term used to characterize a class of relatively-similar ISAs mostly developed in the 1980s. Thus, if a knowledgeable person looks at ISAs, they will tend to cluster various ISAs as:
    1) Obvious RISC, fits the typical rules with few exceptions.
    2) Obviously not-RISC, fits the inverse of the RISC rules with relatively few exceptions. Sometimes people call this CISC ... but whereas RISCs, as a group, have relatively similar ISAs, the CISC label is sometimes applied to a widely varying set of ISAs.
    3) Hybrid / in-the-middle cases, that either look like CISCy RISCs, or RISCy CISCs. There are a few of these.
  Cases 1-3 are appropriate for reasonably contemporaneous processors, and make some sense. And then there's 4):
  4) CPUs for which RISC/CISC is probably not a very relevant classification. I.e., one can apply the set of rules I've suggested, and get an exception-count, but it may not mean much in practice, especially when applied to older CPUs created with vastly different constraints than current ones, or embedded processors, or specialized ones. Sometimes an older CPU might have been designed with some similar philosophies (i.e., like CDC 6600 & RISC, sort of) whether or not it happened to fit the rules. Sometimes, die-space constraints may have led to "simple" chips, without making them fit the suggested criteria either. Personally, torturous arguments about whether a 6502, or a PDP-8, or a 360/44 or an XDS Sigma 7, etc, are RISC or CISC ... do not usually lead to great insight. After a while such arguments are counting angels dancing on pinheads ("Ahh, only 10 angels, must be RISC" :-).

In this belief space, one tends to follow Hennessy & Patterson's comment in E.9 that "In the history of computing, there has never been such widespread agreement on computer architecture." None of this is pejorative of earlier architectures, just the observation that the ISAs newly-developed in the 1980s were far more similar than the earlier groups of ISAs. [I recall a 2-year period in which I used IBM 1401, IBM 7074, IBM 7090, Univac 1108, and S/360, of which only the 7090 and 1108 bore even the remotest resemblance to each other, i.e., at least they both had 36-bit words.]

Summary: RISC is a label most commonly used for a set of ISA characteristics chosen to ease the use of aggressive implementation techniques found in high-performance processors (regardless of RISC, CISC, or irrelevant). This is a convenient shorthand, but that's all, although it probably makes sense to use the term the way it's usually meant by people who do chips for a living.

Risk over time

2000-04-01 08:00:00

This is archived from John Norstad's now defunct norstad.org.

Table of Contents

Introduction
The Fallacy of Time Diversification
The Utility Theory Argument
Probability of Shortfall
The Option Pricing Theory Argument
Human Capital
Conclusion
Appendix - A Better Bar Chart Showing Risk Over Time


Introduction

In an otherwise innocent conversation on the Vanguard Diehards forum on the Morningstar web site, I questioned the popular opinion that the risk of investing in volatile assets like stocks decreases as one's time horizon increases. I mentioned that several highly respected Financial Economists believe this opinion to be simply wrong, or at least highly suspect, and that after much study I have come to agree with them. Taylor Larimore asked me to explain. Hence this note.

The following sections are largely independent. Each one presents a single argument. All but the last section present arguments against the popular opinion. The last section presents what I think is the only valid argument supporting the popular opinion.

Past experience tells me that I am unlikely to win any converts with this missive. It's possible, however, that the ideas presented here will make someone think a bit harder about his unexamined assumptions, and will in some small way make him a wiser person. I hope so. In any case, I have wanted to write up my thoughts on this problem for some time now, and I thank Taylor for giving me the little nudge I needed to actually sit down and do it.

If there is one thing I would like people to learn from this paper, it is to disabuse them of the popular notion that stock investing over long periods of time is safe because good and bad returns will somehow "even out over time." Not only is this common opinion false, it is dangerous. There is real risk in stock investing, even over long time horizons. This risk is not necessarily bad, because it is accompanied by the potential for great rewards, but we cannot and should not ignore the risk.

This paper led to a lively debate on Morningstar where the nice Diehards and I exchanged a long sequence of messages on conversations 5266 and 5374. If you find this paper interesting, you might also want to check out those conversations.

The Fallacy of Time Diversification

Portfolio theory teaches that we can decrease the uncertainty of a portfolio without sacrificing expected return by diversifying over a wide range of assets and asset classes. Some people think that this principle can also be used in the time dimension. They argue that if you invest for a long enough time, good and bad returns tend to "even out" or "cancel each other out," and hence time diversifies a portfolio in much the same way that investing in multiple assets and asset classes diversifies a portfolio.

For example, one often hears advice like the following: "At your young age, you have enough time to recover from any dips in the market, so you can safely ignore bonds and go with an all stock retirement portfolio." This kind of statement makes the implicit assumption that given enough time good returns will cancel out any possible bad returns. This is nothing more than a popular version of the supposed "principle" of time diversification. It is usually accepted without question as an obvious fact, made true simply because it is repeated so often, a kind of mean reversion with a vengeance.

In the investing literature, the argument for this principle is often made by observing that as the time horizon increases, the standard deviation of the annualized return decreases. I most frequently see this illustrated as a bar chart displaying a decreasing range of historical minimum to maximum annualized returns over increasing time periods. Some of these charts are so convincing that one is left with the impression that over a very long time horizon investing is a sure thing. After all, look at how tiny those 30 and 40 year bars are on the chart, and how close the minimum and maximum annualized returns are to the average. Give me enough time for all those ups and downs in the market to even out and I can't lose!

While the basic argument that the standard deviations of the annualized returns decrease as the time horizon increases is true, it is also misleading, and it fatally misses the point, because for an investor concerned with the value of his portfolio at the end of a period of time, it is the total return that matters, not the annualized return. Because of the effects of compounding, the standard deviation of the total return actually increases with time horizon. Thus, if we use the traditional measure of uncertainty as the standard deviation of return over the time period in question, uncertainty increases with time.

(The incurious can safely skip the math in this paragraph.) To be precise, in the random walk model, simply compounded rates of return and portfolio ending values are lognormally distributed. Continuously compounded rates of returns are normally distributed. The standard deviation of the annualized continuously compounded returns decreases in proportion to the square root of the time horizon. The standard deviation of the total continuously compounded returns increases in proportion to the square root of the time horizon. Thus, for example, a 16 year investment is 4 times as uncertain as a 1 year investment if we measure "uncertainty" as standard deviation of continuously compounded total return.
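
In symbols, with the same random walk assumptions: if the continuously compounded return in each year has standard deviation \sigma, then over a horizon of T independent years

    SD(total return) = \sigma \sqrt{T},        SD(annualized return) = \sigma / \sqrt{T},

so at T = 16 the uncertainty in the total return is \sqrt{16} = 4 times the one-year figure, even though the annualized figure has shrunk by the same factor.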

As an example, those nice bar charts would look quite different and would certainly leave quite a different impression on the reader if they properly showed minimum and maximum total returns rather than the misleading minimum and maximum annualized returns. As time increases, we'd clearly be able to see that the spread of possible ending values of our portfolio, which is what we care about, gets larger and larger, and hence more and more uncertain. After 30 or 40 years the spreads are quite enormous and clearly show how the uncertainty of investing increases dramatically at very long horizons. Investing over these long periods of time suddenly changes in the reader's mind from a sure thing to a very unsure thing indeed!

For an example of a bar chart which shows a better picture of uncertainty and risk over time, see the Appendix below.

Common variants of this time diversification argument can be found in many popular books and articles on investing, including those by highly respected professionals and even academics. For example, John Bogle used this argument in his otherwise totally excellent February, 1999 speech The Clash of the Cultures in Investing: Complexity vs. Simplicity (see his chart titled "Risk: The Moderation of Compounding 1802-1997," which if it had been properly drawn might well have been titled "Risk: The Exacerbation of Compounding 1802-1997"). Burton Malkiel uses a similar argument and chart in his classic book A Random Walk Down Wall Street (see chapter 14 of the sixth edition). (I deliberately chose two of my all-time favorite authors here to emphasize just how pervasive this fallacy is in the literature.)

The fact that some highly respected, justly admired and otherwise totally worthy professionals use this argument does not make it correct. The argument is in fact just plain wrong - it's a fallacy, pure and simple. When you see it you should dismiss it in the same way that you dismiss urban legends about alligators in sewers and hot stock tips you find on the Internet (you do dismiss those, don't you? :-). It's difficult to do this because the argument is so ubiquitous that it has become an unquestioned assumption in the investment world.

For more details on this fallacy, see the textbook Investments by Bodie, Kane, and Marcus (fourth edition, chapter 8, appendix C), or sections 6.7 and 6.8 of my own paper Random Walks.

The Utility Theory Argument

Robert Merton and Paul Samuelson (both recipients of the Nobel prize in Economic Sciences) use the following argument (among others) to dispute the notion that time necessarily ameliorates risk, and are responsible for much of the mathematics behind the argument. Their argument involves utility theory, a part of Economics which requires a bit of introduction.

Most investors are "risk-averse." For example, a risk-averse investor would refuse to play a "fair game" where he has an equal chance of losing or winning the same amount of money, for an expected return of 0%. Whenever the outcome of an investment is uncertain, a risk-averse investor demands an expected return greater than 0% as a "risk premium" that compensates him for undertaking the risk of the investment. (Actually, such an investor demands an expected return in excess of the risk-free rate, but we'll ignore that detail for the moment.) In general, investors demand higher risk premiums for more volatile investments. One of the fundamental truths of the marketplace is that risk and return always go hand-in-hand in this way. (To be precise, "systematic risk" and "excess return" always go hand-in-hand. Fortunately these are once again details that we don't need to worry about in this paper.)

In Economics the classic way to measure this notion of "risk aversion" is by using "utility functions." A utility function gives us a way to measure an investor's relative preference for different levels of wealth and to measure his willingness to undertake different amounts of risk in the hope of attaining greater wealth. Among other things, formalizing the notion of risk aversion using utility functions makes it possible to develop the mathematics of portfolio optimization. Thus utility theory lies at the heart of and is a prerequisite for modern portfolio theory.

There's a special class of utility functions called the "iso-elastic" functions which characterizes those investors whose relative attitudes towards risk are independent of wealth. For example, suppose you have a current wealth of $10,000 and your preferred asset allocation with your risk tolerance is 50% stocks and 50% bonds for some fixed time horizon. Now suppose your wealth is $100,000 instead of $10,000. Would you change your portfolio's asset allocation for the same time horizon? If you wouldn't, and if you wouldn't change your asset allocation at any other level of wealth either, then you have an iso-elastic utility function (by definition, whether you know it or not). These iso-elastic functions have the property of "constant relative risk aversion."
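
For concreteness, the iso-elastic family has the standard textbook form

    U(W) = W^{1-\gamma} / (1 - \gamma)   for \gamma > 0, \gamma != 1,        U(W) = \ln W   for \gamma = 1,

where the single parameter \gamma is the coefficient of relative risk aversion. Scaling wealth by any constant changes U only by a positive multiple (or, in the log case, an additive constant), which is exactly why this investor's preferred allocation does not depend on the level of wealth.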

Note that the definition of these iso-elastic functions is stated in terms of an investor's risk preferences at different levels of wealth, over some fixed time horizon (e.g., 1 year). The property of "constant relative risk aversion" means that the investor's preferred asset allocation (relative exposure to risk) is constant with respect to wealth over this fixed horizon. For the same fixed time horizon, this kind of investor prefers the same asset allocation at all levels of wealth. We aren't ready yet to start talking about other time horizons. We'll get to that later.

While there's no reason to believe that any given investor has or should have iso-elastic utility, it seems reasonable to say that such a hypothetical investor is not pathological. Indeed, constant relative risk aversion is often used as a neutral benchmark against which investors' attitudes towards risk and wealth are measured.

For example, one reasonable investor might be more conservative when he is rich, perhaps because he is concerned about preserving the wealth he has accumulated, whereas when he is poor he takes on more risk, perhaps because he feels he's going to need much more money in the future. This investor has "increasing relative risk aversion."

On the other hand, a different equally reasonable investor might have the opposite attitude. She is more aggressive when she is rich, perhaps because she feels at some point that she already has more than enough money, so she can afford to take on more risk with the excess, whereas when she is poor, she is more concerned about losing the money she needs to live on, so she is more conservative. This investor has "decreasing relative risk aversion."

We can easily imagine more complicated scenarios, where an investor might have increasing relative risk aversion over one range of wealth and decreasing relative risk aversion over some other range.

These attitudes are all reasonable possibilities. All of these investors are risk-averse. They differ only in their degree of risk aversion and their patterns of risk aversion as their wealth increases and decreases. None of the utility functions corresponding to these preferences and patterns are right or wrong or better or worse than the other ones. Everything depends on the individual investor's attitudes. Utility theory does not dictate or judge these attitudes, it just gives us a way to measure them.

In any case, we often think of iso-elastic utility with constant relative risk aversion as a kind of central or neutral position.

Note once again that up to this point in the discussion we have kept the time horizon fixed at some constant value (e.g., 1 year). We have not yet talked about how attitudes towards risk might change with time horizon. All we have talked about so far is how attitudes towards risk might change with wealth over the same fixed time horizon.

Now we're ready for the interesting part of the argument, where we finally make the time horizon a variable. If time necessarily ameliorates risk, one would expect that any rational investor's optimal asset allocation would become more aggressive with longer time horizons. For example, one would certainly expect this to be true for investors with middle-of-the-road iso-elastic utility.

When we do the simple math using calculus and probability theory in the random walk model, however, we get a big surprise. This is not at all what happens. For iso-elastic utility functions, relative attitudes towards risk are also necessarily independent of time horizon. For example, if a 50/50 stock/bond asset allocation is optimal for a 1 year time horizon for an investor with iso-elastic utility, it is also optimal for a 20 year time horizon and all other time horizons!
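
One way to see why the horizon drops out: in the continuous-time version of this problem (the one Merton solved), an investor with iso-elastic utility and relative risk aversion \gamma, facing a risky asset with expected excess return \mu - r and volatility \sigma, optimally holds the fraction

    w* = (\mu - r) / (\gamma \sigma^2)

of wealth in the risky asset, an expression in which neither current wealth nor the time remaining appears.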

To summarize, if a rational investor's relative attitudes towards risk are independent of wealth, they are also necessarily independent of time horizon. This is a deep result tying together the three notions of risk, wealth, and time. The result is counter-intuitive to many, but it is nonetheless true. The mathematics is inescapable.

Note once again that we are not arguing that all investors, or even most investors, have iso-elastic utility. The use of iso-elastic utility in this argument simply calls into question the conventional wisdom that time ameliorates risk under all circumstances, regardless of one's attitudes towards risk and wealth. The argument should make people who believe unconditionally that time ameliorates risk at least willing to rethink their position.

For more information about utility theory see my paper An Introduction to Utility Theory. For the complete formal proof that investors with iso-elastic utility have the same optimal asset allocation at all time horizons in the random walk model, see my paper An Introduction to Portfolio Theory. Beware: both papers have lots of mathematics. For the almost incomprehensibly complex (but fascinating) math in the general case where the investor is permitted to modify his portfolio continuously through time, see Robert Merton's exquisitely difficult book Continuous Time Finance. (Someday I hope to know enough about all this stuff to actually understand this book. That day is still rather far away, I'm afraid, but it's a goal worth striving towards, and sometimes I actually manage to fool myself into thinking that I'm making some progress.)

Probability of Shortfall

Another argument often found in the popular literature on investing is that as the time horizon increases, the probability of losing money in a risky investment decreases, at least for normal investments with positive expected returns. This is true both when one looks at historical market return data and in the abstract random walk model. (This, by the way, is essentially the argument that Taylor Larimore presented in the Morningstar conversation referred to in the Introduction.) It's even true if we consider not just the probability of losing money, but the probability of making less money than we could in a risk-free investment like US Treasury bonds, provided that our risky investment has an expected return higher than that of T-Bonds.

For example, in the random walk model of the S&P 500 stock market index in the Appendix below, the probability that a stock investment will earn less than a bank account earning 6% interest is 42% after 1 year. After 40 years this probability decreases to only 10%. Doesn't this prove that risk decreases with time?
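
As a rough check on those figures, here is a small C program for the lognormal random walk model (the 10% mean and 20% standard deviation of annual continuously compounded returns are assumed, illustrative parameters, not necessarily the exact ones behind the numbers above, though they land close to the quoted 42% and 10%):

#include <math.h>
#include <stdio.h>

/* standard normal CDF, via erfc() from C99's math library (link with -lm) */
static double norm_cdf(double z) { return 0.5 * erfc(-z / sqrt(2.0)); }

int main(void)
{
    const double m = 0.10;            /* assumed mean of annual cc return */
    const double s = 0.20;            /* assumed sd of annual cc return   */
    const double bench = log(1.06);   /* 6%/yr benchmark, cc equivalent   */
    const double horizons[] = { 1.0, 40.0 };

    for (int i = 0; i < 2; i++) {
        double T = horizons[i];
        /* total cc return ~ Normal(m*T, s*sqrt(T)); shortfall if below bench*T */
        double z = (bench * T - m * T) / (s * sqrt(T));
        printf("P(stocks underperform 6%% over %2.0f years) = %4.1f%%\n",
               T, 100.0 * norm_cdf(z));
    }
    return 0;
}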

The problem with this argument is that it treats all shortfalls equally. A loss of $1000 is treated the same as a loss of $1! This is clearly not fair. For example, if I invest $5000, a loss of $1000, while less likely, is certainly a more devastating loss to me than is a loss of $1, and it should be weighted more heavily in the argument. Similarly, the argument treats all gains equally, which is not fair by the same reasoning.

As a first example, consider a simple puzzle which I hope will make this problem clear. Suppose for the sake of argument that "probability of loss," which is a special case of "probability of shortfall," is a good definition of the "risk" of an investment. Consider two investments A and B which both cost $1000. With A, there's a 50% chance of making $500 and a 50% chance of losing $1. With B, there's a 50% chance of making $1 and a 50% chance of losing $500. A and B have exactly the same probability of loss: 50%. Therefore A and B have exactly the same "risk." What's wrong with this picture?

As a second example, suppose you had the opportunity to make some kind of strange investment which cost $5000 and which had two possible outcomes. In the good case, which has probability 99%, you make $500. In the bad case, which has probability 1%, you lose your entire $5000. Is this a good investment? If not, why not - the probability of loss is only 1%, isn't it? Can't we safely ignore such a small chance of losing money? This investment even has a positive expected rate of return of 8.9%! In this kind of extreme example the problem becomes obvious. This is perhaps not such a great investment after all, at least for some people, because we simply must take into account both the probabilities of the possible outcomes and their magnitudes, not just the probability that we're going to lose money.
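
(For the record, the 8.9% figure is just the probability-weighted average outcome: 0.99 × $500 − 0.01 × $5000 = $445, and $445 / $5000 = 8.9%.)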

To make the problem even worse, suppose you're a starving graduate student with only $5000 to your name, and you need that money to pay your tuition. Is this a good investment for you? Now suppose instead that you're Bill Gates. Would you be willing to risk the loss of an insignificant fraction of your fortune to make a profit of $500 with probability 99%? The situation changes a bit, doesn't it? It seems clear that we also must somehow take into account the investor's current total wealth when we investigate the meaning of "risk" in our example.

As one last mental experiment using this example, how would the situation change if in the bad case you only lost $1000, one fifth of your investment, instead of all of it? The probability of loss is still the same 1%, but it's really a radically different problem, isn't it?

The point of our deliberately extreme pair of examples is that the probability of shortfall measure is much too oversimplified to be a reliable measure of the "risk" of an investment.

Our third example is more realistic. In this example we look at investing in the S&P 500 stock market index over 1 year and over 3 years.

To begin, we compute that the probability of losing money in the S&P 500 random walk model is 31% over 1 year but drops to 19% after 3 years. If we use the naive definition of risk as "probability of loss," we would stop thinking about the problem at this point and conclude that the 3 year investment is less risky than the 1 year investment. We hope that at this point in the discussion, however, the reader is convinced that we need to look further.

The probability of losing 20% or more of our money in the S&P 500 is 5.0% after 1 year. The probability is 6.4% after 3 years. This bad outcome is actually more likely after 3 years than it is after 1 year!

The situation rapidly deteriorates when we start to look at the really bad outcomes, the ones that really scare us. Losing 30% or more of our money is 2.8 times more likely after 3 years than it is after 1 year. Losing 40% or more of our money is 9.7 times more likely after 3 years than it is after 1 year. Losing 50% or more of our money is a whopping 71 times more likely after 3 years than it is after 1 year!
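
The same toy model (same assumed 10%/20% parameters as in the program above, so the exact multiples differ a bit from the figures quoted here, but the pattern is the same) shows how the deep-loss probabilities grow with the horizon:

#include <math.h>
#include <stdio.h>

static double norm_cdf(double z) { return 0.5 * erfc(-z / sqrt(2.0)); }

/* P(total simple return <= -loss) after T years in the lognormal model */
static double p_loss(double loss, double T, double m, double s)
{
    return norm_cdf((log(1.0 - loss) - m * T) / (s * sqrt(T)));
}

int main(void)
{
    const double m = 0.10, s = 0.20;   /* assumed annual cc mean / sd */
    const double losses[] = { 0.20, 0.30, 0.40, 0.50 };

    for (int i = 0; i < 4; i++) {
        double p1 = p_loss(losses[i], 1.0, m, s);
        double p3 = p_loss(losses[i], 3.0, m, s);
        printf("lose %2.0f%% or more:  1 yr %8.4f%%   3 yr %8.4f%%   ratio %4.1fx\n",
               100.0 * losses[i], 100.0 * p1, 100.0 * p3, p3 / p1);
    }
    return 0;
}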

When we start looking at more of the possible outcomes than just the single "lose money" outcome, the risk picture becomes much less clear. It is no longer quite so obvious that the 3 year investment is less risky than the 1 year investment. We start to realize that the situation is more complicated than we had first thought.

Let's take a detour from all this talk about abstract math models for a moment and interject a historical note to go along with our third example. Astute readers might argue at this point that the probabilities of these really bad outcomes, in the 20-50% loss range, are very small, and they would be correct. Isn't it kind of silly to pay all this attention to these low probability possibilities? Can't we safely ignore them? One need only look to the years 1930-1932 to dismiss this argument. Over that 3 year period the S&P 500 lost 61% of its value. In comparison, US Treasury bills gained 4.5% over the same period. There is no reason to believe that the same thing or even worse can't happen again in the future, perhaps over even longer time periods. It's interesting to note that the S&P 500 has never had a loss anywhere near as large as 61% in a single year. (The largest one year loss was 43% in 1931.) It took three years of smaller losses to add up to the 61% total loss over 1930-1932. This illustrates the point we made in our example that disastrous losses actually become more likely over longer time horizons.

If you think that 3 years is too short a period of time, and that given more time stock investing must inevitably be a sure bet, consider the 15 years from 1968 through 1982, when after adjusting for inflation the S&P 500 lost a total of 4.62%. Don't forget that with today's new inflation-protected US bonds, you could easily guarantee an inflation-adjusted return of 25% over 15 years (using a conservative estimate of a 1.5% annualized return in excess of inflation), no matter how bad inflation might get in the future. Are you really prepared to say beyond a shadow of a doubt that a period of high inflation and low stock returns like 1968-1982 can't happen again within your lifetime, or something even worse? Some older people in the US remember this period, which wasn't all that long ago, and they'll tell you that it was a very unpleasant time indeed to be a stock investor. If you'd like another example, consider the near total collapse of the German financial markets between the two world wars, or the recent experience of the Japanese markets, or the markets in other countries during prolonged bad periods in their histories, which in many cases lasted much longer than 15 years. Do you really feel that it's a 100% certainty that something like this couldn't happen here in the US? If we're going to take this notion of "risk" seriously, don't we have to deal with these possibilities, even if they have low probabilities? That in a nutshell is what our argument is all about. It's not just the theory and the abstract math models which teach us that risk is real over long time horizons. History teaches the same lesson.

While we cannot let these disastrous possible outcomes dominate our decision making, and none of the arguments in this paper do so, we also cannot dismiss them just because they're unlikely and they frighten us. Once again, when we think about risk, we have to consider both the magnitudes and the probabilities of all the possible outcomes. This includes the good ones, the bad ones, and the ones in the middle. None of the possible outcomes can be ignored, and none can be permitted to dominate.

Now let's return to our example of the S&P 500 under-performing a 6% risk-free investment with probability 42% after 1 year but with a probability of only 10% after 40 years. The reason we cannot immediately conclude that the 40 year S&P 500 investment is less risky than the 1 year investment is that over 40 years the spread of possible outcomes is very wide, and truly disastrous shortfalls of very large magnitudes become more likely, albeit still very unlikely. For sufficiently large possible shortfalls, they actually become much more likely after 40 years than they were after 1 year! We must take these possibilities into account in our assessment of the risk of the 40 year investment. We cannot simply pretend they don't exist or treat them the same as small losses. Similarly, truly enormous gains become more likely, and we must take those into account too. We have to consider all the possibilities.

To solve the problem with using probability of shortfall as a measure of risk, we must at least attach greater negative weights to losses of larger magnitude and greater positive weights to gains of larger magnitude. Then we must somehow take the (probabilistic) average of the weighted possibilities to come up with a fair measure of "risk." How can we do this? This is exactly what utility theory is all about - the assignment of appropriate weights to possible outcomes. Utility theory also addresses the problem of changes in attitudes towards risk as a function of an investor's current wealth.
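
In symbols, the prescription is simply to rank alternatives by expected utility of ending wealth, i.e., to prefer the portfolio that maximizes E[U(W_T)], where W_T is wealth at the horizon and U encodes the weights just described.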

In the risk-averse universe in which we live, gains and losses of equal magnitude do not just cancel out, so the math isn't trivial. A loss of $x is more of a "bad thing" than a gain of $x is a "good thing." In utility theory this is called "decreasing marginal utility of wealth," and it's equivalent to the notion of "risk aversion." In particular, really disastrous large losses are weighted quite heavily, as they should be. They may have tiny probabilities, but we still need to consider them. For each possible outcome, we need to consider both the probability of the outcome and the weight of the outcome.

When we do the precise calculations using utility theory and integral calculus, the results are inconclusive. As we saw in the previous section, for iso-elastic utility functions with constant relative risk aversion, risk is independent of time, in the sense that the optimal asset allocation is the same at all time horizons. For other kinds of utility functions, risk may increase or decrease with time horizon, depending on the investor.

There is no reason to believe that all investors or even the mythical "typical" or "average" investor has any one particular kind of utility function.

To summarize, simple probability of shortfall is an inadequate measure of risk because it fails to take into account the magnitudes of the possible shortfalls and gains. When we attempt to correct this simple measure of risk by taking the magnitudes into account, we are led to utility theory, which tells us that there is no absolute sense in which we can claim that risk either increases or decreases with time horizon. All it tells us is that it depends on the individual investor, his current wealth, and his risk tolerance patterns as expressed by his particular utility function. For some investors, in this model we can say that risk increases with time. For others, risk decreases with time. For still others, risk is independent of time.

While this may seem inconclusive, and it is, there is one thing that we can conclude: The often-heard probability of shortfall argument in no way proves or even argues convincingly that time ameliorates risk. You should dismiss such arguments whenever you see them, and if you do any reading about investments at all, you will see them frequently. Don't be lulled into a false sense of security by these arguments.

The Option Pricing Theory Argument

Zvi Bodie, a Finance professor at Boston University, came up with an elegant argument that proves that risk actually increases with time horizon, at least for one reasonable definition of "risk." His argument uses the theory of option pricing and the famous Black-Scholes equation. This shouldn't scare the reader, though, because Bodie's argument is really quite simple and easy to understand, and we're going to give a real life example later that doesn't involve any fancy math at all.

Suppose we have a portfolio currently invested in the stock market for some time horizon. One of our alternatives is to sell the entire portfolio and put all of our money into a risk-free zero-coupon US Treasury bond which matures at the same time horizon. (A "zero-coupon" bond pays all of its interest when it matures. This is commonly used as the standard risk-free investment over a fixed time horizon, because all of the payoff occurs at the end of the time period, and the payoff is guaranteed by the US government. This is as close to "risk-free" as we can get in the real world.)

It is reasonable to think of the "risk" of our stock investment as the risk of not making as much money as we would with the bond. If we accept this notion of "risk," it then makes sense to measure the magnitude of the risk as the cost of an insurance policy against a possible shortfall. That is, if someone sells us such a policy, and if at our time horizon we haven't made as much money in the stock market as we would have made with the bond, then the insurer will make up the difference.

This insurance policy is nothing more or less than a European put option on our stock portfolio. The strike price of the option is the payoff of the bond at the end of our time period. The expiration date of the option is the end of our time period. In fact, put options are frequently used in the real world for exactly this kind of "portfolio insurance."

If you plug all the numbers into the Black-Scholes equation for pricing European put and call options, you end up with a very simple equation in which it is clear that the price of the put option increases with time to expiration. In fact, one of the first things students of options learn is the general rule that the price of an option (put or call) increases with time to expiration. It turns out this is even true when we let the strike price increase over time at the risk-free rate. We have taken the price of our put option to be our measure of the magnitude of the risk of our stock investment. Thus, with this model, risk increases with time.
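
Here is that calculation as a small C program (the 20% annual volatility is an assumed, illustrative figure; everything else follows directly from the Black-Scholes put formula with the strike set to the risk-free bond payoff):

#include <math.h>
#include <stdio.h>

static double norm_cdf(double z) { return 0.5 * erfc(-z / sqrt(2.0)); }

/* Black-Scholes price of a European put: spot S, strike K, rate r, vol sigma, T years */
static double bs_put(double S, double K, double r, double sigma, double T)
{
    double d1 = (log(S / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * sqrt(T));
    double d2 = d1 - sigma * sqrt(T);
    return K * exp(-r * T) * norm_cdf(-d2) - S * norm_cdf(-d1);
}

int main(void)
{
    const double S = 100.0;       /* portfolio value, in arbitrary units    */
    const double r = 0.06;        /* risk-free rate                         */
    const double sigma = 0.20;    /* assumed annual volatility              */
    const double horizons[] = { 0.25, 1.0, 5.0, 10.0, 20.0, 40.0 };

    for (int i = 0; i < 6; i++) {
        double T = horizons[i];
        double K = S * exp(r * T);    /* strike grows at the risk-free rate */
        printf("T = %5.2f yr: shortfall insurance costs %4.1f%% of the portfolio\n",
               T, 100.0 * bs_put(S, K, r, sigma, T) / S);
    }
    return 0;
}

With the strike set to the bond payoff, the put price reduces to S·(2Φ(σ√T/2) − 1), which can only increase as T increases, no matter what volatility you assume.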

Let's work this argument out in a more personal way in the hope that it will clarify it and perhaps make the reader do some serious thinking about the issue. Suppose for the sake of discussion that you agree with the conventional wisdom that risk decreases with time horizon. In that case, if you were in the business of selling portfolio insurance, would you offer discounts on your policies for longer time horizons? If you really believe in your opinion, then you should be willing to do this, shouldn't you? For example, you should be willing to sell someone an insurance policy against a shortfall after ten years for less money than you'd be willing to sell someone else an otherwise identical policy against a shortfall after one year. After all, according to your beliefs, your risk of having to pay off on the policy is smaller after ten years than it is after one year.

If this is really how you feel, and if you agree with the scenario outlined in the previous paragraph, then you must also feel that the Black-Scholes equation is wrong. Perhaps Black, Scholes, and Merton made some horrible mistake in their derivation of the equation. Not a small mistake either - they must have reversed a sign somewhere! If this is the case, we'd better run over to the Chicago Board Options Exchange and let all those option traders with their Black-Scholes calculators know that they've been doing it wrong all these years. (Maybe they'd get it right if they all held their calculators upside down to read the answers. :-)

We must emphasize that this is not just some arcane theory with no practical application. Option traders buy and sell this kind of portfolio insurance in the form of put options every day in the real life financial markets.

As a concrete example which we'll examine in some detail, let's look at what insurance policies are selling for today, on April 9, 2000. We'll look at CBOE put options on the S&P 500 stock market index and compare short-term prices for June 2000 options against longer-term prices for December 2001 options.

The S&P 500 index is currently at 1,516. Current interest rates are about 6%. On June 17, 2000, 2.3 months from now, 1,516 would grow to 1,533 at 6% interest. On December 22, 2001, 20.4 months from now, 1,516 would grow to 1,675 at 6% interest. (If you're wondering where these exact dates come from, options on the CBOE expire on the first Saturday following the third Friday of each month.)

According to today's quotes on the CBOE web site, June 2000 put options on the S&P 500 with a strike price of 1,533 are selling at about $58. December 2001 put options on the S&P 500 with a strike price of 1,675 are selling at about $188. (I had to do a bit of mild interpolation to get these numbers, but whatever small errors were introduced do not significantly affect our example.)

To make the example even more concrete, let's suppose you currently have $151,600 invested in Vanguard's S&P 500 index fund. If you wanted to buy an insurance policy against your fund earning less money than you could in a bank CD or with a US Treasury bill at 6% interest, you could easily call your broker or log on to your online trading account and buy such a policy in the options market. For a 2.3 month time horizon, you would have to pay $5,800 for your policy. For a 20.4 month time horizon, you would have to pay $18,800. (Plus a juicy commission for your broker or online trading company, of course, but we'll ignore that unpleasant detail.)
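
As a quick sanity check on these numbers, here is a small sketch (mine, not the article's) of the arithmetic. It assumes annual compounding at 6% and the standard $100-per-point multiplier for S&P 500 index options; the small differences from the quoted strikes are rounding.

    index_level = 1516.0
    rate = 0.06
    portfolio = 151_600.0  # the $151,600 S&P 500 index fund position above

    horizons = {"Jun 2000": 2.3 / 12, "Dec 2001": 20.4 / 12}  # in years
    quoted_puts = {"Jun 2000": 58.0, "Dec 2001": 188.0}       # quoted price per index point

    for label, t in horizons.items():
        strike = index_level * (1 + rate) ** t  # where 1,516 grows to at 6% interest
        cost = quoted_puts[label] * 100         # SPX options settle at $100 per point
        print(f"{label}: strike ~ {strike:6.0f}, policy cost ~ ${cost:,.0f}"
              f" ({cost / portfolio:.1%} of the portfolio)")

It reproduces the $5,800 and $18,800 policy costs, which come to roughly 4% and 12% of the portfolio.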

Thus, right now, professional option traders apparently believe that the risk of the S&P 500 under-performing a 6% risk-free investment is more than 3 times greater over a 20.4 month horizon than it is over a 2.3 month horizon. This is not some quirk of our example that's only true today or with our specific numbers and dates. In the real life options market this basic phenomenon of portfolio insurance policies costing significantly more for longer time horizons is virtually always true.

In this example, if you believe that the risk of investing in the S&P 500 decreases with time horizon, and in particular that there's less risk over 20.4 months than there is over 2.3 months, there are only three possibilities:

  1. You are wrong.
  2. Professional option traders are wrong.
  3. There's something wrong with Bodie's simple definition of "risk."

Which is it? It makes you think, doesn't it? Of all the arguments presented so far, I find this one the most convincing.

For the original complete argument see Bodie's paper titled "On the Risk of Stocks in the Long Run" in the Financial Analysts Journal, May-June 1995. You can also find a version of the argument with more of the mathematics than I presented here, plus a graph showing risk increasing over time, in section 6.8 of my paper Random Walks.

Human Capital

The only argument I find valid for the popular opinion that the risk of investing in stocks decreases with time horizon, and in particular the popular opinion that young people should have more aggressive portfolios, involves something called "human capital."

"Human capital" is simply the Economist's fancy term for all the money you will earn at your job for the rest of your working life (discounted to present value using an appropriate discount rate, but we needn't go into those details here.)

The basic argument is that retired people who obtain living expenses from the earnings of their investment portfolios cannot afford as much risk as younger people with long working lives ahead of them and the accompanying regular paychecks.

In this model, the older you get, the fewer working years you have left, and the smaller your human capital becomes. Thus as you age and get closer to retirement, your investment portfolio should gradually become more conservative.

Note, however, that most retirees do not obtain 100% of their income from investment portfolios. Social security benefits, pensions, and annuities provide steady income streams for many retirees. For our purposes, these guaranteed sources of regular income are no different than the regular income received from a paycheck prior to retirement. Any complete treatment of the issue of adjusting the aggressiveness of a portfolio before and after retirement must take these sources of income into account.

Note also that this argument doesn't work well in some cases. For example, an aerospace engineer whose entire investment portfolio consists of stock in his employer's company is playing a risky game indeed. Similarly, people employed in the investment world would suffer from a high correlation between their human capital and their portfolio, and they might be well-advised to be a bit more conservative than other people of the same age working outside the investment world.

These issues are complicated. Modeling them formally involves a complete life cycle model that takes into account income, consumption, and investment both before and after retirement, and treats human capital and other sources of income as part of the risk aversion computation machinery for determining optimal portfolio asset allocations. I don't pretend to understand all the math yet (it's more of that horribly complicated stuff Merton does), but I hope to some day!

Conclusion

Nearly everyone shares the "obvious" opinion that if you have a longer time horizon, you can afford to and should have a more aggressive investment portfolio than someone with a shorter time horizon. Indeed, it's difficult to visit any web site on investing or read any article or book on investing without being reminded of this "fact" by all sorts of pundits, experts, and professionals, using all kinds of fancy and convincing charts, graphs, statistics, and even (in the case of the web) state of the art interactive Java applets! (For an example of a Java applet, see Time vs. Risk: The Long-Term Case for Stocks [ed -- link to www.smartmoney.com/ac/retirement/investing/index.cfm?story=timerisk hasn't worked for years, maybe even decades] at the SmartMoney web site. It's a great example of the fallacy of time diversification in action.)

On close examination, however, we discover that most of the arguments made in support of this opinion, on those occasions when any argument other than "common sense" is given at all, are either fallacious or at best highly suspect and misleading.

The more we learn about this problem and think about it, the more we come to realize that it's possible that the situation isn't as obvious as we had thought, and that perhaps "common sense" isn't a reliable road to the truth, as is often the case in complex situations which involve making decisions under conditions of uncertainty. In these situations we must build and test models and use mathematics to derive properties of the models. The fact that our mathematics sometimes leads to results which we find counter-intuitive is not sufficient reason to discard the results out of hand. People who have studied a significant amount of math or science will not find it surprising that the truth is often counter-intuitive. Others find this more difficult to accept, but accept it they must if they wish to take these problems seriously.

We have seen at least one compelling argument (Bodie's option pricing theory argument) that the opposite of the commonly held belief is true: If we assume an entirely reasonable definition of the notion of "risk," the risk of investing in volatile assets like stocks actually increases with time horizon!

The only argument supporting the conventional wisdom that survives close examination is the one relying on human capital. Younger people with secure long-term employment prospects may in some circumstances have good reason to be somewhat more aggressive than older people or those with less secure employment prospects.

In any case, the most commonly heard arguments which rely on the fallacy of time diversification or which use probability of shortfall as a risk measure are clearly flawed and should be ignored whenever they are encountered, which is, alas, all too frequently.

Appendix - A Better Bar Chart Showing Risk Over Time

This chart shows the growth of a $1000 investment in a random walk model of the S&P 500 stock market index over time horizons ranging from 1 to 40 years. It pretty much speaks for itself, I hope - that was the intention, anyway.

The chart clearly shows the dramatically increasing uncertainty of an S&P 500 stock investment as the time horizon increases. For example, at 40 years, the chart gives only a 2 in 3 chance that the ending value will be somewhere between $14,000 and $166,000. This is an enormous range of possible outcomes, and there's a significant 1 in 3 chance that the actual ending value will be below or above the range! You can't get much more uncertain than this.

As long as we're talking about risk, let's consider a really bad case. If instead of investing our $1000 in the S&P 500, we put it in a bank earning 6% interest, after 40 years we'd have $10,286. This is 1.26 standard deviations below the median ending value of the S&P 500 investment. The probability of ending up below this point is 10%. In other words, even over a very long 40 year time horizon, we still have about a 1 in 10 chance of ending up with less money than if we had put it in the bank!

Look at the median curve - the top of the purple rectangles, and follow it with your eye as time increases. You see the typical geometric growth you get with the magic of compounding. Imagine the chart if all we drew was that curve, so we were illustrating only the median growth curve without showing the other possible outcomes and their ranges. It would paint quite a different picture, wouldn't it? When you're doing financial planning, it's extremely important to look at both return and risk.

There's one problem with this chart. It involves a phenomenon called "reversion to mean." Some (but not all) academics and other experts believe that over long periods of time financial markets which have done better than usual in the past tend to do worse than usual in the future, and vice-versa. The effect of this phenomenon on the pure random walk model we've used to draw the chart is to decrease somewhat the standard deviations at longer time horizons. The net result is that the dramatic widening of the spread of possible outcomes shown in the chart is not as pronounced. The +1 standard deviation ending values (the tops of the bars) come down quite a bit, and the -1 standard deviation ending values come up a little bit. The phenomenon is not, however, anywhere near so pronounced as to actually make the +1 and -1 standard deviation curves get closer together over time. The basic conclusion that the uncertainty of the ending values increases with time does not change.

For those who might be interested in how I created the chart, here are the details:

I first got historical S&P 500 total return data from 1926 through 1994 from Table 2.4 in the book Finance by Zvi Bodie and Robert Merton.

I typed the data into Microsoft Excel, converted all the simply compounded yearly returns into continuous compounding (by taking the natural logarithm of one plus each simply compounded return), and then computed the mean and the standard deviation. I got the following pair of numbers "mu" and "sigma":

mu = 9.7070% = Average annual continuously compounded return. This corresponds to an average annual simply compounded return of 12.30% (the arithmetic mean) and an average annualized simply compounded return of 10.19% (the geometric mean).

sigma = 19.4756% = The standard deviation of the annual continuously compounded returns.
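
For anyone who wants to reproduce this step, here is a small sketch (mine) of the conversion just described. The returns in it are placeholders rather than the Bodie and Merton data; with the real 1926-1994 series, the printed mu and sigma should match the numbers above.

    from math import exp, log
    from statistics import mean, stdev

    # Placeholder returns -- the real input is the 1926-1994 S&P 500 series.
    simple_returns = [0.12, -0.05, 0.30, 0.08, -0.18]

    log_returns = [log(1 + r) for r in simple_returns]  # continuous compounding
    mu, sigma = mean(log_returns), stdev(log_returns)

    print(f"mu    = {mu:.4%} (continuously compounded mean)")
    print(f"sigma = {sigma:.4%} (continuously compounded std dev)")
    print(f"arithmetic mean, simply compounded = {mean(simple_returns):.2%}")
    print(f"geometric mean, simply compounded  = {exp(mu) - 1:.2%}")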

(The steps up to this point were actually done a long time ago for other projects.)

I then used Excel to draw the bar chart. At t years, the -1 standard deviation, median, and +1 standard deviation ending values were computed by the following formulas:

-1 s.d. ending value = 1000*exp(mu*t - sigma*sqrt(t))

median ending value = 1000*exp(mu*t)

+1 s.d. ending value = 1000*exp(mu*t + sigma*sqrt(t))
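
Here is a short sketch (mine, not part of the original appendix) that reproduces the chart's numbers directly from these formulas and also reruns the bank comparison from earlier in the appendix.

    from math import erf, exp, log, sqrt

    mu, sigma, start = 0.097070, 0.194756, 1000.0

    def norm_cdf(x):
        # Standard normal CDF via the error function.
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))

    for t in (1, 10, 20, 30, 40):
        low = start * exp(mu * t - sigma * sqrt(t))    # -1 s.d. ending value
        median = start * exp(mu * t)                   # median ending value
        high = start * exp(mu * t + sigma * sqrt(t))   # +1 s.d. ending value
        print(f"{t:2d} yr:  {low:9,.0f}  {median:9,.0f}  {high:9,.0f}")

    # The bank comparison: $1000 at 6% for 40 years, and where that falls
    # relative to the distribution of stock outcomes.
    bank = start * 1.06 ** 40
    z = (log(bank / start) - mu * 40) / (sigma * sqrt(40))
    print(f"bank: ${bank:,.0f}, {z:+.2f} s.d. from the median,"
          f" P(stocks end below the bank) ~ {norm_cdf(z):.0%}")

At 40 years it prints roughly $14,000, $49,000, and $166,000 for the -1 s.d., median, and +1 s.d. ending values, and about a 10% chance of ending up below the bank.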

I then copied the chart from Excel to the drawing module in the AppleWorks program and used AppleWorks to annotate it and save it as a GIF file that I could use in this web page.

Windows: a software engineering odyssey

2000-01-01 08:00:00

This is a text transcription of the slides from the "Windows: a software engineering odyssey" talk given on Microsoft culture by Mark Lucovsky in 2000. This is hosted here because I wanted to link to the slides, but the only formats available online were PowerPoint and slide-per-page HTML where each page is basically a screenshot of a PowerPoint slide. If you're looking for something on current Microsoft culture, try these links.

Agenda

  • History of NT
  • Design Goals/Culture
  • NT 3.1 vs. Win2k
  • The next 10 years

NT timeline: first 10 years

  • 2/89: Coding begins
  • 7/93: NT 3.1 ships
  • 9/94: NT 3.5 ships
  • 5/95: NT 3.51 ships
  • 7/96: NT 4.0 ships
  • 12/99: NT 5.0 a.k.a. Windows 2000 ships

Unix timeline: first 20 years

  • 69: coding begins
  • 71: first edition -- PDP 11/20
  • 73: fourth edition -- rewritten in C
  • 75: fifth edition -- leaves Bell Labs, basis for BSD 1.x
  • 79: seventh edition -- one of the best
  • 82: System III
  • 84: 4.2 BSD
  • 89: SVR4 unification of Xenix, BSD, System V
    • NT development begins

History of NT

  • Team forms 11/88
  • Six guys from DEC
  • One guy from MS
  • Built from the ground up
    • Advanced PC OS
    • Designed for desktop & server
    • Secure, scalable, SMP design
    • All new code
  • Schedule: 18 months (only missed our date by 3 years)

History of NT, cont.

  • Initial effort targeted at Intel i860 code-named N10, hence the name NT which doubled as N-Ten and New Technology
  • Most dev done on i860 simulator running OS/2 1.2
  • Microsoft built a single board i860 computer code-named Dazzle, including the supporting chipset; ran full kernel, memory management, etc. on the machine
  • Compiler came from Metaware with weekly UUCP updates sent to my Sun-4/200
  • MS wrote a PE/Coff linker and a graphical cross debugger

Design longevity

  • OS code has a long lifetime
  • You have to base your OS on solid design principles
  • You have to set goals; not everything can be at the top of the list
  • You have to design for evolution in hardware, usage patterns, etc.
  • Only way to succeed is to base your design on a solid architectural foundation
  • Development environments never get enough attention

Goal setting

  • First job was to establish high level goals
    • Portability: ability to target more than one processor, avoid assembler, abstract away machine dependencies. Purposely started the i386 port very late to avoid falling into a typical Microsoft x86 centric design
    • Reliability: nothing should be able to crash the OS. Anything that crashes the OS is a bug. Very radical thinking inside MS considering Win16 was co-operative multi-tasking in a single address space, and OS/2 had similar attributes with respect to memory isolation
    • Extensibility: ability to extend OS over time
    • Compatibility: with DOS, OS/2, POSIX, or other popular runtimes; this is the foundation work that allowed us to invent Windows two years into NT OS/2 development
    • Performance: all of the above are more important than raw speed!

NT OS/2 design workbook

  • Design of executive captured in functional specs
  • Written by engineers, for engineers
  • Every functional interface was defined and reviewed
  • Small teams can do this efficiently
    • Making this process scale is an almost impossible challenge
    • Senior developers are inundated with spec reviews and the value of their feedback becomes meaningless
    • You have to spread review duties broadly and everyone must share the culture

Developing a culture

  • To scale a dev team, you need to establish a culture
    • Common way of evaluating designs, making tradeoffs, etc.
    • Common way of developing code and reacting to problems (build breaks, critical bugs, etc.)
    • Common way of establishing ownership of problems
  • Goal setting can be the foundation for the culture
  • Keeping culture alive as a team grows is a huge challenge

The NT culture

  • Portability, reliability, security, and extensibility ingrained as the team's top priority
    • Every decision was made in the context of these design goals
  • Everyone owns all the code, so whenever something is busted anyone has a right and a duty to fix it
    • Works in small groups (< 150 people) where people cover for each other
    • Fails miserably in large groups
  • Sloppiness is not tolerated
    • Great idea, but very difficult to nurture as group grows
    • Abuse and intimidation get way out of control; you can't keep calling people stupid and expect them to listen
  • A successful culture has to accept that mistakes will happen

NT 3.1 vs. Windows 2000

  • Dev teams
  • Source control
  • Process management
  • Serialized development
  • Defects

Development team

  • NT 3.1
    • Starts small (6), slowly grows to 200 people
    • NT culture was commonly understood by all
  • Windows 2000
    • Mass assimilation of other teams into the NT team
    • NT 4.0 had 800 developers, Windows 2000 had 1400
    • Original NT culture practiced by the old timers in the group, but keeping the culture alive was difficult due to growth, physical separation, etc.
    • Diluted culture leads to conflict
      • Accountability: I don't "own" the code that is busted, see Mark!
      • Reliability vs. new features
      • 64-bit portability vs. new features

Source control system (NT 3.1)

  • Internally developed, maintained by a non-NT tools team
    • No branch capability, but not needed for small team
  • 10-12 well isolated source "projects", 6M LOC
  • Informal project separation worked well
    • minimal obscure source level dependencies
  • Small hard drive could easily hold entire source tree
  • Developer could easily stay in sync with changes made to the system

Source control system (Windows 2000)

  • Windows team takes ownership of source control system, which is on life support
  • Branch capability sorely needed, tree copies used as substitutes, so merging is a nightmare
  • 180 source "projects", 29M LOC
  • No project separation, reaching "up and over" was very common as developers tried to minimize what they had to carry on their machines to get their jobs done
  • Full source base required about 50Gb of disk space
  • To keep a machine in sync was a huge chore (1 week to set up, 2 hours per day to sync)

Process management (NT 3.1)

  • Safe sync period in effect for 4 hours each day; all other times, the rule is check-in when ready
  • Build lab syncs during morning safe sync period, which starts a complete build
    • Build breaks are corrected manually during the build process (1-2 breaks were normal)
  • Complete build time is 5 hours on 486/50
  • Build is boot tested with some very minimal testing before release to stress testing
    • Defects corrected with incremental build fixes
  • 4pm, stress testing on ~100 machines begins

Process management (Windows 2000)

  • Developers not allowed to change source tree without explicit, email/written permission
    • Build lab manually approves each check-in using a combination of email, web, and a bug tracking database
  • Build lab approves about 100 changes each day and manually issues the appropriate sync and build commands
    • Build breaks are corrected manually; when they occur, all further build processing is halted
    • A developer that mistypes a build instruction can stop the build lab, which stops over 5000 people
  • Complete build time is 8 hours on 4-way PIII Xeon 550 with 50Gb disk and 512k cache
  • Build is boot tested and assuming we get a boot, extensive baseline testing begins
    • Testing is a mostly manual, semi-automated process
    • Defects occurring in the boot or test phase must be corrected before the build is "released" for stress testing
  • 4pm, stress testing on ~1000 machines begins

Team size

Product    Devs    Testers
NT 3.1      200        140
NT 3.5      300        230
NT 3.51     450        325
NT 4.0      800        700
Win2k      1400       1700

Serialized Development

  • The model from NT 3.1 to 2000
  • All developers on team check in to a single main line branch
  • Master build lab syncs to main branch and builds releases from that branch
  • Checked in defect affects everyone waiting for results

Defect rates and serialization

  • Compile time or run time bugs that occur in a dev's office only affect that dev
  • Once a defect is checked in, the number of people affected by the defect increases
  • Best devs are going to check in a runtime or compile time mistake at least twice a year
  • Best devs will be able to correct a checked-in compile time or run time break very quickly (20 minutes end-to-end)
  • As the code base gets larger, and as the team gets larger, these numbers typically double

Defect rates data

  • With serialized development
    • Good, small, teams operate efficiently
    • Even the absolute best large teams are always broken and always serialized
Product    Devs    Defects/dev-yr    Fix time/defect    Defects/day    Total fix time
NT 3.1      200          2                 20m                1              20m
NT 3.5      300          2                 25m              1.6              41m
NT 3.51     450          2                 30m              2.5             1.2h
NT 4.0      800          3                 35m              6.6             3.8h
Win2k      1400          4                 40m             15.3            10.2h
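
The slides don't give the formula behind this table, but the numbers are consistent with a simple model: defects per day is roughly devs * defects per dev-year / 365, and total fix time is roughly defects per day * fix time per defect. Here's a small sketch (mine, not from the slides) that reproduces the last two columns under that assumption.

    rows = [
        # (product, devs, defects per dev-year, fix minutes per defect)
        ("NT 3.1",   200, 2, 20),
        ("NT 3.5",   300, 2, 25),
        ("NT 3.51",  450, 2, 30),
        ("NT 4.0",   800, 3, 35),
        ("Win2k",   1400, 4, 40),
    ]
    for product, devs, per_dev_yr, fix_minutes in rows:
        defects_per_day = devs * per_dev_yr / 365       # checked-in breaks per day
        fix_hours_per_day = defects_per_day * fix_minutes / 60
        print(f"{product:8s} {defects_per_day:4.1f} defects/day,"
              f" {fix_hours_per_day:4.1f} hours of fix time per day")

By Win2k that works out to roughly ten hours a day of expected broken-build time for everyone downstream, which is the problem the "focused fixes" slides below set out to solve.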

Dev environment summary

  • NT 3.1
    • Fast and loose; lots of fun & energy
    • Few barriers to getting work done
    • Defects serialized parts of the process, but didn't stop the whole machine; minimal downtime
  • Windows 2000
    • Source control system bursting at the seams
    • Excessive process management serialized the entire dev process; 1 defect stops 1400 devs, 5000 team members
    • Resources required to build a complete instance of NT were excessive, giving few developers a way to be successful

Focused fixes

  • Source control
  • Source code restructuring
  • Make the large team work like a set of small teams
    • Windows is already organized into reasonable sized dev teams
    • Goal is to allow these teams to work as a team when contributing source code changes rather than as a group of individuals that happen to work for the same VP
    • Parallel development, team level independence
  • Automated builds

Source control system

  • New system identified 3/99 (SourceDepot)
  • Native branch support
  • Scalable high speed client-server architecture
  • New machine setup 3 hours vs. 1 week
  • Normal sync 5 minutes vs. 2 hours
  • Transition to SourceDepot done on live Win2k code base
  • Hand built SLM -> SourceDepot migration system allowed us to keep in sync with the old system while transitioning to SourceDepot without changing the code layout.

Source code restructuring

  • 16 depots, each covering a major area of the source code
  • Organization is focused on:
    • Minimizing cross project dependencies to reduce defect rate
    • Sizing projects to compile in a reasonable amount of time
    • To build a project, all you need is the code for that project and the public/root project
    • Cross project sharing is explicit

New tree layout

  • The new tree layout features
    • Root project houses public
    • 15 additional projects hang off the root
    • No nested projects
    • All projects build independently
    • Cross project dependencies resolved via public, public/internal using checked-in interfaces

Team level independence

  • Each team determines its own check-in policy, enabling rapid, frequent check-ins
  • Teams are isolated from mistakes by other teams
    • When errors occur, only the team causing the error is affected
    • A build, boot, or test break only affects a small subset of the product group
  • Each team has their own view of the source tree, their own mini build lab, and builds an entire installable build
  • Any developer with adequate resources can easily duplicate a mini build lab
    • Build and release a completely installable Windows system
  • Teams integrate their changes into the "main" trunk one at a time, so there is a high degree of accountability when something goes wrong in "main"
  • Build breaks will happen, but they are easily localized to the branch level, not the main product codeline
  • Teams are isolated from mistakes made by other teams
    • When errors occur, they affect smaller teams
    • A build, boot, or test break only affects a small subset of the Windows development team
  • Each team has their own view of the source tree and their own mini build lab
    • Each team's lab is enlisted in all projects and builds all projects
    • Each team needs resources able to build an NT system
  • Each team's build lab builds, tests, and mini-bvt's a complete standalone system

Automated builds

  • Build lab runs 100% hands off
  • 10am and 10pm full sync and full build
    • Build failures are auto detected and mailed to the team
    • Successful builds are automatically released with automatic notification to the team
  • Each VBL can build:
    • 4 platforms (x86 fre/chk, ia64 fre/chk) = 8 builds/day, 56/week
    • No manual steps at all
    • 7 VBLs in Win2k group
    • Majority of builds work, but failures when they occur are isolated to a single team

Productivity gains

  • Developers can easily switch from working on release N to release N+1
  • Developers in one team will not be impacted by mistakes/changes made by other teams
  • Developers have long, frequent checkin windows (Base team has 24x7 checkin window with manual approval used during Win2k)
  • Source control system is fast and reliable
  • Testing is done on complete builds instead of assorted collections of private binaries
    • What is in the source control system is what is tested